
Spark and Storm: Performance and Use Case Comparison

Jessica Morris, Partap Singh

May 2, 2016

INFO 491: Big Data Analytics


EXECUTIVE SUMMARY

INTRODUCTION

Apache Spark

Apache Storm

EVALUATION CRITERIA

Data Processing

Latency

Reliability

Ease of Use

Application of Use

CONCLUSIONS

Visual Review of Data

Final Overview


Executive Summary

The purpose of this paper is to explore the differences between the Hadoop toolkit technologies Apache Spark and Apache Storm. We use various online sources for this comparison in order to understand the real-life usage of these technologies. The report concludes that Apache Spark and Apache Storm have very different use cases, and that their strengths and weaknesses should be carefully considered when deciding between the two in an overlapping use case. In general, Apache Storm is stronger at real-time stream processing, while Apache Spark is a very strong and fast batch processing technology. When the situation calls for real-time streaming, it is our recommendation that Apache Storm be used. If the situation is more generalized, or can tolerate higher-latency batch processing, Apache Spark is more appropriate.

Unlike Hadoop, both Storm and Spark are new and still-evolving technologies in the business world, so a limited amount of information was available at the time this paper was written. This paper is written with readers in mind who are new to these technologies, and a simple, holistic approach has been used to compare them. It is not meant to be a reference for coding or for any programming language.


Introduction

In the world of computing, the necessity of big data processing is more visible now than ever. Technologies such as Apache Hadoop provide frameworks for many types of big data computing, and two technologies that exist within this framework are Spark and Storm. Both of these tools can analyze big data in real time. Such analysis used to require expensive batch operations with high latencies. As technologies continue to grow, pushing both the physical and logical limits of computing, the demand for alternative ways of analyzing big data has risen. Both Spark and Storm use distributed processing and real-time processing to meet these demands and improve the implementation of business intelligence analytics.

Using the following criteria, this paper will analyze the use cases and strengths of Spark and Storm.

Data Processing

Latency

Reliability

Ease of Use

Application of Use

It should be noted that while Spark and Storm share some use cases and have an

overlap of technologies used, they both also have a unique set of strengths and

weaknesses. This means that there is no clear overall ‘winner’, only certain applications

in which Spark may be more appropriate than Storm or vice versa. In many cases, the

difference boils down to developer or analyst preference.

Both Spark and Storm are essential parts of the Apache Hadoop toolkit. It is

pioneering technologies like Spark and Storm that will pave the future for big data

analytics. Distributed computing and real time data processing are both technologies

with great potential and are very much worth discussing.


Apache Spark

Spark’s official website describes Apache Spark as a “fast and general engine for

large-scale data processing.” Businesses such as Amazon, eBay, Yahoo!, and TripAdvisor

all utilize Spark’s technology. This usage is often for ‘real time’ predictive analytics,

profiling and personalization. Many businesses in the financial, health services, and

entertainment industries make use of Spark as well.

The technology is founded on the concept of distributed datasets, which contain Scala, Java or Python objects. Spark is a flexible framework that can handle both batch and real-time data processing. Spark Streaming adds streaming capabilities similar to Apache Storm's as well, but where Spark truly shines is in its ability to process data files using an efficient distributed computing process across large clusters of nodes.

In many ways, Spark was designed as an improvement on MapReduce. One key advantage of Spark is that it provides an excellent interface for distributed programming and cluster programming; parallel data processing and fault tolerance are implicit in Spark's design. These components were developed in response to inherent limitations of MapReduce. Spark is built around the resilient distributed dataset (RDD), which is read-only and designed for distribution across clusters. The result is a much more natural way of processing data compared to other tools. Some inherent advantages of Spark are:

Fault tolerance

Scalability: it can run on clusters with thousands of nodes

Speed: efficient batch processing, 10x-100x faster (in memory) than MapReduce

Flexibility
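The read-only nature of RDDs and their fault tolerance can be illustrated with a small pure-Python sketch (a conceptual stand-in, not Spark's actual API): a dataset is never mutated in place; instead, each transformation records a recipe for recomputing its output, so a lost result can simply be rebuilt by replaying its lineage.

```python
# Conceptual lineage sketch (not Spark's API): each dataset stores a recipe,
# so any result can be recomputed from the base data after a failure.
class Dataset:
    def __init__(self, compute):
        self._compute = compute          # a recipe, not materialized data

    def map(self, fn):
        # Return a NEW dataset; the parent is never modified (read-only).
        return Dataset(lambda: [fn(x) for x in self._compute()])

    def collect(self):
        return self._compute()           # (re)compute by replaying the lineage

base = Dataset(lambda: [1, 2, 3])
doubled = base.map(lambda x: x * 2)

print(doubled.collect())  # -> [2, 4, 6]
# Simulating loss of the computed result: collect() again simply replays the
# lineage from the base data, which is how an RDD recovers a lost partition.
print(doubled.collect())  # -> [2, 4, 6]
```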

Older technologies such as MapReduce break data into smaller blocks for easier processing and write intermediate results to disk. By contrast, Spark can keep an entire job in memory, allowing for better parallelism. Spark works well with many of the major programming languages used in data analysis, such as Python, Java, and Scala, Spark's native programming language.
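The map-then-aggregate pattern that Spark generalizes can be sketched in a few lines of plain Python (conceptual only; the "partitions" here are just lists, not Spark's distributed data): each partition is counted independently, as parallel workers would do, and the partial results are then merged.

```python
from collections import Counter

# Hypothetical in-memory "partitions" of a text dataset (illustrative data only).
partitions = [
    ["spark keeps intermediate data in memory"],
    ["mapreduce writes intermediate data to disk"],
]

# Map side: count words within each partition independently (parallelizable).
partial = [Counter(w for line in p for w in line.split()) for p in partitions]

# Reduce side: merge the partial counts, analogous to a reduce-by-key step.
total = sum(partial, Counter())
print(total["intermediate"])  # -> 2
```

In real Spark, the merge happens across the cluster, but the functional shape of the computation is the same.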


Spark is built to process continuous, query-style workloads. Because Spark was designed for ease of use and efficient processing, it is a market leader among streaming technologies, beating out many more specialized frameworks. Spark was built as a generalized way of processing data, making it inherently flexible: it can be applied to anything from distributed data analytics to machine learning.

Hive, Pig, Sqoop and many other Hadoop ecosystem tools work well with Spark, and in many cases the combination of these tools can replace MapReduce completely. Given its unique approach to analytics and its synergy with many existing applications, programming languages, and frameworks, Spark is a rich addition to the Hadoop toolkit. Spark is a robust batch processing framework with the added capability of micro-batching through extensions such as Spark Streaming.

Apache Storm

Storm is a free and open source distributed real-time computation system for processing large volumes of streaming data in parallel and at scale. Unlike Hadoop MapReduce, which processes data in batches, Storm processes unbounded streams of data in real time. Because it is simple and easy to use with any programming language, many developers find Storm fun to use.

According to Hortonworks, the characteristics that make Storm ideal for real-time data processing workloads include:

Fast: a benchmark clocked it at one million 100-byte messages per second per node

Scalable: parallel calculations run across a cluster of machines

Fault-tolerant: if a node dies, its workers are restarted on another node

Reliable: Storm guarantees that each unit of data (tuple) will be processed at least once or exactly once

Business opportunities and use cases where Storm can be beneficial include real-time customer service management, data monetization, operational dashboards, cyber security analytics, threat detection, online machine learning, continuous computation, distributed RPC, ETL, and more.


Many well-known, established businesses use Apache Storm. Groupon, an online coupon company, uses Storm to build real-time data integration systems; Storm helps Groupon analyze, clean, normalize and resolve large amounts of non-unique data points with low latency and high throughput. The Weather Channel uses many Storm topologies to ingest and persist weather data. Twitter applies Storm across a wide variety of applications, from discovery and real-time analytics to personalization, search, and revenue optimization.

Evaluation Criteria

Data Processing

Apache Spark is designed to do more than plain data processing: it can also make use of existing machine learning libraries and process graphs. Due to its high performance, Apache Spark can be used for both batch processing and real-time processing. In a December 2014 blog post, "Storm or Spark," the developer Oliver compared the data processing capabilities of the two technologies; according to him, both have overlapping capabilities as well as distinctive features and roles to play.

In 2011, BackType, a marketing intelligence company later bought by Twitter, developed Storm as a distributed computation framework for event stream processing. Twitter soon open-sourced the project and put it on GitHub, but Storm ultimately moved to the Apache Incubator and became an Apache top-level project in September 2014. Often referred to as the Hadoop of real-time processing, Storm does for unbounded streams of data what Hadoop did for batch processing.

Storm is written primarily in Clojure and is designed to support wiring "spouts" (input streams) and "bolts" (processing and output modules) together as a directed acyclic graph (DAG) called a topology. Storm topologies run on clusters, and based on the topology configuration, the Storm scheduler distributes work to nodes around the cluster. MapReduce jobs in Hadoop and topologies in Storm do roughly the same job, except that Storm focuses on real-time, stream-based processing, and topologies default to running forever, or until manually terminated. Once a topology is started, the spouts bring data into the system and hand it off to bolts, where the main computational work is done. As processing progresses, one or more bolts may need to write data out to a database or file system, or send a message to another external system.
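The spout-and-bolt flow can be sketched in plain Python (a conceptual stand-in only; it is not Storm's actual API, where topologies are wired with a topology builder and run on a cluster): a generator plays the spout, and chained functions play the bolts.

```python
# Conceptual sketch of a spout -> bolt -> bolt pipeline (not Storm's real API).
def sentence_spout():
    """Spout: emits a stream of sentence tuples (a short finite one here)."""
    for line in ["storm processes tuples", "spouts feed bolts"]:
        yield line

def split_bolt(stream):
    """Bolt: splits each sentence tuple into word tuples."""
    for line in stream:
        yield from line.split()

def count_bolt(stream):
    """Terminal bolt: aggregates a running count per word."""
    counts = {}
    for word in stream:
        counts[word] = counts.get(word, 0) + 1
    return counts

# Wiring the "topology": the spout feeds the split bolt, which feeds the count bolt.
result = count_bolt(split_bolt(sentence_spout()))
print(result["tuples"])  # -> 1
```

In a real topology the same wiring is declared once and Storm runs it continuously across the cluster, one tuple at a time.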

One of the strengths of the Storm ecosystem is its rich array of available spouts, specialized for receiving data from all types of sources. Processed data in Storm is easy to integrate with HDFS file systems. And even though Storm is written in Clojure, a JVM language, it also supports multi-language programming.

Spark, meanwhile, started out as a project of AMPLab at the University of California, Berkeley. It became a very popular project for real-time distributed computation and ultimately an Apache top-level project in February 2014. Like Storm, Spark supports stream-oriented processing, but it is more of a general-purpose distributed computing platform. Because Spark can run on top of an existing Hadoop cluster, relying on YARN for resource scheduling, it can be seen as a potential replacement for Hadoop's MapReduce functions. Written in Scala, Spark supports multi-language programming, though it provides specific API support only for Scala, Java and Python. Spark does not have the specific abstraction of a "spout," but it includes adapters for working with data stored in numerous disparate sources, including HDFS files, Cassandra, HBase, and S3. Like Storm, Spark supports a streaming model, but this support comes from just one of several Spark modules, alongside purpose-built modules for SQL access, graph operations, and machine learning.

Spark does not process stream events one at a time like Storm. Instead, it slices the stream into small batches by time interval before processing them. DStream, the abstraction for a continuous stream of data, is a sequence of micro-batches of RDDs (Resilient Distributed Datasets): distributed collections that can be operated on in parallel by arbitrary functions and by transformations over a sliding window of data (windowed computations).
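The slicing-by-time-interval idea can be sketched in plain Python (conceptual only; the batch interval and events are made up, and this is not the Spark Streaming API): each event is assigned to the micro-batch covering its timestamp, and each batch is then processed as a small, finite dataset.

```python
# Conceptual sketch of micro-batching a stream by time interval (not Spark's API).
# Each event is (timestamp_seconds, value); the hypothetical batch interval is 2s.
events = [(0.2, "a"), (0.9, "b"), (2.1, "c"), (3.5, "d"), (4.0, "e")]
BATCH_INTERVAL = 2.0

batches = {}
for ts, value in events:
    # Assign each event to the micro-batch covering its timestamp.
    batch_id = int(ts // BATCH_INTERVAL)
    batches.setdefault(batch_id, []).append(value)

# Each micro-batch is then handed off for processing as one small dataset,
# much as a DStream yields one RDD per interval.
print([batches[i] for i in sorted(batches)])  # -> [['a', 'b'], ['c', 'd'], ['e']]
```

This also makes the latency trade-off visible: an event arriving at the start of an interval waits for the whole interval before its batch is processed, whereas a per-event system like Storm handles it immediately.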


Latency

Traditional data warehousing environments were expensive, and their batch operations had high latency. As a result, organizations were not able to embrace the power of real-time business intelligence and big data analytics. Today, many powerful open-source tools like Spark and Storm have emerged to overcome this challenge. Spark is a parallel open-source processing framework whose workflows resemble those of Hadoop MapReduce but are comparatively more efficient. One benefit of Apache Spark is that it does not require Hadoop YARN to function; it has its own streaming API and independent processes for continuous micro-batch processing across short time intervals. Spark runs up to 100 times faster than Hadoop in certain situations, but it does not have its own distributed storage system.

When talking about Spark's performance for data processing, we cannot ignore Hadoop MapReduce, because the two technologies often work hand in hand. Where Spark processes data in memory, Hadoop MapReduce persists data back to disk after each map or reduce action; as a result, Hadoop MapReduce lags behind Spark. The trade-off is that Spark requires a large amount of memory, since it loads processes into memory and keeps data cached. Spark collects events into small batches over short time windows before processing them, whereas Storm processes events one at a time. Thus, Spark has a latency of a few seconds, whereas Storm processes an event with millisecond latency.

Reliability

Spark and its RDD abstraction are designed to seamlessly handle failures of any worker node in the cluster. Spark Streaming, being built on Spark, enjoys the same fault tolerance for worker nodes, but applications reading from sources such as Kafka and Flume must also recover from failure of the driver process. For sources like files, recomputing each micro-batch of data is sufficient to ensure zero data loss, because the data resides in a fault-tolerant file system like HDFS. For other sources like Kafka and Flume, however, received data that was buffered in memory but not yet processed could be lost. To avoid this data loss, a write-ahead log was introduced in the Spark Streaming 1.2 release.
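The write-ahead log idea can be sketched in a few lines of plain Python (a conceptual illustration, not Spark's implementation; the "durable log" here is just a list standing in for a file on fault-tolerant storage): every received record is persisted before it is buffered, so a crashed receiver can replay whatever it had not yet processed.

```python
# Conceptual write-ahead log: persist each record before buffering it.
durable_log = []   # stands in for a file on a fault-tolerant store such as HDFS
buffer = []        # in-memory buffer that is lost on failure

def receive(record):
    durable_log.append(record)  # 1) write ahead to durable storage
    buffer.append(record)       # 2) only then buffer in memory for processing

for r in ["evt1", "evt2", "evt3"]:
    receive(r)

buffer.clear()                  # simulate losing the in-memory buffer in a crash
recovered = list(durable_log)   # on restart, replay unprocessed records from the log
print(recovered)  # -> ['evt1', 'evt2', 'evt3']
```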

Storm guarantees that every spout tuple will be fully processed by the topology. It does so by tracking the tree of tuples triggered by each spout tuple and determining when that tree has been successfully completed. Every topology has a "message timeout" associated with it; if Storm fails to detect that a spout tuple has completed within that timeout, it fails the tuple and replays it later. Two things must be done to benefit from Storm's reliability capabilities: first, the application must tell Storm whenever a new link in the tree of tuples is created, and second, it must tell Storm when an individual tuple has finished processing. With both of these in place, Storm can detect when the tree of tuples is fully processed and can acknowledge or fail the spout tuple appropriately.
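A toy version of this acknowledge-or-timeout bookkeeping can be written in plain Python (conceptual only; the timeout value and tuple IDs are invented and this is not Storm's tracker): emitted tuples are tracked until acked, and anything still pending past the timeout is flagged for replay.

```python
import time

# Conceptual tracker of pending spout tuples (not Storm's implementation).
MESSAGE_TIMEOUT = 0.05  # hypothetical message timeout, in seconds
pending = {}            # tuple_id -> time of emission

def emit(tuple_id):
    pending[tuple_id] = time.monotonic()

def ack(tuple_id):
    pending.pop(tuple_id, None)  # tuple tree fully processed: stop tracking it

def expired():
    """Tuples whose tree did not complete within the timeout: fail and replay."""
    now = time.monotonic()
    return [t for t, emitted in pending.items() if now - emitted > MESSAGE_TIMEOUT]

emit("t1"); emit("t2")
ack("t1")                # t1's tuple tree completed in time
time.sleep(0.06)         # t2 lingers past the timeout
print(expired())  # -> ['t2']
```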

Ease of Use

Both Apache Spark and Apache Storm were built on a founding principle of user accessibility. As such, both platforms are scalable and fault-tolerant, and both support a wide array of programming languages and host platforms such as Apache Hadoop. However, there are some differences.

For example, when developing for Spark, one must keep its base language, Scala, in mind. Scala is a JVM-based language designed to improve upon the shortcomings of Java. While Scala has been praised for its various features and adjustments, such as scalability (hence the name Scala), it is not as well known as Java, which can make things difficult for developers unfamiliar with the language. Furthermore, Spark's tuples are also based on Scala, and while they can be implemented in Java, this is known to be a difficult process.

There are aspects of Storm that have made it a favorite among some developers. Storm supports more programming languages than Spark: in addition to Java, Scala and Python, Core Storm also supports Clojure, Ruby, and many others. Because of this expanded support, Storm's tuples can be much easier for programmers to manage. Storm uses DAGs (directed acyclic graphs), allowing a very natural flow of data through Storm's tuples, with each node playing a role in transforming the data as it moves through the DAG. However, this can make compiling take a while, and sometimes leads developers to skip important safety checks.

Conversely, many features of Spark are much easier to implement than their counterparts on other platforms. Spark Streaming offers a language-integrated API, which lets developers write real-time jobs in much the same way as batch jobs. Many developers also make great use of Spark's interactive mode, which eases the programming process. While many developers prefer Storm, those who are comfortable with an interactive shell will choose Spark, as nothing comparable is available for Storm at this time. Thus, even with the aforementioned challenges, Spark remains one of the simpler solutions to the complex problem of real-time data.

Application of Use

Apache Spark and Storm are both powerful business intelligence platforms with a wide array of uses. Both platforms were developed for processing real-time data; however, the nuances of their application are many.

For example, Storm's event-at-a-time approach to real-time data makes it great for real-time log monitoring. Twitter originally developed Storm in 2011 for use in analytics. At Twitter, Storm treats clusters of tweets as a batch and, working with other tools such as Cassandra, analyzes each batch of tweets for data such as geographic location and user demographics, and determines whether a tweet is spam. Other companies, such as Spotify, use Storm's real-time strengths to recommend music to users based on listening history or to target advertisements to specific demographics. According to Spotify: "Storm enables us to build low-latency fault-tolerant distributed systems with ease."

Many industry professionals see Spark as a more general-purpose platform.


While Storm excels at analyzing user, log and stream data in real time, Spark is more of a real-time improvement on MapReduce. Being a much newer technology, Apache Spark is far less adopted in the industry, and there are fewer working use cases. Spark shines when it runs on top of an existing Hadoop cluster, layered on other technology such as Mesos, or supporting multiple processing paradigms. While Spark has a specialized streaming application in Spark Streaming, it is not yet as well supported as Storm. Spark is also not designed for a multi-user environment, so all users have to apportion data appropriately; if a project has many users, Storm may be the more appropriate choice.

This does not mean that Spark is unused. On the contrary, Spark has seen rapid adoption since its introduction to the Apache toolkit in 2014. In addition to streaming, Spark shows great potential in the areas of machine learning, interactive analysis, and fog computing. Uber uses Spark to convert unstructured data into structured data, and Netflix uses Spark much as Spotify uses Storm: to build real-time user prediction models and recommend movies based on viewing history.

Overall, while Storm is the older and more tested technology, there is a wide range of use cases for which either Spark or Storm is appropriate. Furthermore, the decision to use Spark or Storm for a task depends highly on the user base, the organizational structure, and what users wish to accomplish in their data processing.

Conclusions

Visual Review of Data

The following table contains a simplified review of the differences between Spark

and Storm. This table is intended to help establish appropriate use cases for Spark and

Storm as well as help potential users decide between one or both technologies.


1) Data processing

Storm:
- Mainly streaming, but does micro-batching as well (Trident API)
- Stream primitive is the tuple
- Spouts (input streams), bolts (processing and output modules), topologies
- Directed acyclic graph (DAG) topology
- Runs in a continuous loop unless aborted
- Strong point: rich array of available spouts to collect data
- Uses Apache ZooKeeper and worker processes to coordinate topologies
- Based on Clojure, but supports many languages such as Java, Scala, Python, Ruby, and more

Spark:
- Mainly batch processing, but also does micro-batching
- Uses DStream, an abstraction for a continuous stream of data as micro-batches of RDDs (Resilient Distributed Datasets)
- RDDs operated on in parallel
- Windowed computations
- Hadoop HDFS dependent (to store data after computation)
- APIs for Java, Scala, and Python only

2) Speed/Performance

Storm:
- Fast: a benchmark clocked it at over a million tuples processed per second per node
- Claimed to be 4 to 5 times slower than Spark on a word count benchmark
- Limited performance per server by stream-processing standards

Spark:
- Much faster (up to 100 times) than Hadoop MapReduce
- Processes data in memory and requires large amounts of memory
- Faster than Storm, but with limited performance

3) Reliability

Storm:
- Guarantees that each tree of tuples will be fully processed
- Uses a timeout system: every tuple is time-stamped, and an acknowledgement system detects any failure

Spark:
- Processes data in memory, hence volatile
- Uses a write-ahead log to track data processing and avoid data loss

4) Ease of use

Storm:
- Built on Clojure but flexible enough to work with many other languages, including Java, Scala and Python, making it very easy to work with
- Data processing flow is natural and intuitive (a spout collects the stream of data and bolts process it, writing results to HDFS)
- Usable by intermediate to advanced programmers

Spark:
- Built on Scala, Java and Python only
- Needs Java and Scala experts to utilize the technology, unless a GUI is used
- The GUI is beginner-friendly and easy to learn for those new to programming

5) Application of use

Storm:
- Examples: Twitter, Groupon, Google
- Strong real-time data processing capabilities
- Data streaming and analytics
- Event log monitoring, record keeping, filtering, lookup

Spark:
- Examples: Uber, Netflix, Pinterest
- Batching and micro-batching
- Distributed computing
- Machine learning
- Event detection
- Cluster computing
- Data analytics

Final Overview

Spark and Storm are very capable technologies in real- time analytics. Both of

the framework are on a streaming architecture, meaning there is an inbound, infinite list

of tuples. In this report, Spark and Storm has been evaluated on base of data processing,

speed, reliability, ease of use, and application of use. In Storm nomenclature, finite-size

tuples is processed through spouts and bolts. In Spark nomenclature, the stream is

called a D-Stream or a discretized stream. In Spark Streaming framework, data

processing engine essentially discretize and convert the stream into finite size RDDs,

which are essentially micro-batches to process messages. Both of the framework allow

stateful processing, that is if a worker node or driver node lose data, it can be recover

from the state. Resilient Distributed Dataset(RDD) is cornerstone of Spark that entails

finite-sized immutable data structures which can be replayed as needed and are

replicated over HDFS or any type of persistent file system.

Since both technologies have varying strengths and weaknesses, it is hard to recommend one over the other in a general sense. However, Spark and Storm have a large application domain within which specific uses may be appropriate. In general, industry professionals agree that Storm is currently the more useful real-time streaming technology, even though Spark Streaming is growing by leaps and bounds. Storm has a specialized toolset that gives it strong stream-processing capabilities, while Spark is very generalized and has developed in such a way that it may replace MapReduce in the future. Although the use cases for Spark and Storm have some overlap, the platforms are very distinct. If one desires stream-based technology for analytic purposes, and is comfortable with programming, Storm is the better choice. If one would like to use cluster computing or batching technology, or is less comfortable with programming, then Spark is the better choice. Overall, both Spark and Storm are fantastic ways of achieving lightning-fast, reliable and easy-to-learn data processing.
