APACHE SPARK: PREPARING FOR THE NEXT WAVE OF REACTIVE BIG DATA
-
CONTENTS
Foreword
Apache Spark Survey 2015 - Quick Snapshot
INTRODUCTION: Is Apache Spark the Future in Reactive Big Data?
CHAPTER 2: The People and Organizations Interested in Apache Spark
CHAPTER 3: What Goals Do Organizations Hope to Achieve with Apache Spark?
CHAPTER 4: How Organizations Use Spark Today
CHAPTER 5: Barriers, Concerns and Support Desires Expressed by Respondents
Final Thoughts
-
FOREWORD BY MATEI ZAHARIA, CREATOR OF APACHE SPARK
I'm very excited to see this survey, built with Typesafe, that represents the largest poll of Spark developers yet. Apache Spark has rapidly been gaining traction over the past few years, and I'm thrilled to see the wide variety of use cases and environments where it is being deployed. This survey of over 2100 developers alone highlights that over 500 enterprises are using or planning to use Spark in production in 2015, in environments ranging from Hadoop clusters to public and private clouds, with data sources including key-value stores, databases, streaming data and file systems. Their use cases range from batch workloads to SQL queries, stream processing and machine learning, highlighting Spark's unique capability as a simple, unified platform for data processing.
At Databricks and within the Spark community, this type of feedback is critical in helping us continue to enhance Spark for many more use cases and make Big Data simpler for enterprises of all sizes.
Matei Zaharia CTO at Databricks and Vice President, Apache Spark @matei_zaharia
-
APACHE SPARK SURVEY 2015 - QUICK SNAPSHOT
RESPONDENTS: 74% Developers, 8% Data Scientists, 7% C-level execs
TOP 3 INDUSTRIES: Telecoms, Banks, Retail
TOP 3 LANGUAGES USED WITH SPARK: 88% Scala, 44% Java, 22% Python
13% are running Spark in production
31% are evaluating Spark now
20% are planning to use Spark in 2015
82% of users chose Spark to replace MapReduce
78% of users need faster processing of larger data sets
67% of users need Spark for event stream processing
62% of users load data into Spark with Hadoop DFS
54% of users run Spark standalone
-
CHAPTER 1: INTRODUCTION
Is Apache Spark the Future in Reactive Big Data?
-
INTRODUCTION
Back in the summer of 2014, we launched the results of a survey on Java 8, which gave us much of the information we were looking for but also contained a small, golden nugget of data that we didn't expect: out of more than 3000 developers surveyed, a shocking 17% reported using Apache Spark in production. Whoa.
Apache Spark is a fast and general engine for large-scale data processing built using Scala and Akka, two technologies among many that we at Typesafe recommend for building Reactive systems. Notice that "fast" is emphasized in the Spark description? As we've learned, it's actually not the size but rather the speed, or velocity, of the data that is the challenge. So why Scala and Akka, you ask? You can refer to this posting by Matei for his full answer.
With this foundation in mind, it made a lot of sense to learn more. So we asked a total of 2136 respondents about Spark awareness and adoption, the most-demanded features/modules, and how organizations use Spark in production today. We partnered with Databricks (also founded by Matei) in order to bring full lifecycle support for Apache Spark to Typesafe customers.
We think of this next phase of technology as Reactive Big Data. But whatever you call it, it's already here.
When we started Spark, we had two goals: we wanted to work with the Hadoop ecosystem, which is JVM-based, and we wanted a concise programming interface similar to Microsoft's DryadLINQ (the first language-integrated Big Data framework I know of, that begat things like FlumeJava and Crunch). On the JVM, the only language that would offer that kind of API was Scala, due to its ability to capture functions and ship them across the network. Scala's static typing also made it much easier to control performance compared to, say, Jython or Groovy.
Matei Zaharia CTO at Databricks and Vice President, Apache Spark @matei_zaharia
-
CHAPTER 2: WHO IS GETTING FIRED UP OVER SPARK?
The People and Organizations Interested in Apache Spark
-
WHAT BEST DESCRIBES YOUR ROLE?
The respondents who joined our survey generally adhere to the common technology industry demographics: a vast majority of software developers (74%) along with a smattering of other professionals. However, rather than having a more sizeable segment of Architects (3.5%), we can see higher representation of Data Scientists (7.5%) and C-level Executives (6.5%), which clearly speaks to the ripple effect that Big Data has across an organization.
The industry verticals in which respondents place themselves are fairly varied. The largest groups, Telcos (16%), Banks (12%), Retailers (11%), Software/Tech (10%) and Advertising (9%), are all huge consumers of complex data sets, and their business models often depend on crunching real-time data for reactive decision making at times of peak traffic/usage.
JOB TYPE/ROLE
74% Developer
7.5% Data Scientist
6.5% C-Level Executive
3.5% Software Architect
3.5% Dev Ops
1% Business Analyst
6.5% Other
INDUSTRY FOCUS
16% Telecommunications / Networks
12% Banking / Finance
11% Retail
10% Software / Technology
9% Advertising
5% Consulting
4% Healthcare / Insurance
33% Other (including Biotechnology/Chemistry, Machinery, Education, Government and Utilities and other sectors)
-
WHICH OF THE FOLLOWING TECHNOLOGIES DO YOU USE FOR YOUR PRODUCTION INFRASTRUCTURE?
We see quite a lot of complementary technologies in this breakdown of production infrastructure tools, from IaaS/PaaS to frameworks and containers. The market has settled on Amazon EC2 (53%), with Docker (34%) and Cloudera CDH (22%) also retaining good market shares. From relative obscurity just 2 years ago, it's interesting to see multi-functional Ansible (16%) appear in the mix. Mesos (14%) and OpenStack (13%) haven't always been so close in market share, so it will be interesting to see where things head in 2015-16.
In the end, we are receiving self-reported statistics from a sample population that includes mainly developers, so it's not always clear if this question was interpreted as "have you ever seen this technology appear in your organization in any form?" as opposed to confirmed instances of enterprise-wide production usage.
INFRASTRUCTURE TECHNOLOGIES IN USE
53% Amazon EC2
34% Docker
22% Cloudera CDH
16% Ansible
14% Mesos
13% OpenStack
12% Apache.org Builds of Hadoop
10% HortonWorks HDP
10% Heroku
8% Google Compute Engine
7% CoreOS
7% MapR Hadoop Distribution
6% Microsoft Azure
5% Marathon
4% Kubernetes
2% Aurora
11% Other XaaS
-
CHAPTER 3: A NEW HOPE
What Goals Do Organizations Hope to Achieve with Apache Spark?
-
WHICH BEST DESCRIBES YOUR COMPANY'S INTEREST (OR AWARENESS) WITH SPARK?
A solid majority of respondents (72%) have at least some experience with Apache Spark, and a total of 35% are currently using it or planning to use it this year (or next). Notably, the largest single segment (31%) is currently evaluating Spark, but since 28% had never heard of Spark at the time of this survey (funnily, this group is now 0%!), there is still a ways to go. But trends can be discerned, both in buzz and adoption, from sources as varied as this survey and Google Trends.
That said, a similar linear trend exists for searches like "Hadoop" and "Big Data", so while Spark might defeat Hadoop in the processing power and event streaming areas, it is also designed to cooperate very well with Hadoop (both are Apache Foundation projects, after all). This is no secret; the creators of Spark, who later founded Databricks, speak directly to the complementary relationship between Hadoop and Spark in a January 2014 blog post.
CURRENT RELATIONSHIP WITH SPARK
31% Evaluating Spark now
28% Um, what's Spark?
20% Planning to use in 2015
13% Currently using in production
6% Evaluated, not planning to use
2% Evaluated, will use in 2016 or later
GOOGLE TRENDS - APACHE SPARK INTEREST OVER TIME
(Chart not reproduced; time axis labeled 2011 and 2013.)
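As a quick sanity check, the headline figures quoted for this question can be recomputed from the chart segments. The sketch below is only an illustration: the assignment of 6% and 2% to the two "Evaluated" categories is an assumption based on the order of the chart labels, and reading the 35% figure as production users plus both "planning" groups is our interpretation, not something the survey states.

```python
# Shares (percent of respondents) read from the "Current relationship with
# Spark" chart. ASSUMPTION: the 6% / 2% split between the two "Evaluated"
# categories follows the order in which the chart labels appear.
relationship = {
    "Evaluating Spark now": 31,
    "Um, what's Spark?": 28,
    "Planning to use in 2015": 20,
    "Currently using in production": 13,
    "Evaluated, not planning to use": 6,
    "Evaluated, will use in 2016 or later": 2,
}

# All segments together should cover every respondent.
total = sum(relationship.values())

# "At least some experience": everyone except those who had not heard of Spark.
aware = total - relationship["Um, what's Spark?"]

# ASSUMPTION: "using or planning to use it this year (or next)" combines
# production users with both groups that plan to adopt.
using_or_planning = (
    relationship["Currently using in production"]
    + relationship["Planning to use in 2015"]
    + relationship["Evaluated, will use in 2016 or later"]
)

print(total, aware, using_or_planning)
```

Under these assumptions the segments add up to the whole sample, and the two derived values line up with the 72% and 35% quoted in the text.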
-
WHAT PROBLEMS ARE YOU TRYING TO SOLVE WITH SPARK THAT OTHER TOOLS DON'T SOLVE?
Respondents' most prevalent goals focus on gains in processing speed, which is indeed one of Spark's most exciting benchmarks: recent Spark in-memory performance tests showed it could process data at up to 100x the speed of Hadoop MapReduce. However, users are also excited to implement event stream processing, which was not possible with previous batch-oriented technologies. As Typesafe CTO Jonas Bonér explains in his 2015 tech trends article on Wired.com, it's the velocity of data that concerns most organizations, not the size.
"Most so-called Big Data problems today are actually better described in the context of velocity instead of size. You want Fast Data. Speed is the problem to solve, not size."
Jonas Bonér, CTO, Typesafe @jboner
BUSINESS GOALS IN MIND
78% Fast Batch Processing of Large Data Sets
60% Support for Event Stream Processing
56% Fast Data Queries in Real Time
55% Improved Programmer Productivity
-
WHICH OF THE FOLLOWING SPARK FEATURES OR MODULES ARE MOST LIKELY TO SOLVE YOUR BIG DATA CHALLENGES?
As you might predict, the Spark Core API as a replacement for MapReduce (82%) and, to a lesser extent, Spark Streaming (65%) are seen as the biggest benefits of adoption, highlighting the shortcomings of MapReduce in terms of API friendliness, sheer performance and event streaming. Spark's MLlib (59%) and SparkSQL (51%) modules are smaller priorities, and GraphX (25%) seems like a distant goal for most.
SPARK FEATURES/MODULES IN DEMAND
82% Core API as a Replacement for MapReduce
65% Streaming Library (Spark Streaming)
59% Machine Learning Library (MLlib)
51% Integrated SQL (SparkSQL)
25% Graph Algorithms Library (GraphX)
Spark uses sophisticated caching of intermediate data in memory between processing steps, considerably improving the performance of applications compared to comparable MapReduce implementations. Compared to the MapReduce API, the Spark API is amazingly intuitive, providing concise, expressive operations that are often needed for analytics. So, in addition to addressing a wider class of problems, Spark is improving the productivity of developers who use it.
Dean Wampler Author & Big Data Expert, Typesafe @deanwampler
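To make the two points in this quote a little more concrete, here is a minimal plain-Python analogy. It is not Spark code and uses no Spark API: the function names and sample data are hypothetical, `functools.lru_cache` merely stands in for the idea of keeping an intermediate result in memory, and the explicit map/shuffle/reduce function only mimics the shape of a MapReduce job.

```python
from collections import Counter, defaultdict
from functools import lru_cache

# Hypothetical sample input; a tuple so it can be used as a cache key.
LINES = ("spark is fast", "spark is concise", "mapreduce is batch")


def word_count_mapreduce(lines):
    """Word count written as explicit map, shuffle and reduce phases."""
    mapped = [(word, 1) for line in lines for word in line.split()]  # map
    groups = defaultdict(list)
    for word, one in mapped:  # shuffle: group values by key
        groups[word].append(one)
    return {word: sum(ones) for word, ones in groups.items()}  # reduce


def word_count_concise(lines):
    """The same result as one expressive operation."""
    return dict(Counter(word for line in lines for word in line.split()))


@lru_cache(maxsize=None)
def tokenized(lines):
    """Intermediate data set, computed once and then kept in memory."""
    return tuple(word for line in lines for word in line.split())


def distinct_words(lines):
    """First downstream step; triggers the tokenization."""
    return len(set(tokenized(lines)))


def longest_word(lines):
    """Second downstream step; reuses the cached tokens."""
    return max(tokenized(lines), key=len)


print(word_count_mapreduce(LINES) == word_count_concise(LINES))
print(distinct_words(LINES), longest_word(LINES))
```

Both word-count functions return the same dictionary; the difference is how much plumbing the caller has to write. The two downstream steps share one tokenization pass, which is the in-memory reuse the quote describes, here on a single machine rather than a cluster.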
-
HOW WILL YOU USE SPARK TO PROCESS YOUR DATA?
When it comes to data sources used by Spark, there is a reasonable amount of variance. Event stream processing (67%), clearly a priority, remains a focus for two-thirds of respondents; a further breakdown of this aspect is presented on this page. The rest of these priorities speak to current legacy systems: developers will use Spark as a replacement for MapReduce in traditional batch mode applications, including ETL (61%) jobs for moving, cleaning, and re-formatting data sets, and this will affect the rest of data processing methods as well.
Many respondents feel that event stream processing will be a killer feature of Spark and see it helping their data pipeline as a whole (71%). Together with the desire to extract information from data sooner rather than later (65%), this seems to encourage the evolution towards Reactive systems with Big Data at the heart of it all. Automating decision making at runtime (which sounds a bit to us like continuous deployment) is also something that about 40% of respondents consider as data velocity increases.
DATA PROCESSING WITH SPARK
67% Event Stream Processing
61% ETL Data from External Sources
59% Ad-hoc Queries and Reporting
46% Write Data to Hadoop Distributed File System (HDFS)
46% SQL Queries and Business Intelligence
41% Static Reports
39% Read or Write Data to One or More Databases
Event stream processing, further breakdown:
71% Use Spark as Part of a Larger Data Pipeline
65% Extract Information from Data Sooner Rather than Later
40% Automate Decision Making at Runtime
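For readers who have not worked with event streams, the idea of extracting information "sooner rather than later" can be sketched in a few lines of plain Python. This is an illustration only, not Spark Streaming code: the helper names and the click events are hypothetical, and batches are cut by event count here for simplicity, whereas Spark Streaming groups incoming events by a time interval.

```python
from collections import Counter
from itertools import islice


def micro_batches(events, batch_size):
    """Yield consecutive lists of at most batch_size events."""
    iterator = iter(events)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def running_counts(events, batch_size):
    """Yield cumulative counts per event type after each small batch."""
    totals = Counter()
    for batch in micro_batches(events, batch_size):
        totals.update(batch)
        yield dict(totals)  # a result is available before the stream ends


# Hypothetical click stream; any iterable of event types works.
clicks = ["view", "view", "buy", "view", "buy"]
for snapshot in running_counts(clicks, batch_size=2):
    print(snapshot)
```

Each snapshot becomes available as soon as its batch has been processed, so a downstream rule (for example, an automated decision at runtime) could act on partial results instead of waiting for a nightly batch job.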
-
CHAPTER 4: APACHE SPARK IN USE
How Organizations Use Spark Today
-
WHICH PROGRAMMING LANGUAGES ARE IMPORTANT TO YOUR SPARK INSTALLATION?
Considering that Apache Spark was designed with Scala and Akka, it's not surprising that the earliest users of this technology would be focused on Scala (88%). That said, as Spark adoption goes more mainstream on the JVM, we expect Java (44%) to increase in priority over time. Python (22%) is used by just over one-fifth of users, and is the 3rd language after Scala and Java that Spark documentation has prioritized. Other languages that users would like to see supported include R, loved by data scientists and statisticians, plus Clojure, Groovy, Ruby and Go.
WHICH LANGUAGES ARE IMPORTANT TO YOUR SPARK INSTALLATION?
1st Scala 88%
2nd Java 44%
3rd Python 22%
Honorable mentions: R, Clojure, Groovy, Ruby & Go
-
WHERE ARE YOU RUNNING SPARK CURRENTLY?
Standalone (54%) and Local mode (29%) installations of Spark seem logical for early users with different testing purposes, and one can always add to a cluster later. Otherwise, YARN (42%), the resource manager introduced with Hadoop 2 (sometimes loosely called MapReduce 2), and Mesos (26%) are the general go-to choices for integrating and running Spark with current systems. Cassandra (20%) is another Apache project that not only integrates well with Spark's event streaming power, but shares a similar vision of supporting highly responsive, resilient, elastic systems. Also mentioned by about 3% of respondents is Amazon Elastic MapReduce.
WHERE DO YOU RUN SPARK?
54% Standalone
42% YARN
29% Local Mode
26% Mesos
20% Cassandra
-
HOW DO YOU LOAD YOUR DATA INTO SPARK?
When it comes to data loading, respondents take from a wide spectrum of technologies: from DBs to messaging and file systems to plain socket connections, almost anything goes. The winner here is HDFS (62%), which makes perfect sense: the things users cannot get done with Hadoop are designed to be ported over to Spark to finish the job, again emphasizing the complementary nature of these two technologies. Unspecified databases (46%) are in use by almost half of respondents, and Apache Kafka (41%) is a hot messaging broker built by LinkedIn using Scala in 2011 that now integrates with Spark's event streaming capabilities. Amazon S3 comes in at 29%, little surprise considering Amazon's infrastructure dominance with EC2 and their fairly comprehensive stack portfolio.
HOW DO YOU LOAD DATA INTO SPARK?
62% Hadoop Distributed File System (HDFS)
46% Databases
41% Apache Kafka
29% Amazon S3
18% Other Services (e.g. over socket connection)
12% Other*
*Including: Apache Cassandra, Amazon Kinesis and Apache HBase
-
CHAPTER 5: SO WHAT'S THE DELAY IN ADOPTION?
Barriers, Concerns and Support Desires Expressed by Respondents
-
WHAT IS YOUR BIGGEST BARRIER TO USING SPARK EFFECTIVELY?
Here we get to analyze hundreds of write-in answers by hand... fun! We found the write-in answers to be generally legible and only occasionally off-topic mumbo jumbo (e.g. something about tabs vs. spaces). We asked about barriers to using Spark effectively at this time, then manually clustered the answers into sentiment categories, if you will.
"Low awareness / experience" makes sense, since Spark adoption is still growing. A year from now, we predict that awareness of Spark will be considerably higher and no longer considered a barrier to adoption or use.
"Current requirements don't fit" reflects a lack of urgency among the majority of enterprises; however, since the data shows that most early adopters use Spark to replace MapReduce, this group will likely re-evaluate their requirements as the need for data velocity increases.
"Too immature" refers to integrations with middleware, platforms, tooling and programming languages. As adoption increases, you should check the Spark pages regularly for updates on feature and API maturity.
LARGEST BARRIERS TO USING SPARK EFFECTIVELY
1st Low Awareness / Experience
2nd Current Requirements Don't Fit
3rd Too Immature
-
HOW CAN SUPPORT BE IMPROVED?
In line with the previous question, we also had a large collection of suggestions for improving support. Generally, these mirror the issues perceived as barriers to using Spark effectively in the previous question, but with some slight differences in semantics. Here are the top 3 sentiment categories that we hope can serve as useful feedback for future Spark development.
"Integration, integration, integration!" comes in loudly as a definite requirement for many users, some of whom may not be aware of currently supported technologies, since they specifically mentioned Scala, Java and Hadoop, which are first-class citizens for Spark.
"Deeper examples, docs & tutorials" are important for making the case for Spark. We see documentation, more real-life case studies and tutorial options from vendors as answering these needs.
"Maturity through features" is the final area where respondents see a lot of room to improve. Specifically mentioned are immaturity in the Spark feature set related to the client and streaming functionality, issues related to clustering and the overall stability of Spark in production.
TOP 3 SUPPORT IMPROVEMENTS
1st Integration, Integration, Integration!
2nd Deeper Examples, Docs & Tutorials
3rd Maturity Through Features
-
Final Thoughts
Spark has become the Big Data tool of choice for a future of Reactive Systems, fueled by organizations in need of faster data and event streaming features.
-
FINAL THOUGHTS
By this point, we're sure you understand that Spark awareness and adoption are experiencing remarkable growth. Developers have a pent-up need to eliminate issues with MapReduce, such as a difficult API, poor performance, and restriction to batch jobs only.
You should consider Spark as the tool that meets these needs, providing excellent performance at scale, a concise and intuitive API, and support for event stream processing and iterative algorithms.
Spark is less mature than older technologies, like MapReduce, so developers also need good documentation, example applications, and guidance on runtime performance tuning, management and monitoring. Spark is also driving interest in Scala, the language in which Spark is written, but developers and data scientists can also use Java, Python, and soon, R.
It's all very good, more or less. So if you, like our sensible PR team, were looking for the "Top 3 Takeaways From This Survey", here they are in more shareable form:
Spark awareness and adoption are seeing rapid growth.
Google Trends confirms this and the survey shows that 72% of respondents have at least evaluation or research experience with Spark, while 35% are using it or have decided to implement it.
Faster data processing and event streaming are the focus for enterprises.
By far the most desirable features are Spark's vastly improved processing power over MapReduce (over 78% mention this) and the ability to process event streams (over 66% mention this), a limitation of current technologies.
Perceived barriers to adoption are not major blockers.
When asked, respondents mentioned lack of in-house experience and perceived immaturity of some Spark components and integrations with other middleware and management tools. Also cited are needs for better commercial support options and for more comprehensive documentation and advanced examples.
-
DON'T WORRY... WE HAVE MORE FOR YOU HERE
Introducing the Typesafe Reactive Platform
Hands-on Spark Workshop with Typesafe Activator
Getting Started with Spark
Typesafe (Twitter: @Typesafe) is dedicated to helping developers build Reactive applications on the JVM. Backed by Greylock Partners, Shasta Ventures, Bain Capital Ventures and Juniper Networks, Typesafe is headquartered in San Francisco with offices in Switzerland and Sweden. To start building Reactive applications today, download Typesafe Activator.
© 2015 Typesafe