Cascading 2015 User Survey Results
TRANSCRIPT
Confidential
WHAT’S BEHIND THE RISE OF CASCADING?
Enterprise IT teams designing their big data platforms must choose from a daunting array of development frameworks and compute fabrics. On the one hand, they want a development framework that leverages existing skill sets. At the same time, they want the flexibility to benefit from performance gains of the latest, greatest compute fabrics.
Cascading is a robust framework with over 10,000 known production deployments and over 275,000 downloads per month. Twitter, AirBnB, Climate Corp, Apple, EBay, and Netflix are a few of the enterprises that have built their Hadoop practices with Cascading. The Cascading user group is a diverse, self-supporting community that is helping innovate Cascading’s scalability, portability, performance and value. In addition, the large number of open source projects contributed by mainstream enterprises such as Netflix, Commonwealth Bank of Australia, and Expedia attests to the vibrancy of the Cascading ecosystem.
In this paper, we'll reveal what’s behind Cascading's growth by digging into the results of a new Cascading user survey. In general, Cascading users turn out to be extremely concerned about reliability and performance at scale. Many experimented with early Hadoop frameworks like Hive and Pig, but found Cascading to be a more scalable approach. And lately, the easy portability of Cascading applications between compute fabrics has generated a lot of excitement in the community.
CASCADING IS MOST POPULAR AMONG BUILDERS AND MANAGERS OF BIG DATA APPLICATIONS
[Bar chart: What title best describes your role? Categories: Head/VP of IT; Head of IT Infrastructure; Application Manager/Director; BI/EDW Manager/Director; CIO/SVP of IT; IT Specialist; Architect; IT Manager or Director; Developer/Engineer. Horizontal axis: 0 to 70. N=121]
Liverpool Street station crowd blur. Photo by David Sim.
CASCADING COMMUNITY MEMBERS ARE MATURE, PRODUCTION USERS
[Pie chart: How long have you been using Hadoop? Segments: 0-12 months; 12-24 months; 24-36 months; over 3 years. Values shown: 8%, 26%, 25%, 41%. N=69]
The largest group of respondents (41%) has been using Hadoop for over 3 years. Assuming the sample is representative, the Cascading community largely consists of early Hadoop adopters.
Furthermore, the Cascading community isn’t just dabbling: Over 84% have already put their Cascading applications into production or plan to do so.
As for why, many likely found out the hard way that developing directly on Hadoop was painful, tedious and poorly suited to scale.
[Bar chart: What challenges did you have that made you look for an application development framework? Categories: Other; Poor integration into existing IT infrastructure; Lack of scalability; Lack of portability across compute fabrics; Difficult to integrate to existing systems; Poor troubleshooting capabilities; Lack of skilled Hadoop resources; High cost of development in existing platform; Slow development in existing platform. Horizontal axis: 0 to 45]
THE PATH TO CASCADING: HIVE, PIG, AND GUI TOOLS
N=69
Given the maturity of Cascading users, it’s no surprise that many explored alternatives before settling on Cascading. The majority (51%) tried Hive and Pig, both of which were early abstraction layers for MapReduce. Today, many Pig applications run alongside Cascading and many Hive applications run within Cascading.
Why didn’t they stick with Hive and Pig? Most organizations determined they could not scale with Hive and Pig. Typically that was because Hive and Pig required scarce technical resources and because development in those frameworks was slow. Those who opted for other API frameworks found them not yet ready for the enterprise.
A smaller group experimented with GUI-based ETL tools. While these tools made it easy to leverage existing resources and skill sets, their capabilities were too limited. They also required building special scripts to achieve complex functionality, which negated the benefits of simplicity. Additionally, many users did not like being locked into a single-vendor solution.
[Pie chart: Before selecting Cascading, what alternative solutions did you explore? (select all that apply) Options: Pig; Hive; Other API frameworks (Spark, Crunch); GUI-based ETL tools (Talend, Informatica, Pentaho); No other alternatives were explored. Values shown: 26%, 25%, 22%, 19%, 8%]
PORTABILITY ACROSS FABRICS
[Bar chart: Which compute fabric(s) are you using or planning to use in the next 18 months? Categories: Other; Flink; Tez; Storm; Kafka; MapReduce; Spark. Horizontal axis: 0 to 60. N=69]
New compute fabrics appear all the time, though not all are production-ready. The responses reflect high interest in Spark and a desire for true streaming (not micro-batches).
MapReduce isn’t going away any time soon, especially where reliability is a requirement. Still, many are experimenting with other compute fabrics. Because each fabric offers application-specific advantages, most organizations will likely wind up running multiple fabrics.
Cascading 3.0 supports Tez, MapReduce, and local/in-memory, so users can port applications from MapReduce to Tez simply by changing a few lines of code. Easy portability makes Cascading an ideal platform for moving from MapReduce to Tez without incurring the cost of rewriting applications. Soon, Cascading will support the same portability for Spark and Flink (for Flink, support will be community contributed).
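The portability idea can be illustrated with a small plain-Java sketch: the pipeline is defined once, and only the object that executes it is swapped. This is not Cascading's API. `Pipeline`, `Fabric`, `SequentialFabric`, and `StagedFabric` are hypothetical names, assumed here as stand-ins for a flow definition and the fabric-specific components that run it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Illustrative only: every type here is hypothetical, not part of Cascading.
final class PortabilitySketch {

    /** A fabric-independent pipeline: an ordered list of per-record steps. */
    static final class Pipeline {
        private final List<Function<String, String>> steps = new ArrayList<>();

        Pipeline then(Function<String, String> step) {
            steps.add(step);
            return this;
        }

        List<Function<String, String>> steps() {
            return steps;
        }
    }

    /** Stand-in for a compute fabric that knows how to execute a pipeline. */
    interface Fabric {
        List<String> run(Pipeline pipeline, List<String> input);
    }

    /** Pushes each record through all steps before moving to the next record. */
    static final class SequentialFabric implements Fabric {
        @Override
        public List<String> run(Pipeline pipeline, List<String> input) {
            List<String> out = new ArrayList<>();
            for (String record : input) {
                String value = record;
                for (Function<String, String> step : pipeline.steps()) {
                    value = step.apply(value);
                }
                out.add(value);
            }
            return out;
        }
    }

    /** Applies one step to the whole batch before starting the next step. */
    static final class StagedFabric implements Fabric {
        @Override
        public List<String> run(Pipeline pipeline, List<String> input) {
            List<String> current = new ArrayList<>(input);
            for (Function<String, String> step : pipeline.steps()) {
                List<String> next = new ArrayList<>(current.size());
                for (String record : current) {
                    next.add(step.apply(record));
                }
                current = next;
            }
            return current;
        }
    }

    /** The application logic, written once and unaware of any fabric. */
    static Pipeline buildPipeline() {
        return new Pipeline().then(String::trim).then(String::toLowerCase);
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("  Alpha ", "BETA");
        Pipeline pipeline = buildPipeline();
        // Changing fabrics is a one-line change; the pipeline is untouched.
        Fabric fabric = new SequentialFabric();
        System.out.println(fabric.run(pipeline, input));
    }
}
```

Replacing `new SequentialFabric()` with `new StagedFabric()` changes how the work is executed but not the result, which is the property the survey respondents describe as portability.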
CASCADING BRIDGES OTHER DEVELOPMENT FRAMEWORKS
N=69
Despite their shortcomings, MapReduce, Hive and Pig are still widely in use as development frameworks, largely because many early Hadoop applications were built through these interfaces. No surprise that we see a lot of excitement about Spark as a new development framework as well; many users are experimenting with developing directly in the Spark API.
Cascading will support Spark in a future WIP (work-in-progress) release, adding an important framework option for Spark developers. Developers who build in Cascading will be able to port their applications from MapReduce to Spark without having to rewrite them in the Spark API.
In summary, there is no one-size-fits-all framework. Flexibility is key as organizations build out their big data strategies and platforms.
[Bar chart: What data application development framework do you use? Categories: Cascalog; Scalding; Pig; Hive; MapReduce; Cascading; Spark. Horizontal axis: 0 to 60]
“[Cascading] Best Hadoop API for enterprise data-intensive apps.” – Architect. Fortune 500 Healthcare Payer
COMMON USE CASES: ETL, ANALYTICS & DATA INTEGRATION
N=69
Most organizations rely on Hadoop for heavy processing steps within ETL, analytics or data integration flows. Some have moved their entire ETL processing to Hadoop, while others have moved only portions of their workflows.
For example, AirBnB uses Cascading for complicated infrastructure tasks such as data normalization and cleansing. AirBnB also leverages Cascading for reconstructing corrupted files and merging data. In combination with Cascading, Pig and Hive are used by analysts to run batch scripts to perform ad hoc analysis.
With these tools, analysts are able to more easily study crucial metrics like click-through rates, page statistics, and drop-off rates.
[Bar chart: What best describes the projects where you are using Cascading? Categories: Other; Search Optimization; Recommendation Engines; Data Quality; Machine Learning and Scoring; Data Integration; Analytics; ETL. Horizontal axis: 0 to 50]
45%: Offloading ETL to Hadoop
40%: To support analytics/BI projects
33%: Data integration projects
How likely is it that you would recommend Cascading to a friend or colleague?
10 (extremely likely): 23%
9: 10%
8: 20%
7: 19%
6: 11%
5: 6%
4: 1%
3: 3%
2: 4%
0 (not at all likely): 3%
WHY THEY LOVE CASCADING: TDD, JAVA API, PORTABILITY
N=79
Top 3 Most Impactful Capabilities
• Test Driven Development (49%) - Efficiently test code and process local files before you deploy on a cluster with Cascading’s local or in-memory mode. Incorporate inline data assertions to define results at any point in your pipeline. Failed assertions are easily visible and available for analysis.
• Java API (44%) - Cascading is a Java library and does not require installation. Cascading fits directly into a standard development process; all you have to do is code to the API.
• Application Portability (43%) - When you compile a Cascading job, it automatically creates a run-time executable for your specified compute fabric. Simply by changing a few lines of code, you can test your application on multiple fabrics and choose the best for your needs.
53% of respondents are promoters (scores of 8 to 10)
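The 53% figure can be checked with a short worked calculation. The sketch below assumes the per-score percentages shown on the slide (23% for a score of 10, 10% for 9, 20% for 8, and so on, with no value shown for a score of 1) and sums the shares at or above a cutoff. The class and method names are hypothetical. The 9-to-10 cutoff is included only for comparison, since that is the threshold commonly used in Net Promoter style scoring.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative arithmetic only; names are hypothetical.
final class PromoterShare {

    /** Percentage of respondents per score, as read from the slide. */
    static Map<Integer, Integer> slideDistribution() {
        Map<Integer, Integer> dist = new LinkedHashMap<>();
        dist.put(10, 23);
        dist.put(9, 10);
        dist.put(8, 20);
        dist.put(7, 19);
        dist.put(6, 11);
        dist.put(5, 6);
        dist.put(4, 1);
        dist.put(3, 3);
        dist.put(2, 4);
        // Score 1 has no value on the slide, so it is left out.
        dist.put(0, 3);
        return dist;
    }

    /** Sums the percentages for all scores at or above the threshold. */
    static int shareAtOrAbove(Map<Integer, Integer> dist, int threshold) {
        int total = 0;
        for (Map.Entry<Integer, Integer> entry : dist.entrySet()) {
            if (entry.getKey() >= threshold) {
                total += entry.getValue();
            }
        }
        return total;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> dist = slideDistribution();
        System.out.println("Scores 8-10: " + shareAtOrAbove(dist, 8) + "%"); // 53%
        System.out.println("Scores 9-10: " + shareAtOrAbove(dist, 9) + "%"); // 33%
        System.out.println("All scores:  " + shareAtOrAbove(dist, 0) + "%"); // 100%
    }
}
```

The listed percentages sum to 100%, and the three highest scores together account for the 53% reported as promoters.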
CASCADING IMPROVES PRODUCTIVITY
N=79
[Pie chart: What percentage would you estimate the productivity of your staff has improved? Segments: Over 300%; Over 100%; 80%-100%; 60%-80%; 40%-60%; 20%-40%; Less than 20%. Values shown: 7%, 16%, 7%, 18%, 26%, 16%, 10%]
Most increased productivity by at least 40%
CASCADING SLASHES TIME TO MARKET
N=79
Most improved time to market by at least 40%
[Pie chart: What percentage would you estimate your time to market has improved? Segments: Over 300%; Over 100%; 80%-100%; 60%-80%; 40%-60%; 20%-40%; Less than 20%. Values shown: 5%, 17%, 12%, 18%, 17%, 18%, 13%]
THE FUTURE: BETTER PERFORMANCE, DATA PIPELINE VISIBILITY
[Bar chart: What future challenges do you anticipate in managing your data applications? Categories: Other; Supporting chargeback models; Forecasting big data infrastructure needs; Monitoring SLAs for Hadoop applications; Identify and resolve Hadoop application issues faster; Optimizing application performance. Horizontal axis: 0 to 60. N=69]
Application performance management is a top-of-mind concern for most respondents. While performance tuning happens on the operations side, optimizing applications to meet service-level commitments is usually a collaborative effort between development and operations teams.
Developers need better tools to visualize data pipelines and detect undesirable behavior before they promote applications to production. Operations teams need better tools to monitor, manage and optimize data delivery.
An important, though secondary, concern is tracking the rate of Hadoop resource consumption so clusters can be right-sized and costs distributed across divisions. This is particularly true as more of an organization’s departments and teams build and rely on big data applications, transforming their Hadoop cluster from a side project into core production IT infrastructure.
With new application performance management tools such as Driven, teams can visualize data pipelines and identify unwanted behavior more effectively. Tools like Driven also arm teams with the data necessary to pinpoint issues quickly and resolve them collaboratively.
DISTRIBUTIONS
[Bar chart: Distributions. Categories: Other (please specify); MapR; Hortonworks; Apache Hadoop; Amazon EMR; Cloudera. Horizontal axis: 0 to 40. N=69]
NUMBER OF APPLICATIONS AND VOLUME
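[Chart: Average Number of Cascading Applications and Pipelines. Vertical axis: 0 to 40. N=69]
Number of applications (columns): Over 100; 60-100; 30-60; 15-30; 5-15; 1-5
Respondent counts per pipeline range, in the order extracted (the column each count belongs to is not recoverable):
Less than 250 pipelines: 4, 5, 4, 26
500 - 1,000 pipelines: 2, 2, 1, 1, 2
250 - 500 pipelines: 1, 3, 5
2,500 - 5,000 pipelines: 1, 1
1,000 - 2,500 pipelines: 2, 3, 1
Over 5,000 pipelines: 1
Over 10,000 pipelines: 1, 1, 2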