cascading 2015 user survey results

Confidential

The Rise of Cascading

2015 Cascading User Survey Results


WHAT’S BEHIND THE RISE OF CASCADING?

Enterprise IT teams designing their big data platforms must choose from a daunting array of development frameworks and compute fabrics. On the one hand, they want a development framework that leverages existing skill sets. At the same time, they want the flexibility to benefit from the performance gains of the latest compute fabrics.

Cascading is a robust framework with over 10,000 known production deployments and over 275,000 downloads per month. Twitter, AirBnB, Climate Corp, Apple, eBay and Netflix are a few of the enterprises that have built their Hadoop practices with Cascading. The Cascading user group is a diverse, self-supporting community that is helping to improve Cascading’s scalability, portability, performance and value. In addition, the large number of open source projects contributed by mainstream enterprises such as Netflix, Commonwealth Bank of Australia and Expedia attests to the vibrancy of the Cascading ecosystem.

In this paper, we'll reveal what’s behind Cascading's growth by digging into the results of a new Cascading user survey. In general, Cascading users turn out to be extremely concerned about reliability and performance at scale. Many experimented with early Hadoop frameworks like Hive and Pig, but found Cascading to be a more scalable approach. And lately, the easy portability of Cascading applications between compute fabrics has generated a lot of excitement in the community.


[Bar chart: What title best describes your role? (N=121). Categories: Head/VP of IT; Head of IT Infrastructure; Application Manager/Director; BI/EDW Manager/Director; CIO/SVP of IT; IT Specialist; Architect; IT Manager or Director; Developer/Engineer. Axis range: 0 to 70.]

Photo: Liverpool Street station crowd blur, by David Sim.

CASCADING IS MOST POPULAR AMONG BUILDERS AND MANAGERS OF BIG DATA APPLICATIONS


CASCADING COMMUNITY MEMBERS ARE MATURE, PRODUCTION USERS

[Pie chart: How long have you been using Hadoop? (N=69). Categories: 0-12 months; 12-24 months; 24-36 months; Over 3 years. Segment values as extracted: 8%, 26%, 25%, 41%.]

The largest group of respondents (41%) has been using Hadoop for over 3 years. Assuming the sample is representative, the Cascading community largely consists of early Hadoop adopters.

Furthermore, the Cascading community isn’t just dabbling: Over 84% have already put their Cascading applications into production or plan to do so.

As for why, many likely found out the hard way that developing directly on Hadoop was painful, tedious and poorly suited to scale.

[Bar chart: What challenges did you have that made you look for an application development framework? Categories: Other; Poor integration into existing IT infrastructure; Lack of scalability; Lack of portability across compute fabrics; Difficult to integrate to existing systems; Poor troubleshooting capabilities; Lack of skilled Hadoop resources; High cost of development in existing platform; Slow development in existing platform. Axis range: 0 to 45.]


THE PATH TO CASCADING: HIVE, PIG, AND GUI TOOLS

N=69

Given the maturity of Cascading users, it’s no surprise that many explored alternatives before settling on Cascading. The majority (51%) tried Hive and Pig, both of which were early abstraction layers for MapReduce. Today, many Pig applications run alongside Cascading and many Hive applications run within Cascading.

Why didn’t they stick with Hive and Pig? Most organizations determined they could not scale with Hive and Pig. Typically that was because Hive and Pig required scarce technical resources and because development in those frameworks was slow. Those who opted for other API frameworks found them not yet ready for the enterprise.

A smaller group experimented with GUI-based ETL tools. While these tools made it easy to leverage existing resources and skill sets, their capabilities were too limited. They also required building special scripts to achieve complex functionality, which negated the benefits of simplicity. Additionally, many users did not like being locked into a single-vendor solution.

[Pie chart: Before selecting Cascading, what alternative solutions did you explore? (select all that apply). Categories: Pig; Hive; Other API frameworks (Spark, Crunch); GUI-based ETL tools (Talend, Informatica, Pentaho); No other alternatives were explored. Segment values as extracted: 26%, 25%, 22%, 19%, 8%.]


[Bar chart: Which compute fabric(s) are you using or planning to use in the next 18 months? Categories: Other; Flink; Tez; Storm; Kafka; MapReduce; Spark. Axis range: 0 to 60.]

PORTABILITY ACROSS FABRICS

N=69

New compute fabrics appear all the time, though not all are production-ready. The responses reflect high interest in Spark and a desire for true streaming (not micro-batches).

MapReduce isn’t going away any time soon, especially where reliability is a requirement. Still, many are experimenting with other compute fabrics. Because each fabric offers application-specific advantages, most organizations will likely wind up running multiple fabrics.

Cascading 3.0 supports Tez, MapReduce, and local/in-memory, so users can port applications from MapReduce to Tez simply by changing a few lines of code. Easy portability makes Cascading an ideal platform for moving from MapReduce to Tez without incurring the cost of rewriting applications. Soon, Cascading will support the same portability for Spark and Flink (for Flink, support will be community contributed).


CASCADING BRIDGES OTHER DEVELOPMENT FRAMEWORKS

N=69

Despite their shortcomings, MapReduce, Hive and Pig are still widely used as development frameworks, largely because many early Hadoop applications were built through these interfaces. It is no surprise that we also see a lot of excitement about Spark as a new development framework; many users are experimenting with developing directly in the Spark API.

Cascading will support Spark in a future work-in-progress (WIP) release, adding an important framework option for Spark developers. Developers who build in Cascading will be able to port their applications from MapReduce to Spark without having to rewrite them in the Spark API.

In summary, there is no one-size-fits-all framework. Flexibility is key as organizations build out their big data strategies and platforms.

[Bar chart: What data application development framework do you use? Categories: Cascalog; Scalding; Pig; Hive; MapReduce; Cascading; Spark. Axis range: 0 to 60.]

“[Cascading] Best Hadoop API for enterprise data-intensive apps.” – Architect, Fortune 500 Healthcare Payer


COMMON USE CASES: ETL, ANALYTICS & DATA INTEGRATION

N=69

Most organizations rely on Hadoop for heavy processing steps within ETL, analytics or data integration flows. Some have moved their entire ETL processing to Hadoop, while others have moved only portions of their workflows.

For example, AirBnB uses Cascading for complicated infrastructure tasks such as data normalization and cleansing. AirBnB also leverages Cascading for reconstructing corrupted files and merging data. In combination with Cascading, Pig and Hive are used by analysts to run batch scripts to perform ad hoc analysis.

With these tools, analysts can more easily study crucial metrics like click-through rates, page statistics, and drop-off rates.

[Bar chart: What best describes the projects where you are using Cascading? Categories: Other; Search Optimization; Recommendation Engines; Data Quality; Machine Learning and Scoring; Data Integration; Analytics; ETL. Axis range: 0 to 50.]

45%: Offloading ETL to Hadoop

40%: To Support Analytics/BI Projects

33%: Data Integration Projects


How likely is it that you would recommend Cascading to a friend or colleague?

Score                      Share of respondents
10 (Extremely likely)      23%
9                          10%
8                          20%
7                          19%
6                          11%
5                          6%
4                          1%
3                          3%
2                          4%
0 (Not at all likely)      3%

WHY THEY LOVE CASCADING: TDD, JAVA API, PORTABILITY

N=79

Top 3 Most Impactful Capabilities

- Test-Driven Development (49%): Efficiently test code and process local files before you deploy on a cluster with Cascading’s local or in-memory mode. Incorporate inline data assertions to define results at any point in your pipeline. Failed assertions are easily visible and available for analysis.

- Java API (44%): Cascading is a Java library and does not require installation. Cascading fits directly into a standard development process; all you have to do is code to the API.

- Application Portability (43%): When you compile a Cascading job, it automatically creates a runtime executable for your specified compute fabric. Simply by changing a few lines of code, you can test your application on multiple fabrics and choose the best for your needs.

53% of respondents are promoters (a score of 8 or higher out of 10)
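As a quick check on the promoter figure, the recommendation scores on this slide can be summed directly. The sketch below is illustrative only: the names `RECOMMEND_SHARES` and `share_at_or_above` are made up for this example, the shares are the percentages shown next to each score, and a score of 1 is assumed to be 0% because the chart shows no value for it. Reading "(8/10)" as a threshold of 8 out of 10 is also an assumption, although it is the reading that reproduces the 53% figure.

```python
# Share of respondents (in percent) per recommendation score, as shown on the slide.
# Assumption: score 1 is absent from the chart, so it is treated as 0%.
RECOMMEND_SHARES = {10: 23, 9: 10, 8: 20, 7: 19, 6: 11, 5: 6, 4: 1, 3: 3, 2: 4, 1: 0, 0: 3}


def share_at_or_above(shares, threshold):
    """Sum the percentage of respondents whose score is >= threshold."""
    return sum(pct for score, pct in shares.items() if score >= threshold)


# The shares should account for every respondent.
total = sum(RECOMMEND_SHARES.values())

# Assumed reading of the slide: scores of 8 and above count as promoters.
promoters = share_at_or_above(RECOMMEND_SHARES, 8)

print(f"total={total}% promoters={promoters}%")  # prints "total=100% promoters=53%"
```

Keeping the threshold as a parameter makes it easy to see how the promoter share changes under a stricter or looser cutoff.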


CASCADING IMPROVES PRODUCTIVITY

N=79

[Pie chart: What percentage would you estimate the productivity of your staff has improved? Categories: Over 300%; Over 100%; 80%-100%; 60%-80%; 40%-60%; 20%-40%; Less than 20%. Segment values as extracted: 7%, 16%, 7%, 18%, 26%, 16%, 10%.]

Most increased productivity by at least 40%


CASCADING SLASHES TIME TO MARKET

N=79

Most improved time to market by at least 40%

[Pie chart: What percentage would you estimate your time to market has improved? Categories: Over 300%; Over 100%; 80%-100%; 60%-80%; 40%-60%; 20%-40%; Less than 20%. Segment values as extracted: 5%, 17%, 12%, 18%, 17%, 18%, 13%.]


N=69

[Bar chart: What future challenges do you anticipate in managing your data applications? Categories: Other; Supporting chargeback models; Forecasting big data infrastructure needs; Monitoring SLAs for Hadoop applications; Identify and resolve Hadoop application issues faster; Optimizing application performance. Axis range: 0 to 60.]

THE FUTURE: BETTER PERFORMANCE, DATA PIPELINE VISIBILITY

Application performance management is a top-of-mind concern for most respondents. While performance tuning happens on the operations side, optimizing applications to meet service-level commitments is usually a collaborative effort between development and operations teams.

Developers need better tools to visualize data pipelines and detect undesirable behavior before they promote applications to production. Operations teams need better tools to monitor, manage and optimize data delivery.

An important, though secondary, concern is tracking the rate of Hadoop resource consumption so clusters can be right-sized and costs distributed across divisions. This is particularly true as more of an organization’s departments and teams build and rely on big data applications, transforming their Hadoop cluster from a side project into core production IT infrastructure.

With new application performance management tools such as Driven, teams can visualize data pipelines and identify unwanted behavior more effectively. Tools like Driven also arm teams with the data necessary to pinpoint issues quickly and resolve them collaboratively.


APPENDIX


DISTRIBUTIONS

[Bar chart: Distributions (N=69). Categories: Other (please specify); MapR; Hortonworks; Apache Hadoop; Amazon EMR; Cloudera. Axis range: 0 to 40.]


NUMBER OF APPLICATIONS AND VOLUME

[Chart: Average Number of Cascading Applications and Pipelines (N=69). Column headers: Over 100; 60-100; 30-60; 15-30; 5-15; 1-5. Axis range: 0 to 40. Counts per row as extracted (empty cells omitted):

Less than 250 pipelines: 4, 5, 4, 26
500 - 1,000 pipelines: 2, 2, 1, 1, 2
250 - 500 pipelines: 1, 3, 5
2,500 - 5,000 pipelines: 1, 1
1,000 - 2,500 pipelines: 2, 3, 1
Over 5,000 pipelines: 1
Over 10,000 pipelines: 1, 1, 2]


PRODUCTION STATUS

[Bar chart: Are you using your Cascading data applications in a production environment? (N=69). Categories: No and not planned; Not yet but planned; Yes. Axis range: 0 to 50.]
