cascading 2015 user survey results

Upload: cascading

Post on 17-Aug-2015

Page 1: Cascading 2015 User Survey Results

Confidential

The Rise of Cascading

2015 Cascading User Survey Results

Page 2: Cascading 2015 User Survey Results

WHAT’S BEHIND THE RISE OF CASCADING?

Enterprise IT teams designing their big data platforms must choose from a daunting array of development frameworks and compute fabrics. On the one hand, they want a development framework that leverages existing skill sets. At the same time, they want the flexibility to benefit from the performance gains of the latest compute fabrics.

Cascading is a robust framework with over 10,000 known production deployments and over 275,000 downloads per month. Twitter, Airbnb, Climate Corp, Apple, eBay and Netflix are a few of the enterprises that have built their Hadoop practices with Cascading. The Cascading user group is a diverse, self-supporting community that is helping to improve Cascading’s scalability, portability, performance and value. In addition, the large number of open source projects contributed by mainstream enterprises such as Netflix, Commonwealth Bank of Australia and Expedia attests to the vibrancy of the Cascading ecosystem.

In this paper, we'll reveal what’s behind Cascading's growth by digging into the results of a new Cascading user survey. In general, Cascading users turn out to be extremely concerned about reliability and performance at scale. Many experimented with early Hadoop frameworks like Hive and Pig, but found Cascading to be a more scalable approach. And lately, the easy portability of Cascading applications between compute fabrics has generated a lot of excitement in the community.

Page 3: Cascading 2015 User Survey Results

CASCADING IS MOST POPULAR AMONG BUILDERS AND MANAGERS OF BIG DATA APPLICATIONS

[Bar chart: What title best describes your role? (N=121). Categories: Head/VP of IT; Head of IT Infrastructure; Application Manager/Director; BI/EDW Manager/Director; CIO/SVP of IT; IT Specialist; Architect; IT Manager or Director; Developer/Engineer. Horizontal axis: 0 to 70.]

Photo: Liverpool Street station crowd blur, by David Sim.

Page 4: Cascading 2015 User Survey Results

CASCADING COMMUNITY MEMBERS ARE MATURE, PRODUCTION USERS

[Pie chart: How long have you been using Hadoop? (N=69). Slices: 8%, 26%, 25%, 41%. Legend: 0-12 months; 12-24 months; 24-36 months; Over 3 years.]

The largest group of respondents has been using Hadoop for over 3 years. Assuming the sample is representative, the Cascading community largely consists of early Hadoop adopters.

Furthermore, the Cascading community isn’t just dabbling: over 84% have already put their Cascading applications into production or plan to do so.

As for why, many likely found out the hard way that developing directly on Hadoop was painful, tedious and poorly suited to scale.

[Bar chart: What challenges did you have that made you look for an application development framework? Categories: Other; Poor integration into existing IT infrastructure; Lack of scalability; Lack of portability across compute fabrics; Difficult to integrate to existing systems; Poor troubleshooting capabilities; Lack of skilled Hadoop resources; High cost of development in existing platform; Slow development in existing platform. Horizontal axis: 0 to 45.]

Page 5: Cascading 2015 User Survey Results

THE PATH TO CASCADING: HIVE, PIG, AND GUI TOOLS

Given the maturity of Cascading users, it’s no surprise that many explored alternatives before settling on Cascading. The majority (51%) tried Hive and Pig, both of which were early abstraction layers for MapReduce. Today, many Pig applications run alongside Cascading and many Hive applications run within Cascading.

Why didn’t they stick with Hive and Pig? Most organizations determined they could not scale with them. Typically that was because Hive and Pig required scarce technical resources and because development in those frameworks was slow. Those who opted for other API frameworks found them not yet ready for the enterprise.

A smaller group experimented with GUI-based ETL tools. While these tools made it easy to leverage existing resources and skill sets, their capabilities were too limited. They also required building special scripts to achieve complex functionality, which negated the benefits of simplicity. Additionally, many users did not like being locked into a single-vendor solution.

[Pie chart: Before selecting Cascading, what alternative solutions did you explore? (select all that apply) (N=69). Slices: 26%, 25%, 22%, 19%, 8%. Legend: Pig; Hive; Other API frameworks (Spark, Crunch); GUI-based ETL tools (Talend, Informatica, Pentaho); No other alternatives were explored.]

Page 6: Cascading 2015 User Survey Results

PORTABILITY ACROSS FABRICS

[Bar chart: Which compute fabric(s) are you using or planning to use in the next 18 months? (N=69). Categories: Other; Flink; Tez; Storm; Kafka; MapReduce; Spark. Horizontal axis: 0 to 60.]

New compute fabrics appear all the time, though not all are production-ready. The responses reflect high interest in Spark and a desire for true streaming (not micro-batches).

MapReduce isn’t going away any time soon, especially where reliability is a requirement. Still, many are experimenting with other compute fabrics. Because each fabric offers application-specific advantages, most organizations will likely wind up running multiple fabrics.

Cascading 3.0 supports Tez, MapReduce, and a local (in-memory) mode, so users can port applications from MapReduce to Tez simply by changing a few lines of code. Easy portability makes Cascading an ideal platform for moving from MapReduce to Tez without incurring the cost of rewriting applications. Soon, Cascading will support the same portability for Spark and Flink (for Flink, support will be community contributed).
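The idea of changing fabrics by editing only a few lines can be pictured with a small sketch. This is not Cascading's Java API. It is a hypothetical Python illustration in which the pipeline logic is written once and the execution backend is picked in a single place. The names `tokenize`, `run_local`, `run_chunked`, `FABRICS` and `word_count` are assumptions made up for this example.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, List


def tokenize(line: str) -> List[str]:
    """Pipeline logic, written once and independent of where it runs."""
    return line.lower().split()


def run_local(lines: Iterable[str]) -> Dict[str, int]:
    """Hypothetical fabric 1: a single in-memory pass over the input."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(tokenize(line))
    return dict(counts)


def run_chunked(lines: Iterable[str], chunk_size: int = 2) -> Dict[str, int]:
    """Hypothetical fabric 2: count per chunk, then merge partial results.

    This stands in for a distributed backend; nothing here is distributed.
    """
    data = list(lines)
    total: Counter = Counter()
    for start in range(0, len(data), chunk_size):
        partial: Counter = Counter()
        for line in data[start:start + chunk_size]:
            partial.update(tokenize(line))
        total.update(partial)  # merge partial counts into the total
    return dict(total)


# The only place that knows which backends exist.
FABRICS: Dict[str, Callable[[Iterable[str]], Dict[str, int]]] = {
    "local": run_local,
    "chunked": run_chunked,
}


def word_count(lines: Iterable[str], fabric: str = "local") -> Dict[str, int]:
    """Run the same pipeline on the selected backend."""
    return FABRICS[fabric](lines)
```

Switching backends means changing only the `fabric` argument, for example `word_count(lines, fabric="chunked")`. The tokenizing logic is untouched.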

Page 7: Cascading 2015 User Survey Results

CASCADING BRIDGES OTHER DEVELOPMENT FRAMEWORKS

Despite their shortcomings, MapReduce, Hive and Pig are still widely in use as development frameworks, largely because many early Hadoop applications were built through these interfaces. It is no surprise that we see a lot of excitement about Spark as a new development framework as well; many users are experimenting with developing directly in the Spark API.

Cascading will support Spark in a future WIP release, adding an important framework option for Spark developers. Developers who build in Cascading will be able to port their applications from MapReduce to Spark without having to rewrite them in the Spark API.

In summary, there is no one-size-fits-all framework. Flexibility is key as organizations build out their big data strategies and platforms.

[Bar chart: What data application development framework do you use? (N=69). Categories: Cascalog; Scalding; Pig; Hive; MapReduce; Cascading; Spark. Horizontal axis: 0 to 60.]

“[Cascading] Best Hadoop API for enterprise data-intensive apps.” – Architect, Fortune 500 Healthcare Payer

Page 8: Cascading 2015 User Survey Results

COMMON USE CASES: ETL, ANALYTICS & DATA INTEGRATION

Most organizations rely on Hadoop for heavy processing steps within ETL, analytics or data integration flows. Some have moved their entire ETL processing to Hadoop, while others have moved only portions of their workflows.

For example, Airbnb uses Cascading for complicated infrastructure tasks such as data normalization and cleansing. Airbnb also leverages Cascading for reconstructing corrupted files and merging data. In combination with Cascading, analysts use Pig and Hive to run batch scripts for ad hoc analysis.

With these tools, analysts can more easily study crucial metrics like click-through rates, page statistics, and drop-off rates.

[Bar chart: What best describes the projects where you are using Cascading? (N=69). Categories: Other; Search Optimization; Recommendation Engines; Data Quality; Machine Learning and Scoring; Data Integration; Analytics; ETL. Horizontal axis: 0 to 50.]

45%: Offloading ETL to Hadoop
40%: To Support Analytics/BI Projects
33%: Data Integration Projects

Page 9: Cascading 2015 User Survey Results

WHY THEY LOVE CASCADING: TDD, JAVA API, PORTABILITY

How likely is it that you would recommend Cascading to a friend or colleague? (N=79)

10 (Extremely likely): 23%
9: 10%
8: 20%
7: 19%
6: 11%
5: 6%
4: 1%
3: 3%
2: 4%
0 (Not at all likely): 3%

Top 3 Most Impactful Capabilities

- Test-Driven Development (49%): Efficiently test code and process local files before you deploy on a cluster with Cascading’s local or in-memory mode. Incorporate inline data assertions to define results at any point in your pipeline. Failed assertions are easily visible and available for analysis.

- Java API (44%): Cascading is a Java library and does not require installation. Cascading fits directly into a standard development process; all you have to do is code to the API.

- Application Portability (43%): When you compile a Cascading job, it automatically creates a run-time executable for your specified compute fabric. Simply by changing a few lines of code, you can test your application on multiple fabrics and choose the best for your needs.
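The test-driven workflow described in the first capability (run locally on a small file, check records with inline assertions, inspect failures) can be sketched roughly as follows. This is a hypothetical Python illustration and not Cascading's assertion API. The record layout (`user`, `clicks`) and the names `parse`, `assert_non_negative` and `run_pipeline` are assumptions chosen for the example.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Record = Dict[str, object]


def parse(line: str) -> Record:
    """Turn a 'user, clicks' text line into a record (assumed layout)."""
    user, clicks = line.split(",")
    return {"user": user.strip(), "clicks": int(clicks)}


def assert_non_negative(record: Record) -> bool:
    """Inline data assertion: click counts must be non-negative integers."""
    clicks = record["clicks"]
    return isinstance(clicks, int) and clicks >= 0


def run_pipeline(
    lines: Iterable[str],
    assertion: Callable[[Record], bool],
) -> Tuple[List[Record], List[Record]]:
    """Run the pipeline in memory and keep failed records for analysis."""
    passed: List[Record] = []
    failed: List[Record] = []
    for line in lines:
        record = parse(line)
        if assertion(record):
            passed.append(record)
        else:
            failed.append(record)  # kept visible instead of being dropped
    return passed, failed
```

Calling `run_pipeline(["ann, 3", "bob, -1"], assert_non_negative)` on a tiny local sample returns the passing records and, separately, the records that violated the assertion, so the check can be exercised before anything is deployed to a cluster.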

53% of respondents are promoters (score of 8/10 or higher)
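The 53% promoter figure can be checked against the score distribution on this slide. The sketch below assumes each percentage pairs with the score printed next to it (10: 23%, 9: 10%, 8: 20%, and so on) and that the slide counts scores of 8 to 10 as promoters. No value for a score of 1 appears on the slide, so it is left out. The names `recommend_share` and `share_at_or_above` are made up for the example.

```python
from typing import Dict

# Share of respondents (percent) per recommendation score, as read from the
# slide. Assumption: each percentage belongs to the score printed beside it.
# No value for score 1 appears on the slide.
recommend_share: Dict[int, int] = {
    10: 23, 9: 10, 8: 20, 7: 19, 6: 11, 5: 6, 4: 1, 3: 3, 2: 4, 0: 3,
}


def share_at_or_above(shares: Dict[int, int], threshold: int) -> int:
    """Sum the percentage of respondents scoring at or above the threshold."""
    return sum(pct for score, pct in shares.items() if score >= threshold)


total = sum(recommend_share.values())              # 100
promoters = share_at_or_above(recommend_share, 8)  # 23 + 10 + 20 = 53
```

Under these assumptions the shares sum to 100% and scores of 8 to 10 account for 53%, matching the slide. The conventional Net Promoter Score definition counts only scores of 9 and 10 as promoters, which would give 33% here.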

Page 10: Cascading 2015 User Survey Results

CASCADING IMPROVES PRODUCTIVITY

Most increased productivity by at least 40%

[Pie chart: What percentage would you estimate the productivity of your staff has improved? (N=79). Slices: 7%, 16%, 7%, 18%, 26%, 16%, 10%. Legend: Over 300%; Over 100%; 80%-100%; 60%-80%; 40%-60%; 20%-40%; Less than 20%.]

Page 11: Cascading 2015 User Survey Results

CASCADING SLASHES TIME TO MARKET

Most improved time to market by at least 40%

[Pie chart: What percentage would you estimate your time to market has improved? (N=79). Slices: 5%, 17%, 12%, 18%, 17%, 18%, 13%. Legend: Over 300%; Over 100%; 80%-100%; 60%-80%; 40%-60%; 20%-40%; Less than 20%.]

Page 12: Cascading 2015 User Survey Results

THE FUTURE: BETTER PERFORMANCE, DATA PIPELINE VISIBILITY

[Bar chart: What future challenges do you anticipate in managing your data applications? (N=69). Categories: Other; Supporting chargeback models; Forecasting big data infrastructure needs; Monitoring SLAs for Hadoop applications; Identify and resolve Hadoop application issues faster; Optimizing application performance. Horizontal axis: 0 to 60.]

Application performance management is a top-of-mind concern for most respondents. While performance tuning happens on the operations side, optimizing applications to meet service-level commitments is usually a collaborative effort between development and operations teams.

Developers need better tools to visualize data pipelines and detect undesirable behavior before they promote applications to production. Operations teams need better tools to monitor, manage and optimize data delivery.

An important, though secondary, concern is tracking the rate of Hadoop resource consumption so clusters can be right-sized and costs distributed across divisions. This is particularly true as more of an organization’s departments and teams build and rely on big data applications, transforming their Hadoop cluster from a side project into core production IT infrastructure.

With new application performance management tools such as Driven, teams can visualize data pipelines and identify unwanted behavior more effectively. Tools like Driven also arm teams with the data necessary to pinpoint issues quickly and resolve them collaboratively.

Page 13: Cascading 2015 User Survey Results

APPENDIX

Page 14: Cascading 2015 User Survey Results

DISTRIBUTIONS

[Bar chart: Distributions (N=69). Categories: Other (please specify); MapR; Hortonworks; Apache Hadoop; Amazon EMR; Cloudera. Horizontal axis: 0 to 40.]

Page 15: Cascading 2015 User Survey Results

NUMBER OF APPLICATIONS AND VOLUME

[Chart: Average Number of Cascading Applications and Pipelines (N=69). Application-count bands: Over 100; 60-100; 30-60; 15-30; 5-15; 1-5. Vertical axis: 0 to 40.]

Respondent counts per pipeline band (non-empty cells only):
Less than 250 pipelines: 4, 5, 4, 26
250 - 500 pipelines: 1, 3, 5
500 - 1,000 pipelines: 2, 2, 1, 1, 2
1,000 - 2,500 pipelines: 2, 3, 1
2,500 - 5,000 pipelines: 1, 1
Over 5,000 pipelines: 1
Over 10,000 pipelines: 1, 1, 2

Page 16: Cascading 2015 User Survey Results

PRODUCTION STATUS

[Bar chart: Are you using your Cascading data applications in a production environment? (N=69). Categories: No and not planned; Not yet but planned; Yes. Horizontal axis: 0 to 50.]