Ask Bigger Questions with Cloudera and Apache Hadoop - Big Data Day Paris 2013


Upload: xebia-france

Post on 06-May-2015


TRANSCRIPT

Page 1: Ask Bigger Questions with Cloudera and Apache Hadoop - Big Data Day Paris 2013

Ask Bigger Questions with Cloudera and Apache Hadoop
Graham Gear
[email protected]
JUNE 2013

   

Page 2: Ask Bigger Questions with Cloudera and Apache Hadoop - Big Data Day Paris 2013

Data Has Changed in the Last 30 Years

[Chart: DATA GROWTH, 1980 to 2012]

• END-USER APPLICATIONS
• THE INTERNET
• MOBILE DEVICES
• SOPHISTICATED MACHINES

STRUCTURED DATA – 10%
UNSTRUCTURED DATA – 90%

Page 3: Ask Bigger Questions with Cloudera and Apache Hadoop - Big Data Day Paris 2013

Data Management Strategies Have Stayed the Same

• Raw data on SAN, NAS and tape
• Data moved from storage to compute
• Relational models with predesigned schemas

Page 4: Ask Bigger Questions with Cloudera and Apache Hadoop - Big Data Day Paris 2013

Too Much Data, Too Many Sources

• Can’t ingest fast enough

Page 5: Ask Bigger Questions with Cloudera and Apache Hadoop - Big Data Day Paris 2013

Too Much Data, Too Many Sources

• Can’t ingest fast enough
• Costs too much to store

Page 6: Ask Bigger Questions with Cloudera and Apache Hadoop - Big Data Day Paris 2013

Too Much Data, Too Many Sources

• Can’t ingest fast enough
• Costs too much to store
• Exists in different places

Page 7: Ask Bigger Questions with Cloudera and Apache Hadoop - Big Data Day Paris 2013

Too Much Data, Too Many Sources

• Can’t ingest fast enough
• Costs too much to store
• Exists in different places
• Archived data is lost

Page 8: Ask Bigger Questions with Cloudera and Apache Hadoop - Big Data Day Paris 2013

Can’t Use It The Way You Want To

• Analysis and processing take too long

Page 9: Ask Bigger Questions with Cloudera and Apache Hadoop - Big Data Day Paris 2013

Can’t Use It The Way You Want To

• Analysis and processing take too long
• Data exists in silos

Page 10: Ask Bigger Questions with Cloudera and Apache Hadoop - Big Data Day Paris 2013

Can’t Use It The Way You Want To

• Analysis and processing take too long
• Data exists in silos
• Can’t ask new questions

Page 11: Ask Bigger Questions with Cloudera and Apache Hadoop - Big Data Day Paris 2013

Can’t Use It The Way You Want To

• Analysis and processing take too long
• Data exists in silos
• Can’t ask new questions
• Can’t analyze unstructured data

Page 12: Ask Bigger Questions with Cloudera and Apache Hadoop - Big Data Day Paris 2013

Transform The Way You Think About Data

Cloudera  

Page 13: Ask Bigger Questions with Cloudera and Apache Hadoop - Big Data Day Paris 2013

Ask Bigger Questions

• When customer x visits my store what can I recommend based on their recent web behavior across our various brand websites?
• What is the best location in North America to efficiently produce both tomato plants and corn?
• What does every fraudulent activity in the last 2 years have in common that will help us identify and proactively prevent the next incident?
• Are hotel room sales at Christmas slow because of inventory or competitive pricing?

• What did customer x view on their last website visit?
• What makes tomato plants more fruitful than others?
• What incidents of fraud did we detect last year?
• What search terms are used most often when looking for hotels in NYC?

Page 14: Ask Bigger Questions with Cloudera and Apache Hadoop - Big Data Day Paris 2013

The Cloudera Approach

Meet enterprise demands with a new way to think about data.

THE OLD WAY: Multiple platforms for multiple workloads
COMPLEX, FRAGMENTED, COSTLY
• Data silos by department or LOB
• Lots of data stored in expensive specialized systems
• Analysts pull select data into EDW
• No one has a complete view

THE CLOUDERA WAY: Single data platform to support BI, Reporting & App Serving
SIMPLIFIED, UNIFIED, EFFICIENT
• Bulk of data stored on scalable low cost platform
• Perform end-to-end workflows
• Specialized systems reserved for specialized workloads
• Provides data access across departments or LOB

Page 15: Ask Bigger Questions with Cloudera and Apache Hadoop - Big Data Day Paris 2013

   

Cloudera Enterprise: The Platform for Big Data

A Revolutionary Solution Built on Apache Hadoop

• BRINGS STORAGE & COMPUTE TOGETHER
• WORKS WITH EVERY TYPE OF DATA
• CHANGES THE ECONOMICS OF DATA MANAGEMENT

INGEST, STORE, EXPLORE, PROCESS, ANALYZE, SERVE

CDH, CLOUDERA MANAGER, CLOUDERA NAVIGATOR, CLOUDERA SUPPORT

Page 16: Ask Bigger Questions with Cloudera and Apache Hadoop - Big Data Day Paris 2013

Cloudera Enterprise Includes Advanced System Management & Support for the Core CDH Projects

CDH: 100% OPEN SOURCE HADOOP DISTRIBUTION
• CORE PROJECTS: HDFS, MAPREDUCE, FLUME, HCATALOG, HIVE, HUE, MAHOUT, OOZIE, PIG, SQOOP, WHIRR, ZOOKEEPER
• PREMIUM PROJECTS: HBASE, IMPALA, SEARCH (BETA)
• CONNECTORS: MICROSTRATEGY, NETEZZA, ORACLE, QLIKVIEW, TABLEAU, TERADATA

CLOUDERA MANAGER: END-TO-END SYSTEM MANAGEMENT
• DEPLOYMENT, MONITORING, API, SNMP, CONFIG ROLLBACKS, PHONE HOME
• SERVICE MGMT, DIAGNOSTICS, ROLLING UPGRADES, LDAP, REPORTING, BACKUP/DR

CLOUDERA NAVIGATOR: END-TO-END DATA MANAGEMENT
• ACCESS MGMT, DATA AUDIT

CLOUDERA SUPPORT: BEST-IN-CLASS TECHNICAL SUPPORT, COMMUNITY ADVOCACY & INDEMNIFICATION

CORE HADOOP PROJECTS, CLOUDERA MANAGER, CLOUDERA NAVIGATOR, HBASE, IMPALA, SEARCH

Page 17: Ask Bigger Questions with Cloudera and Apache Hadoop - Big Data Day Paris 2013

RTD Subscription Includes Support & Indemnity for Apache HBase

   


Page 18: Ask Bigger Questions with Cloudera and Apache Hadoop - Big Data Day Paris 2013

RTQ Subscription Includes Support & Indemnity for Cloudera Impala

   


Page 19: Ask Bigger Questions with Cloudera and Apache Hadoop - Big Data Day Paris 2013

RTS Subscription Includes Support & Indemnity for Cloudera Search

   


Page 20: Ask Bigger Questions with Cloudera and Apache Hadoop - Big Data Day Paris 2013

BDR Subscription Includes Centralized Management For Disaster Recovery Workflows

   


Page 21: Ask Bigger Questions with Cloudera and Apache Hadoop - Big Data Day Paris 2013

Navigator Subscription Enables Cloudera Navigator for Automated Data Management

   


Page 22: Ask Bigger Questions with Cloudera and Apache Hadoop - Big Data Day Paris 2013

Customer Case Studies

   

Page 23: Ask Bigger Questions with Cloudera and Apache Hadoop - Big Data Day Paris 2013

A multinational bank saves millions by optimizing DW for analytics & reducing data storage costs by 99%.

Ask Bigger Questions: How can we optimize our data warehouse investment?

Page 24: Ask Bigger Questions with Cloudera and Apache Hadoop - Big Data Day Paris 2013

Cloudera optimizes the EDW, saves millions

Multinational bank saves millions by optimizing existing DW for analytics & reducing data storage costs by 99%.

The Challenge:
• Teradata EDW at capacity: ETL processes consume 7 days; takes 5 weeks to make historical data available for analysis
• Performance issues in business critical apps; little room for discovery, analytics, ROI from opportunities

The Solution:
• Cloudera Enterprise offloads data storage, processing & some analytics from EDW
• Teradata can focus on operational functions & analytics

Page 25: Ask Bigger Questions with Cloudera and Apache Hadoop - Big Data Day Paris 2013

A Semiconductor Manufacturer uses predictive analytics to take preventative action on chips likely to fail.

Ask Bigger Questions: Which semiconductor chips will fail?

Page 26: Ask Bigger Questions with Cloudera and Apache Hadoop - Big Data Day Paris 2013

Cloudera enables better predictions

Semiconductor manufacturer can prevent chip failure with more accurate predictive yield models.

The Challenge:
• Want to capture more granular and historical data for more accurate predictive yield modeling
• Storing 9 months’ data on Oracle is expensive

The Solution:
• Dell | Cloudera solution for Apache Hadoop
• 53 nodes; plan to store up to 10 years (~10PB)
• Capturing & processing data from each phase of manufacturing process

Page 27: Ask Bigger Questions with Cloudera and Apache Hadoop - Big Data Day Paris 2013

The quant risk LOB within a multinational bank saves millions through better risk exposure analysis & fraud prevention.

Ask Bigger Questions: How can we prevent fraud?

Page 28: Ask Bigger Questions with Cloudera and Apache Hadoop - Big Data Day Paris 2013

Cloudera delivers savings through fraud prevention

Quant risk LOB in multinational bank saves millions through better risk exposure analysis & fraud prevention.

The Challenge:
• Fraud detection is a cumbersome, multi-step analytic process requiring data sampling
• 2B transactions/month necessitate constant revisions to risk profiles
• Highly tuned 100TB Teradata DW drives over-budget capital reserves & lower investment returns

The Solution:
• Cloudera Enterprise data factory for fraud prevention, credit & operational risk analysis
• Look at every incidence of fraud for 5 years for each person
• Reduced costs; expensive CPU no longer consumed by data processing

Page 29: Ask Bigger Questions with Cloudera and Apache Hadoop - Big Data Day Paris 2013

BlackBerry eliminates data sampling & simplifies data processing for better, more comprehensive analysis.

Ask Bigger Questions: How do we retain customers in a competitive market?

Page 30: Ask Bigger Questions with Cloudera and Apache Hadoop - Big Data Day Paris 2013

Cloudera delivers ROI through storage alone

BlackBerry can analyze all their data vs. relying on 1% sample for better network capacity trending & management.

The Challenge:
• BlackBerry Services generates 0.5PB (50-60TB compressed) data per day
• RDBMS is expensive – limited to 1% data sampling for analytics

The Solution:
• Cloudera Enterprise manages global data set of ~100PB
• Collecting device content, machine-generated log data, audit details
• 90% ETL code base reduction
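As a rough back-of-the-envelope check on the figures quoted for this case study, the short Python sketch below works out the compression ratio implied by the slide and the daily volume a 1% sample would cover. Only the three input figures come from the slide. The decimal unit conversion (1 PB = 1000 TB), the reading of the 1% sample as a share of each day's raw volume, and the function names are assumptions made for illustration.

```python
# Figures quoted on the slide. Decimal units are an assumption: 1 PB = 1000 TB.
RAW_TB_PER_DAY = 0.5 * 1000        # "0.5PB ... data per day"
COMPRESSED_TB_PER_DAY = (50, 60)   # "50-60TB compressed"
SAMPLE_FRACTION = 0.01             # "1% data sampling"


def compression_ratio(raw_tb: float, compressed_tb: float) -> float:
    """Return how many times smaller the compressed data is than the raw data."""
    return raw_tb / compressed_tb


def sampled_tb_per_day(raw_tb: float, fraction: float) -> float:
    """Return the raw volume covered per day when only a fraction is sampled.

    Applying the sample to the daily volume is an illustrative assumption.
    """
    return raw_tb * fraction


# One ratio per end of the quoted 50-60TB range.
ratios = [compression_ratio(RAW_TB_PER_DAY, c) for c in COMPRESSED_TB_PER_DAY]
sampled = sampled_tb_per_day(RAW_TB_PER_DAY, SAMPLE_FRACTION)

print(f"implied compression ratio: {min(ratios):.1f}x to {max(ratios):.1f}x")
print(f"raw data covered by a 1% sample: {sampled:.0f} TB per day")
```

Under these assumptions the quoted range implies that compression shrinks the data by roughly an order of magnitude, while a 1% sample leaves about 99% of each day's raw volume unanalyzed.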

Page 31: Ask Bigger Questions with Cloudera and Apache Hadoop - Big Data Day Paris 2013

A global retailer’s customers benefit from more personalized communications and offers based on interactions across all channels.

Ask Bigger Questions: How can we offer customers the best experience?

Page 32: Ask Bigger Questions with Cloudera and Apache Hadoop - Big Data Day Paris 2013

Cloudera optimizes the DW for improved ROI

Global retailer’s customers benefit from more personalized communications based on interactions across all channels.

The Challenge:
• Need to correlate online/offline data across disparate, costly legacy DWs
• Takes up to 4 weeks to get data from one group – inhibits productivity

The Solution:
• Cloudera Enterprise with Impala: 1PB over 250 nodes
• Consolidated platform for Big Data with single environment for query and machine learning

Page 33: Ask Bigger Questions with Cloudera and Apache Hadoop - Big Data Day Paris 2013

Any Questions, Big or Small?