BI, Hive or Big Data Analytics?

© 2012 Datameer, Inc. All rights reserved.

Upload: datameer

Post on 15-May-2015

Category: Technology

DESCRIPTION

As more organizations look to Hadoop as the technology solution for big data analytics, common questions arise. Join us for this case study of an online services provider's experience with big data and how they answered these questions:
* What does big data analytics do that my existing BI software doesn't?
* Will Hadoop replace my data warehouse?
* What about Hive?

TRANSCRIPT

Page 1: BI, Hive or Big Data Analytics?

BI, Hive or Big Data Analytics?

Page 2: BI, Hive or Big Data Analytics?

View the Recording of these Slides!

You can view the full recording of this on-demand webinar with slides at:

http://info.datameer.com/Slideshare-BI-Hive-Big-Data-Analytics.html

Page 3: BI, Hive or Big Data Analytics?

About our Speaker

Todd Nash

Todd is a founding Principal at CBIG Consulting, a professional services firm that helps clients leverage their data assets to produce timely, effective business strategies and tactical decisions. Todd leads CBIG's eastern region consulting practice in the development, implementation, and execution of business intelligence and Big Data methodologies, cloud-based analytics strategies, and complex data warehousing solutions.

Todd graduated from Clemson University with a Bachelor of Science degree in Management Information Systems.

Page 4: BI, Hive or Big Data Analytics?

About our Speaker

Eduardo Rosas

Eduardo Rosas is Vice President of Services at Datameer and brings over 12 years of software implementation experience to the table.

In this role, Eduardo is focused on delivering a repeatable, high-quality level of service and support to help clients achieve their goals.

Prior to Datameer, Eduardo spent 11 years at Trintech, where he focused on managing a team of Technical Consultants and implementing global Java web-based solutions. Eduardo is originally from San Jose, CA and graduated from Santa Clara University.

Page 5: BI, Hive or Big Data Analytics?

Agenda

•  Problem Statement – Business & Technical
•  POC Technical Solution – High-level and Detailed
•  Results
•  Lessons Learned

Copyright © 2013 CBIG Consulting

Page 6: BI, Hive or Big Data Analytics?

PROBLEM STATEMENT

Page 7: BI, Hive or Big Data Analytics?

Business Problem Statement

[Diagram: SEARCH, IMPRESSIONS, CLICK-THRU, LEAD]

Breadth:
•  Searches to Impressions to Click-Thru to Leads
•  Website optimization
•  Customer optimization & upgrades
•  Market optimization

Depth:
•  Can the search criteria be optimized?
•  Conversion of impressions based on refinement of search?
•  Which product mix of impressions gets the greatest click-thru?
•  What is the impact of amenities on leads?
•  What additional features get used to convert to leads?

A real estate .com business makes money in two ways:
1. Property owners advertise properties
2. Ancillary businesses advertise services

This site needs the analytics to show customers the return on their investment.
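As a rough illustration of the search-to-lead funnel described above, the ratios between adjacent stages could be computed as follows. This is a minimal sketch, not the presenters' implementation: the `FunnelCounts` class, the `funnel_ratios` function, and the sample counts are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FunnelCounts:
    """Hypothetical event counts for one advertiser over one period."""
    searches: int
    impressions: int
    click_thrus: int
    leads: int


def _ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def funnel_ratios(counts: FunnelCounts) -> dict:
    """Compute the ratio between each pair of adjacent funnel stages."""
    return {
        "impressions_per_search": _ratio(counts.impressions, counts.searches),
        "click_thru_rate": _ratio(counts.click_thrus, counts.impressions),
        "lead_rate": _ratio(counts.leads, counts.click_thrus),
    }


if __name__ == "__main__":
    # Made-up numbers, used only to exercise the function.
    sample = FunnelCounts(searches=1000, impressions=5000, click_thrus=250, leads=25)
    for name, value in funnel_ratios(sample).items():
        print(f"{name}: {value:.3f}")
```

Ratios of this kind are one simple way to express the "return on investment" view the slide asks for; the depth questions would need the same ratios broken down by search criteria, product mix, or amenities.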

Page 8: BI, Hive or Big Data Analytics?

Technical Problem Statement

•  Search & Impressions volume is too large to build a cube and provide deep analytics
•  This has a negative impact on all reporting and on the performance of the entire system
•  The business is unable to determine the value of all the data and has requests to add more
•  Evaluating options to expand the environment or look for alternatives
•  POC to evaluate how Hadoop, the Amazon cloud and Datameer could support the challenge

[Diagram labels, in order: Sources (Lookup Data, Master Data, Search, Web Activity); Data Movement; ODS; Data Movement; EDW; Data Movement; Search Cube, Sales Cube, Marketing Cube, Service; Search & Impression]

Page 9: BI, Hive or Big Data Analytics?

Technical Problem Statement

[Diagram labels, in order: Sources (Lookup Data, Master Data, Search, Web Activity); Data Movement; ODS; Data Movement; EDW; Data Movement; Sales Cube, Marketing Cube, Service; Search & Impression; Search]

Page 10: BI, Hive or Big Data Analytics?

Problem Statement – Success Criteria

Objective: To prove that the Hadoop architecture is an excellent option for the business to interact with large data and find datasets and relationships that require deeper analytics.

Original Scope & Goals:
•  Bring in one year's worth of data from 6 tables into the Amazon Cloud Hadoop environment.
•  IT resources will be able to extract the data from these tables and load them into .CSV files.
•  The success criteria for this stream of work will be:
   ✓  Amazon Hadoop cloud environment & account is set up.
   ✓  Search Analytics data is loaded into the Amazon Hadoop cloud.
   ✓  Business is able to execute and perform analytics on Search Analytics data that is stored in Hadoop with acceptable performance.
   ✓  Gain analytical insights with the new solution.

Page 11: BI, Hive or Big Data Analytics?

POC TECHNICAL SOLUTION

Page 12: BI, Hive or Big Data Analytics?

POC Technical Solution – High Level

[Diagram labels, in order: Web Activity History, Lookup Data; Amazon Web Services (Cloud): AWS S3, AWS EMR (Hadoop); Datameer (Data Discovery); Web Portal (Widget-Based UI)]

Page 13: BI, Hive or Big Data Analytics?

POC Technical Solution – Detailed

[Diagram labels:
Source tables: WebVisit, WebSearch, WebLead, Web Impressions, WebClicks, AllLeads, Phone Leads, Other Leads, LR Apts IMPS, Generic Activity, EmailLeads, Affiliate, Container Type, Email Type, Event Type, Lead Type, SearchType, Property List, Product ID, PhoneType, PageType, Site, SubSite
Data Movement
Amazon Cloud, S3: the same set of tables
Data Movement
Hadoop: Data Workbooks (AllLeads, WebClicks, Web Impressions, WebLeads, WebSearch, WebVisits); Use Case Workbooks (Use Case 1, Use Case 2); Additional Data Workbooks; Additional Use Cases]

Page 14: BI, Hive or Big Data Analytics?

RESULTS

Page 15: BI, Hive or Big Data Analytics?

POC Results

Success criterion: Hadoop, Amazon, Datameer environment set up
Result: Environment set up within the first couple of days

Success criterion: Able to load one year's worth of data (nearly 1.3 TB)
Result: Loaded significantly more data than planned, for more robust analytics

Success criterion: Business able to execute and perform analytics
Result: Business leveraged Datameer to execute use cases; executed ~20 additional use cases without IT help

Success criterion: Users provided with acceptable performance
Result: Queries executed to completion. Some took seconds, some took minutes and some had to run overnight.

Success criterion: Gain new insights
Result: First time able to run these analytics. Found patterns and relationships contrary to assumptions. Will be updating service offerings & marketing plans because of the POC.

Page 16: BI, Hive or Big Data Analytics?

LESSONS LEARNED

Page 17: BI, Hive or Big Data Analytics?

Lessons Learned

GETTING DATA TO HADOOP
•  Hadoop is a file structure: finding the right delimiter
•  Integrating data: requires ETL
•  Data cleansing can be big: several iterations required

PEOPLE
•  Remember change management: education on new methods & tools

HADOOP
•  Hadoop is batch: answers one thing at a time
•  Analytics: move to a database with tools

CLOUD
•  Cloud is flexible: easy setup and scaling
•  Performance & sizing: sizing the cloud is challenging
•  Cost for performance: TBs with support become costly
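The "finding the right delimiter" lesson above can be illustrated with a small check run before exporting table extracts to flat files. This is a minimal sketch under stated assumptions: the candidate list, the function names, and the sample rows are hypothetical, and the output is written without CSV quoting on the assumption that the Hadoop-side reader splits on the delimiter and does not interpret quotes.

```python
import csv
import io

# Candidate delimiters, tried in order. The list and its order are assumptions.
CANDIDATES = [",", "|", "\t"]


def pick_safe_delimiter(rows, candidates=CANDIDATES):
    """Return the first candidate that never appears inside any field."""
    for delimiter in candidates:
        if not any(delimiter in str(field) for row in rows for field in row):
            return delimiter
    raise ValueError("every candidate delimiter occurs in the data")


def to_delimited_text(rows, delimiter):
    """Serialize rows without quoting, one record per line."""
    buffer = io.StringIO()
    writer = csv.writer(
        buffer,
        delimiter=delimiter,
        quoting=csv.QUOTE_NONE,  # no quote characters in the output
        escapechar="\\",         # required by csv when quoting is disabled
        lineterminator="\n",
    )
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    # Made-up property rows: one field contains a comma, another a pipe.
    rows = [
        [101, "2 bed, 2 bath", "Pool|Gym"],
        [102, "Studio", "Parking"],
    ]
    delimiter = pick_safe_delimiter(rows)
    print(repr(delimiter))
    print(to_delimited_text(rows, delimiter), end="")
```

Scanning the data first is slower than picking a delimiter by convention, but it surfaces collisions before the load rather than during the cleansing iterations the slide mentions.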

Page 18: BI, Hive or Big Data Analytics?

So what about open source tools like Hive?

Page 19: BI, Hive or Big Data Analytics?

Hive…

Prerequisites
•  Must have data in Hadoop
•  The data must be CLEAN
•  Schema must be applied to the data by creating a Hive table

Goal of Hive
•  Eases the complexity of writing MapReduce jobs by giving the technical user a more familiar, SQL-based set of tools

Who can use Hive?
•  SQL users can pick up HQL basics fairly quickly

Page 20: BI, Hive or Big Data Analytics?

What is Hive really good at?

•  Hive is good in environments where clean, prepared data that doesn't change often is already in Hadoop
•  Resembles a language that many IT folks are already familiar with
•  Hive can help a user trying to identify a reporting trend
•  User-defined functions (UDFs) can be used to reuse logic

Page 21: BI, Hive or Big Data Analytics?

-- Start of Hive script

-- Create a TEMP housing table
CREATE EXTERNAL TABLE MY_TABLE (
  num_ods string, num_bus_id int, um_ctry_cd int, prod_id string,
  rng_svc_cd string, rng6 string, bin string, bin_bus_id_enr int,
  bin_ctry_cd int, cd_fmt_a_2 string, cd_enr string, rsn_us_ind string,
  x_bus_id int, flg_enr string, my_dt string, user_id string,
  mthd_cd_enr string, tran_seq_id string, cd_enr2 string, us_amt string,
  moto_cd string, fee_curr_cd int, fee_desc_num string, fee_sgn_amt string,
  us_fee_sgn_amt string, mkt_spec string, catg_cd int, city_enr string,
  ctry_cd_enr int, dba_id int, nm_dscrptr string, geo_id int,
  geo_phone_num string, tier_cd string, msa string, nrmlzd_id int,
  pstl_cd string, b_st_cd_enr string, b_store_id string, b_vrfcn_val string,
  ntwrk_id int, site string, entry_mode_cd string, term_cpbty_cd string,
  sub_typ_cd string, dt string, id_num_enr int, prod_num int,
  prod_ppd_sub_typ_cd string, prod_typ_cd_enr string, prod_typ_ext_enr string,
  promo_cd string, promo_typ string, rwds_pgm_id_enr string, tran_cd string,
  tran_gmt_dt string, tran_gmt_tm string, tran_id string,
  unfrzn_acct_num_bus_id_enr int, unfrzn_arn_bin_bus_id_enr int,
  usage_cd_enr string, Other_amt string, curr_cd int
)
COMMENT "THIS IS MY TEMP TABLE";

-- Insert data into MY_TABLE
INSERT OVERWRITE TABLE MY_TABLE
SELECT *,
  SUM(us_tran_amt) AS SALES_VOL,
  SUM(US_FEE_SGN_AMT) AS US_FEE_SGN_AMT,
  COUNT(*) AS TRAN_COUNT,
  MIN(ACTIVE_DT) AS FIRST_ACTIVE_DT,
  MAX(SEARCH_DT) AS LAST_SEARCH_DT,
  MAX(customer_biz_id) AS customer_biz_id,
  MAX(PGM_ID_ENR) AS PGM_ID_ENR,
  MAX(CUST_PROD_ID) AS CUST_PROD_ID,
  MAX(POD_ID_NUM_ENR) AS POD_ID_NUM_ENR,
  MAX(PROD_TYPE) AS PROD_TYPE,
  MAX(SUB_TYPE) AS SUB_TYPE,
  1 AS ID
FROM MY_TABLE
WHERE dt LIKE '2012%'
GROUP BY customer_biz_id, PGM_ID_ENR, CUST_PROD_ID,
  eci_moto_cd, catg_cd, city_enr, ctry_cd_enr, pstl_cd, pod_id, prod_num, SUB_TYPE;

-- Create TEMP lookup table
CREATE EXTERNAL TABLE MY_LOOKUP (
  acct_num bigint, acct_sta_cd string, acct_zip_cd string, rwrd_pgm_id string,
  pgm_ref_cd string, acct_prod_id string, bus_id int, bin int,
  status string, pgm_eff_dt string, dt string
)
COMMENT "THIS IS TEMP LOOKUP TABLE";

-- Insert data into it
INSERT OVERWRITE TABLE MY_LOOKUP
SELECT *, 1 AS cmf_ind
FROM LOOKUP
WHERE DT = '201211';

-- Do a full outer join
SELECT *
FROM MY_TABLE mt
FULL OUTER JOIN MY_LOOKUP ml
ON mt.member_id = ml.member_id;

Some troubles

•  No way to get data into Hadoop
•  No data validation / may throw data away
•  Security
•  Sharing code across teams is a challenge
•  No visualization

Page 22: BI, Hive or Big Data Analytics?

… but it's free, right?

"Time to create Hive":
•  Any machine-generated data (or anything semi-structured or unstructured) must first be parsed by writing MapReduce or Pig/Python programs. This is a time-to-market disadvantage.
•  Table definition is a manual effort (though this can be made easier by third-party tools).

"Time to maintain Hive":
•  Hive data models (tables) are most likely static, shared objects maintained and controlled by a few people who own the schema.
•  Hive is also more of a black box for new employees coming in (so employee churn creates more maintenance effort).

Cost to implement Hive:
•  This is mostly down to human capital (expensive developers), and don't forget the prerequisite cost of implementing the data ingestion stage of the pipeline (populating the warehouse by writing MapReduce or other programs that parse and load the data).

Page 23: BI, Hive or Big Data Analytics?

Business decision

•  Do I train my engineers on a language, or eliminate the need for this by taking the problem directly to the business user?

Page 24: BI, Hive or Big Data Analytics?

So what would my Hive resource need to know?

•  HiveQL (a different dialect from ANSI-standard SQL)
•  MapReduce tuning parameters (to name a few):
   •  Data block size
   •  Number of mappers/reducers
   •  Compression at the map output level; result compression; which codec to use
   •  io.sort.factor
•  Access to Hive is mainly done via a command-line interface
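To make the block-size and mapper-count items above concrete: for splittable input, MapReduce typically launches about one map task per input split, and the split size commonly defaults to the block size. The following is a rough sketch under that assumption. The function name is hypothetical, the 1.3 TB figure is the approximate POC data volume reported earlier in the deck (taken here as decimal terabytes), and real task counts also depend on file layout, compression codec, and input format.

```python
MIB = 1024 ** 2


def estimate_map_tasks(input_bytes: int, block_size_bytes: int) -> int:
    """Estimate map tasks as ceil(input size / block size) for splittable input."""
    if input_bytes < 0 or block_size_bytes <= 0:
        raise ValueError("input_bytes must be >= 0 and block_size_bytes must be > 0")
    # Integer ceiling division avoids floating-point rounding on large sizes.
    return -(-input_bytes // block_size_bytes)


if __name__ == "__main__":
    total_bytes = 1_300_000_000_000  # roughly 1.3 TB, an approximation
    for block_mib in (64, 128, 256):
        tasks = estimate_map_tasks(total_bytes, block_mib * MIB)
        print(f"block size {block_mib} MiB -> about {tasks} map tasks")
```

Doubling the block size roughly halves the number of map tasks, which is one reason block size appears on the list of parameters a Hive developer would need to understand.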

Page 25: BI, Hive or Big Data Analytics?

How does Datameer do it differently?

Page 26: BI, Hive or Big Data Analytics?

Questions and Answers

Page 27: BI, Hive or Big Data Analytics?

Online Resources

•  Try Datameer: www.datameer.com
•  Follow us on Twitter: @datameer