Houston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera

What is Hadoop, and When Should I Consider Using It? Houston HUG, June 6th, 2011. Vikram Oberoi, Cloudera. Copyright 2011 Cloudera Inc. All rights reserved.

Upload: markkerzner

Post on 06-May-2015

DESCRIPTION

When and why to use Hadoop. Hadoop-able problems and use cases.

TRANSCRIPT

Page 1: Houston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera

What is Hadoop, and When Should I Consider Using It?

Houston HUG, June 6th, 2011

Vikram Oberoi, Cloudera

Copyright 2011 Cloudera Inc. All rights reserved

Page 2: Houston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera

About me

• Data engineer at Cloudera, present
    • Using data and Hadoop to enable more responsive support
• Data engineer at Meebo, Aug ’09 – Nov ’10
    • Data infrastructure, analytics
• CS at Stanford, ’09
    • Senior project: ext3 and XFS under Hadoop MapReduce workloads
• Data engineer at Meebo, ’08
    • Built an A/B testing system
• SDE Intern at Amazon, ’07
    • R&D on item-to-item similarities

Page 3: Houston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera

What will I talk about?

• What is Hadoop?
• Typical Hadoop-able problems and use cases
• Cloudera overview

Page 4: Houston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera

What is Hadoop?

Page 5: Houston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera

Big Data Problem: Exploding Data Volumes

• Online
    • Web-ready devices
    • Social media
    • Digital content
• Enterprise
    • Transactions
    • R&D data
    • Operational (control) data
• Open data initiatives

Chart labels: Relational; Complex, Unstructured

• 2,500 exabytes of new information in 2012 with Internet as primary driver
• Digital universe grew by 62% last year to 800K petabytes and will grow to 1.2 “zettabytes” this year

Source: An IDC White Paper - sponsored by EMC. As the Economy Contracts, the Digital Universe Expands. May 2009.

Page 6: Houston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera

Big Data Problem: Data Economics

• Return on Byte (ROB) = value to be extracted from that byte / cost of storing that byte
• If ROB is < 1, the data will be buried in a tape wasteland; thus we need cheaper active storage.

Chart labels: Low ROB, High ROB
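The Return on Byte ratio above can be worked through with a small Python sketch. The function name `return_on_byte` and the dollar figures are hypothetical and chosen only for illustration; the slide gives the ratio but no numbers.

```python
def return_on_byte(value_extracted: float, storage_cost: float) -> float:
    """Return on Byte = value extracted from the data / cost of storing it."""
    if storage_cost <= 0:
        raise ValueError("storage_cost must be positive")
    return value_extracted / storage_cost


# Hypothetical figures: the same dataset, worth 500 in extracted value,
# stored on an expensive system and on a cheaper one.
expensive_rob = return_on_byte(value_extracted=500.0, storage_cost=1000.0)
cheap_rob = return_on_byte(value_extracted=500.0, storage_cost=100.0)

print(expensive_rob)  # below 1: the data is a candidate for tape
print(cheap_rob)      # above 1: worth keeping in active storage
```

The value of the data does not change between the two calls; only the storage cost does, which is the slide's argument for cheaper active storage.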

Page 7: Houston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera

Hadoop: A Data Platform with Unique Benefits

Diagram labels: MapReduce; Hadoop Distributed File System (HDFS)

• Consolidates Everything
    • Move complex and relational data into a single repository
• Stores Inexpensively
    • Keep raw data always available
    • Use commodity hardware
• Processes at the Source
    • Eliminate ETL bottlenecks
    • Mine data first, govern later

Page 8: Houston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera

Hadoop Distributed File System (HDFS)

“How is data stored?”

• Based on design of Google’s GFS
• Data stored in large files
    • Files can contain any data
• Files separated into blocks
    • 64MB up to 256MB per block (tunable)
• Each block replicated across a cluster (tunable, usually 3 replicas across the cluster)
    • This buys you: fault tolerance, parallelizable disk reads
• Store whatever you want in it
    • This buys you: flexibility
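The block and replica arithmetic described above can be sketched in a few lines of Python. This is a back-of-the-envelope calculation, not a call into any Hadoop API; the helper name `hdfs_footprint` is hypothetical, and the defaults (64 MB blocks, 3 replicas) are the figures quoted on the slide.

```python
import math


def hdfs_footprint(file_size_mb: int, block_size_mb: int = 64, replication: int = 3):
    """Estimate how a single file is laid out across a cluster.

    Returns (logical blocks, block replicas stored, raw storage in MB).
    """
    blocks = math.ceil(file_size_mb / block_size_mb)
    replicas = blocks * replication
    # The final block only occupies the bytes it actually holds,
    # so raw storage is file size times the replication factor.
    raw_storage_mb = file_size_mb * replication
    return blocks, replicas, raw_storage_mb


blocks, replicas, raw_mb = hdfs_footprint(1024)
print(blocks, replicas, raw_mb)
```

With more blocks than one, a job can read different parts of the file from different disks at once, and losing one node still leaves other replicas of each block available.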

Page 9: Houston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera

MapReduce

“How is data processed?”

• Framework designed for parallel processing of large, disk-bound batch jobs
• Data processed at the source
    • File ‘foo’ has 5 blocks, processing happens on 5 nodes
    • Parallelized disk reads → remove disk bottleneck
• Way to express algorithms such that they are parallelizable
• Two functions at the core of every job:
    • Map function (group by)
    • Reduce function (perform action on group)
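The two-function model above can be illustrated with a single-process Python sketch of a word count. It only simulates the programming model in memory: real Hadoop jobs are written against Hadoop's own APIs and run across many nodes. The names `map_fn`, `reduce_fn`, and `run_job` are hypothetical. Note that the map function emits key/value pairs, and it is the framework step between map and reduce that groups the values by key.

```python
from collections import defaultdict


def map_fn(line):
    """Map: emit a (key, value) pair for every word in one input record."""
    for word in line.split():
        yield word.lower(), 1


def reduce_fn(key, values):
    """Reduce: perform an action on all values collected for one key."""
    return key, sum(values)


def run_job(records):
    """Simulate a job: map every record, group by key, reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)  # the "group by" step
    return dict(reduce_fn(key, values) for key, values in groups.items())


counts = run_job(["Hadoop stores data", "Hadoop processes data"])
print(counts)
```

Because each call to `map_fn` looks at one record and each call to `reduce_fn` looks at one key, both phases can be spread across machines without the functions needing to know about each other.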

Page 10: Houston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera

What is Hadoop?

• A scalable fault-tolerant distributed system for data storage and processing (open source under the Apache license)
• Scalable data processing engine
    • Hadoop Distributed File System (HDFS): self-healing high-bandwidth clustered storage
    • MapReduce: fault-tolerant distributed processing
• Key value
    • Flexible -> store data without a schema and add it later as needed
    • Affordable -> cost / TB at a fraction of traditional options
    • Broadly adopted -> a large and active ecosystem
    • Proven at scale -> dozens of petabyte + implementations in production today

Page 11: Houston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera

Cloudera’s Distribution Including Apache Hadoop

The Industry’s Leading Hadoop Distribution

• Open source – 100% Apache licensed and free for download
• Simplified – Component versions & dependencies managed for you
• Integrated – All components & functions interoperate through standard APIs
• Reliable – Patched with fixes from future releases to improve stability
• Supported – Employs project founders and committers for >90% of components

Diagram labels: Hue, Hue SDK, Oozie, HBase, Flume, Sqoop, Zookeeper, Hive, Pig/Hive

Page 12: Houston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera

Typical Hadoop-able problems

Page 13: Houston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera

What is common across Hadoop-able problems?

Nature of the data

• Complex data
• Multiple data sources
• Lots of it

Nature of the analysis

• Batch processing
• Parallelizable

Page 14: Houston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera

What kinds of analyses are possible with Hadoop?

• Text mining
• Index building
• Graph creation and analysis
• Pattern recognition
• Collaborative filtering
• Prediction models
• Sentiment analysis
• Risk assessment

Page 15: Houston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera

Top 10 Hadoop-able Problems

1. Modeling True Risk
2. Customer Churn Analysis
3. Recommendation engines
4. Ad Targeting
5. Point Of Sale Transaction Analysis
6. Analysing Network Data To Predict Failure
7. Threat Analysis/Fraud Detection
8. Trade Surveillance
9. Search Quality
10. Data “Sandbox”

See archived webinar on cloudera.com

Page 16: Houston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera

Example: Modeling True Risk

Page 17: Houston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera

Example: Modeling True Risk

Solution with Hadoop

• Source, parse and aggregate disparate data sources to build comprehensive data picture
    • e.g. credit card records, call recordings, chat sessions, emails, banking activity
• Structure and analyze
    • Sentiment analysis, graph creation, pattern recognition

Typical Industry

• Financial Services (Banks, Insurance)

Page 18: Houston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera

Example: Threat Analysis

Page 19: Houston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera

Example: Threat Analysis

Solution with Hadoop

• Parallel processing over huge datasets
• Pattern recognition to identify anomalies, i.e. threats

Typical Industry

• Security
• Financial Services
• General: spam fighting, click fraud

Page 20: Houston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera

Example: Recommendation Engine

Page 21: Houston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera

Example: Recommendation Engine

Solution with Hadoop

• Batch processing framework
    • Allow execution in parallel over large datasets
• Collaborative filtering
    • Collecting ‘taste’ information from many users
    • Utilizing information to predict what similar users like

Typical Industry

• Ecommerce, Manufacturing, Retail
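The collaborative filtering idea above (collect taste information from many users, then use similar users to predict preferences) can be sketched in plain Python. This is a toy, in-memory illustration and not the method used by any particular product: the choice of Jaccard similarity, the function names `jaccard` and `recommend`, and the sample data are all assumptions made for the example. On Hadoop, the same computation would be expressed as batch jobs over much larger datasets.

```python
def jaccard(a: set, b: set) -> float:
    """Similarity of two users: shared liked items / all items either likes."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(target: str, likes: dict) -> list:
    """Rank items the target has not liked by the similarity of users who did."""
    scores = {}
    for user, items in likes.items():
        if user == target:
            continue
        similarity = jaccard(likes[target], items)
        if similarity == 0.0:
            continue  # ignore users with no taste overlap
        for item in items - likes[target]:
            scores[item] = scores.get(item, 0.0) + similarity
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda item: (-scores[item], item))


# Hypothetical 'taste' data: which items each user liked.
likes = {
    "ann": {"pig", "hive"},
    "bob": {"pig", "hive", "hbase"},
    "cat": {"oozie"},
}
suggestions = recommend("ann", likes)
print(suggestions)
```

Here "bob" shares two liked items with "ann", so his remaining item is suggested, while "cat" has no overlap and contributes nothing.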

Page 22: Houston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera

Example: Analyzing Network Data to Predict Failure

Page 23: Houston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera

Example: Analyzing Network Data to Predict Failure

Solution with Hadoop

• Take the computation to the data
• Expand the range of indexing techniques from simple scans to more complex data mining
• Better understand how the network reacts to fluctuations
    • How previously thought discrete anomalies may, in fact, be interconnected
• Identify leading indicators of component failure

Typical Industry

• Utilities, Telecommunications, Data Centers

Page 24: Houston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera

Example: Supporting Hadoop at Cloudera

• Collect data from customer clusters
    • OS configs, Hadoop configs, command outputs, logs
    • Data served by HBase, used by supporters
• Consolidate data about Hadoop in HDFS
    • Mailing lists, issue trackers, wiki pages, IRC, books
    • Customer cluster data
• Analyze many data sources to understand Hadoop issues and deployments
    • Build tools to enable easier diagnosis or proactive support

Page 25: Houston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera

Cloudera overview

Page 26: Houston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera

Cloudera Offerings

Enabling the Enterprise Adoption of Apache Hadoop

PLATFORM   SUPPORT & APPLICATIONS

PROFESSIONAL SERVICES   TRAINING

Page 27: Houston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera

Contact/Resources/Questions

• [email protected]
• irc.freenode.net #cloudera #hadoop
• @cloudera
• Cloudera Groups: http://groups.cloudera.org
• Hadoop: The Definitive Guide
• 10 Hadoop-able problems on Slideshare
• Questions? (P.S. We’re hiring SA’s in Houston!)