Hadoop is dead - long live Hadoop | BiDaTA 2013 Genoa


Upload: larsgeorge

Post on 08-Sep-2014


DESCRIPTION

Keynote during BiDaTA 2013 in Genoa, a special track of the ADBIS 2013 conference. URL: http://dbdmg.polito.it/bidata2013/index.php/keynote-presentation

TRANSCRIPT

Page 1: Hadoop is dead - long live Hadoop | BiDaTA 2013 Genoa

Hadoop is dead, long live Hadoop!

Lars George | EMEA Chief Architect @larsgeorge

A Eulogy and Proclamation

Page 2: Hadoop is dead - long live Hadoop | BiDaTA 2013 Genoa

What the Press Says…

Source: http://blogs.the451group.com/information_management/2012/07/09/hadoop-is-dead-long-live-hadoop/

Page 3: Hadoop is dead - long live Hadoop | BiDaTA 2013 Genoa

Big Data… WTH?

A brief reasoning for Hadoop’s existence.

Page 4: Hadoop is dead - long live Hadoop | BiDaTA 2013 Genoa

[Quote slide; only the attribution survives: Bubble Buddy, Head of IT]

Page 5: Hadoop is dead - long live Hadoop | BiDaTA 2013 Genoa

Big Data – A Misnomer

• Misleading, invites quick assumptions
  • Current challenges are driven by many things, not just the size of data
• ANY company can use the Big Data principles to improve specific business metrics
  • Increased data retention
  • Access to all the data
  • Machine learning for pattern detection, recommendations
• But what has happened to cause all this?

Page 6: Hadoop is dead - long live Hadoop | BiDaTA 2013 Genoa

Explosive Data Growth

1.8 trillion gigabytes of data was created in 2011…

§ More than 90% is unstructured data
§ Approx. 500 quadrillion files
§ Quantity doubles every 2 years

[Chart: gigabytes of data created (in billions), 2005 to 2015, structured vs. unstructured data; y-axis from 0 to 10,000]

Source: IDC 2011
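
As a back-of-the-envelope illustration of the "doubles every 2 years" claim, the sketch below extrapolates from the 2011 figure on this slide. The function name and the assumption of smooth exponential growth are illustrative only and do not come from IDC.

```python
def projected_volume(year, base_year=2011, base_volume=1.8, doubling_years=2.0):
    """Volume in trillions of gigabytes, assuming smooth doubling (an assumption)."""
    return base_volume * 2 ** ((year - base_year) / doubling_years)


# Extrapolate the 2011 figure forward in two-year steps.
for year in (2011, 2013, 2015):
    print(year, round(projected_volume(year), 1))
```

Under these assumptions the 2015 value is four times the 2011 value, which is roughly in line with the scale of the chart's y-axis.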

Page 7: Hadoop is dead - long live Hadoop | BiDaTA 2013 Genoa

The ‘Big Data’ Phenomenon

Big Data Drivers:

§ The proliferation of data capture and creation technologies
§ Increased “interconnectedness” drives consumption (creating more data)
§ Inexpensive storage makes it possible to keep more, longer
§ Innovative software and analysis tools turn data into information

Big Data encompasses not only the content itself, but how it’s consumed.

[Diagram: More Devices, More Consumption, More Content, New & Better Information]

§ Every gigabyte of stored content can generate a petabyte or more of transient data*
§ The information about you is much greater than the information you create

*Source: IDC 2011

Page 8: Hadoop is dead - long live Hadoop | BiDaTA 2013 Genoa

The Current Solutions

Current database solutions are designed for structured data.

§ Optimized to answer known questions quickly
§ Schemas dictate form/context
§ Difficult to adapt to new data types and new questions
§ Expensive at petabyte scale

[Chart: gigabytes of data created (in billions), 2005 to 2015, structured vs. unstructured data; y-axis from 0 to 10,000; annotated "10%"]

Page 9: Hadoop is dead - long live Hadoop | BiDaTA 2013 Genoa

Data Management Strategies Have Stayed the Same

• Raw data on SAN, NAS and tape
• Data moved from storage to compute
• Relational models with predesigned schemas

Page 13: Hadoop is dead - long live Hadoop | BiDaTA 2013 Genoa

Too Much Data, Too Many Sources

• Can’t ingest fast enough
• Costs too much to store
• Exists in different places
• Archived data is lost

Page 17: Hadoop is dead - long live Hadoop | BiDaTA 2013 Genoa

Can’t Use It The Way You Want To

• Analysis and processing take too long
• Data exists in silos
• Can’t ask new questions
• Can’t analyze unstructured data

Page 18: Hadoop is dead - long live Hadoop | BiDaTA 2013 Genoa

The Big Data Challenge

Big Data Contains Limitless Insights…

BUT DEMANDS A NEW APPROACH

[Diagram: VOLUME, VARIETY, VELOCITY; data sources: web logs, social media, transactional data, smart grids, operational data, digital content, R&D data, ad impressions, files]

Page 19: Hadoop is dead - long live Hadoop | BiDaTA 2013 Genoa

Big Data Challenges

Cost-effectively managing the volume, velocity and variety of data

Deriving value across structured and unstructured data

Adapting to context changes and integrating new data sources and types

Page 20: Hadoop is dead - long live Hadoop | BiDaTA 2013 Genoa

Big Data Solution Requirements

Cost-effectively manage the volume, variety and velocity of data

Process and analyze large, complex data sets… quickly

Flexibly adapt to context changes and new data types

Page 21: Hadoop is dead - long live Hadoop | BiDaTA 2013 Genoa

Google’s Approach to Big Data

Hadoop’s Pedigree

Page 22: Hadoop is dead - long live Hadoop | BiDaTA 2013 Genoa

A Timeline View #1

Page 23: Hadoop is dead - long live Hadoop | BiDaTA 2013 Genoa

Google File System

• Foundation of scalable, fail-safe, self-healing storage
  • One central place of truth
• Cost-effective hardware finally available
  • 19” rack servers with a decent amount of disk space
• Handling of failures built in
  • Components or entire servers
  • At scale there are always hardware faults
• Simple file system interface
  • Finally no need for expensive, proprietary systems

Storage

Page 24: Hadoop is dead - long live Hadoop | BiDaTA 2013 Genoa

MapReduce

• First take on a distributed data processing framework
• Same concepts as Google File System, i.e.
  • Fail-safe and scalable
• Handles a wide range of data processing problems
  • BUT not all of them (more later)
• Simple API reading and writing Key/Value pairs
• Framework handles the heavy task of data movement
• Core concept is data locality, heavy I/O
  • Brings code to data, not the opposite (i.e. no HPC)
• Accessible in many programming languages

Processing
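
To make the Key/Value API described on this slide concrete, here is a minimal single-process sketch of a word count. It is not Hadoop's or Google's API: `run_job`, `map_words` and `reduce_counts` are hypothetical names, and the in-memory grouping only stands in for the shuffle that a real framework performs across machines.

```python
from collections import defaultdict


def map_words(_key, line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word, counts):
    """Reduce: sum all counts that were emitted for one word."""
    yield word, sum(counts)


def run_job(records, mapper, reducer):
    """Toy stand-in for the framework: map, group by key, then reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)  # "shuffle": group values by key
    results = {}
    for out_key in sorted(groups):
        for final_key, final_value in reducer(out_key, groups[out_key]):
            results[final_key] = final_value
    return results


# Input records are (key, value) pairs; here the key is just a line number.
lines = [(0, "hadoop is dead"), (1, "long live hadoop")]
word_counts = run_job(lines, map_words, reduce_counts)
print(word_counts)
```

The user only writes the two small functions; everything in `run_job` is what the framework would take over, including moving data between the map and reduce phases.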

Page 25: Hadoop is dead - long live Hadoop | BiDaTA 2013 Genoa

BigTable

• Adds database-like random access to data
• Effectively a Key/Value store with table semantics
• Used for small data points
  • Usually less than a megabyte per Key/Value
• Forfeits advanced concepts for ease of scalability
  • No transactions, no query language
• Powers many applications at Google
• Uses Google File System as storage layer
• Tight integration with MapReduce for batch processing

Random Access
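
The idea of a "Key/Value store with table semantics" can be sketched as a sorted map keyed by (row, column). The class below is a hypothetical toy, not BigTable's actual interface: it keeps everything in memory, ignores versions and timestamps, and its names (`TinyTable`, `put`, `get`, `scan`) are chosen for illustration only.

```python
import bisect


class TinyTable:
    """Toy sorted key/value store; the key is a (row, column) pair."""

    def __init__(self):
        self._keys = []    # sorted list of (row, column) tuples
        self._values = {}  # (row, column) -> latest value

    def put(self, row, column, value):
        key = (row, column)
        if key not in self._values:
            bisect.insort(self._keys, key)  # keep keys sorted for range scans
        self._values[key] = value

    def get(self, row, column):
        return self._values.get((row, column))

    def scan(self, start_row, stop_row):
        """Yield (row, column, value) for start_row <= row < stop_row, in key order."""
        index = bisect.bisect_left(self._keys, (start_row,))
        while index < len(self._keys) and self._keys[index][0] < stop_row:
            row, column = self._keys[index]
            yield row, column, self._values[(row, column)]
            index += 1


table = TinyTable()
table.put("user2", "info:name", "Bob")
table.put("user1", "info:name", "Alice")
table.put("user1", "info:email", "alice@example.com")
table.put("user3", "info:name", "Carol")
rows = list(table.scan("user1", "user3"))
print(rows)
```

Keeping the keys sorted is what makes both random access by key and cheap range scans possible without a query language.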

Page 26: Hadoop is dead - long live Hadoop | BiDaTA 2013 Genoa

Dremel, Tenzing, Pregel

• Dremel adds a specific file format and query language
  • Used for highly selective queries, data exploration
  • File layout is optimized for very effective scanning
  • Runs alongside MapReduce and File System
• Tenzing adds SQL over various data sources
  • Can query raw files, Dremel files, or BigTable data etc.
  • Brings “known” paradigm to stored data
• Pregel adds a graph processing API

Query API

Page 27: Hadoop is dead - long live Hadoop | BiDaTA 2013 Genoa

Percolator, Megastore

• Additions to BigTable to add “missing” features
• Percolator uses BigTable to update the search index incrementally, which needs transactions
  • Distributes updates with multi-phase commits
• Megastore drives Google App Engine and also adds transactions for the user API
  • Uses ranges of rows as entity groups
  • Reduces locking to small subsets
  • Optimistic, roll-forward only transactions
  • Java layer over BigTable API

Transactions

Page 28: Hadoop is dead - long live Hadoop | BiDaTA 2013 Genoa

Spanner, F1

• Future of Google’s distributed storage and processing system
• Spanner is a scalable, multi-version, globally-distributed, and synchronously-replicated database
  • Replicates across datacenters
  • Uses TrueTime (atomic clocks) for synchronization
  • Uses Colossus for storage (a GFS successor)
• F1 replaced MySQL for AdWords service
  • SQL over data stored in Spanner
  • Colocated with Spanner processes

World-Wide Data

Page 29: Hadoop is dead - long live Hadoop | BiDaTA 2013 Genoa

The Hadoop Story

A Eulogy

Page 30: Hadoop is dead - long live Hadoop | BiDaTA 2013 Genoa

What is Apache Hadoop?

Apache Hadoop is an open source platform for data storage and processing that is…

✓ Scalable
✓ Fault tolerant
✓ Distributed

CORE HADOOP SYSTEM COMPONENTS

Hadoop Distributed File System (HDFS): Self-Healing, High Bandwidth Clustered Storage
MapReduce/YARN: Distributed Computing Framework

Has the Flexibility to Store and Mine Any Type of Data

§ Ask questions across structured and unstructured data that were previously impossible to ask or solve
§ Not bound by a single schema

Excels at Processing Complex Data

§ Scale-out architecture divides workloads across multiple nodes
§ Flexible file system eliminates ETL bottlenecks

Scales Economically

§ Can be deployed on commodity hardware
§ Open source platform guards against vendor lock-in

Page 31: Hadoop is dead - long live Hadoop | BiDaTA 2013 Genoa

Core Hadoop: HDFS

Self-healing, high bandwidth clustered storage.

[Diagram: a file split into blocks 1 to 5; each block is stored on three of the five nodes]

HDFS breaks incoming files into blocks and stores them redundantly across the cluster.
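
The block-and-replica idea in the diagram can be sketched in a few lines. This is an illustration under simplifying assumptions, not HDFS code: the function names are hypothetical, the block size is tiny on purpose, and replicas are placed round-robin, whereas real HDFS placement also takes racks and node load into account.

```python
def split_into_blocks(data, block_size):
    """Cut a byte string into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin (a simplification)."""
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + offset) % len(nodes)] for offset in range(replication)
        ]
    return placement


# A 450-byte "file" with 100-byte blocks on a five-node cluster (made-up numbers).
blocks = split_into_blocks(b"x" * 450, block_size=100)
placement = place_replicas(len(blocks), ["node1", "node2", "node3", "node4", "node5"])
for block_id, holders in placement.items():
    print(block_id, len(blocks[block_id]), holders)
```

With three copies of every block, losing a single node still leaves at least two copies of each block, which is what allows the file system to re-replicate and "heal" itself.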

Page 32: Hadoop is dead - long live Hadoop | BiDaTA 2013 Genoa

Core Hadoop: MapReduce

Distributed computing framework.

[Diagram: MapReduce (MR) running across the five nodes that hold blocks 1 to 5]

Processes large jobs in parallel across many nodes and combines the results.

Page 33: Hadoop is dead - long live Hadoop | BiDaTA 2013 Genoa

Why Hadoop Was Created

Exploding Data Volumes & Types Driving The Need For A Flexible, Scalable Solution

BIG DATA

• Any Kind
• From Any Source
• Structured & Unstructured
• At Scale

[Data sources: web logs, social media, transactional data, smart grids, operational data, digital content, R&D data, ad impressions, files]

HARD PROBLEMS

• Deep Analysis
• Exhaustive & Detailed
• Sophisticated Algorithms
• Generate Results Quickly

It’s difficult to handle data this diverse, at this scale. Traditional platforms can’t keep pace.

NEW OPPORTUNITIES

• Extract More Value
• From More Data
• More Cost Effectively
• With Greater Flexibility

New opportunities to derive value from all your data.

Page 34: Hadoop is dead - long live Hadoop | BiDaTA 2013 Genoa

The Core Values of Hadoop

A platform for

§  Designed to store and process data at petabyte scale

§  Scale-out architecture increases capacity and processing power linearly

§  Perform operations in parallel across the entire cluster

§  Store data in any format – free from rigid schemas

§  Define context at the time you ask the question

§  Process and analyze data using virtually any programming language

§  Build out your cluster on your hardware of choice

§  Open source software guards against vendor lock-in

§  Wide integration ensures investment protection


Page 35: Hadoop is dead - long live Hadoop | BiDaTA 2013 Genoa

Hadoop In Practice

Page 36: Hadoop is dead - long live Hadoop | BiDaTA 2013 Genoa

Cloudera Software Stack

Turnkey solution for Big Data and Advanced Analytics use-cases

CDH: 100% OPEN SOURCE HADOOP DISTRIBUTION
  CORE PROJECTS: HDFS, MAPREDUCE, FLUME, HCATALOG, HIVE, HUE, MAHOUT, OOZIE, PIG, SQOOP, WHIRR, ZOOKEEPER
  PREMIUM PROJECTS: HBASE, IMPALA, SEARCH (BETA)
  CONNECTORS: MICROSTRATEGY, NETEZZA, ORACLE, QLIKVIEW, TABLEAU, TERADATA

CLOUDERA MANAGER: END-TO-END SYSTEM MANAGEMENT
  DEPLOYMENT, MONITORING, API, SNMP, CONFIG ROLLBACKS, PHONE HOME, SERVICE MGMT, DIAGNOSTICS, ROLLING UPGRADES, LDAP, REPORTING, BACKUP/DR

CLOUDERA NAVIGATOR: END-TO-END DATA MANAGEMENT
  ACCESS MGMT, DATA AUDIT

CLOUDERA SUPPORT: BEST-IN-CLASS TECHNICAL SUPPORT, COMMUNITY ADVOCACY & INDEMNIFICATION

[Legend: CORE HADOOP PROJECTS, CLOUDERA MANAGER, CLOUDERA NAVIGATOR, HBASE, IMPALA]

Page 37: Hadoop is dead - long live Hadoop | BiDaTA 2013 Genoa

Spin some YARN!

Reborn again!

Page 38: Hadoop is dead - long live Hadoop | BiDaTA 2013 Genoa

Back to the Press again…

Source: http://gigaom.com/2012/07/07/why-the-days-are-numbered-for-hadoop-as-we-know-it/

Page 39: Hadoop is dead - long live Hadoop | BiDaTA 2013 Genoa

A Timeline View #2

Page 40: Hadoop is dead - long live Hadoop | BiDaTA 2013 Genoa

First: What is MapReduce 1?

Page 41: Hadoop is dead - long live Hadoop | BiDaTA 2013 Genoa

Motivations to Change MR1

• Scaling >4000 nodes
  • Fewer, larger clusters
  • No single source of truth, data in “silos” again
• HA of Job Tracker difficult
  • Large, complex state
• Poor resource utilization
  • Slots in MR1 are for either map or reduce

Page 42: Hadoop is dead - long live Hadoop | BiDaTA 2013 Genoa

YARN: Yet Another Resource Negotiator

Page 43: Hadoop is dead - long live Hadoop | BiDaTA 2013 Genoa

Split of Responsibilities

The Job Tracker is split into:

Resource Manager

• One per cluster
• Long-lived
• App-level

Application Master

• One per app instance
• Short-lived
• Task-level scheduling and monitoring

Page 44: Hadoop is dead - long live Hadoop | BiDaTA 2013 Genoa

Fine-grained Resource Control

• Node Manager is a generalized Task Tracker
• Task Tracker
  • Fixed number of map and reduce slots
• Node Manager
  • Containers with variable resource limits
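
The difference between fixed slots and variable containers can be shown with a toy scheduler for a single node. This is a sketch with made-up numbers and hypothetical function names, not the MR1 or YARN scheduling code; real YARN schedulers also consider CPU cores, queues and locality.

```python
def schedule_with_slots(tasks, map_slots, reduce_slots):
    """MR1-style: a task may only use a free slot of its own kind."""
    free = {"map": map_slots, "reduce": reduce_slots}
    started = []
    for kind, _memory_mb in tasks:
        if free[kind] > 0:
            free[kind] -= 1
            started.append(kind)
    return started


def schedule_with_containers(tasks, node_memory_mb):
    """YARN-style: any task fits as long as its requested memory is still free."""
    free_mb = node_memory_mb
    started = []
    for kind, memory_mb in tasks:
        if memory_mb <= free_mb:
            free_mb -= memory_mb
            started.append(kind)
    return started


# Six pending map tasks of 1024 MB each on one node (all numbers are made up).
pending = [("map", 1024)] * 6
slots_started = schedule_with_slots(pending, map_slots=4, reduce_slots=2)
containers_started = schedule_with_containers(pending, node_memory_mb=6 * 1024)
print(len(slots_started), len(containers_started))
```

In the slot model the two reduce slots sit idle while map tasks wait; with containers the same node capacity can be handed to whatever work is pending.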

Page 45: Hadoop is dead - long live Hadoop | BiDaTA 2013 Genoa

Node Manager: Containers

Page 46: Hadoop is dead - long live Hadoop | BiDaTA 2013 Genoa

YARN + MapReduce 2

• YARN “runs” MapReduce as an application
  • MR is user space
  • YARN is kernel

Page 47: Hadoop is dead - long live Hadoop | BiDaTA 2013 Genoa

YARN Applications

• Distributed shell
• Open MPI
• Master-worker
• Apache Giraph, Hama
• Spark

Page 48: Hadoop is dead - long live Hadoop | BiDaTA 2013 Genoa

Summary

What the future may hold

Page 49: Hadoop is dead - long live Hadoop | BiDaTA 2013 Genoa

Enterprise Data Evolution

[Chart: amount of data plotted against business impact across three eras (1980s, 2000s, 2010s). Labels: RDBMS/EDW; Hadoop-optimized infrastructure; next-gen data computing platform; data-driven organization; improve operational efficiency; create competitive advantage]

• Data collection & reporting

• Process data faster
• Store data more cost-effectively
• Simplify infrastructure

• Combine data from across the business
• Ask new questions immediately
• Enable new real-time applications

Page 50: Hadoop is dead - long live Hadoop | BiDaTA 2013 Genoa

Playing Catch-up

• Improve overall performance
  • Google’s code is kernel module, C++, as low as possible
  • Hadoop is Java, for ease of development in open-source
  • Maybe rewrite parts of the stack?
  • Overall goal: saturate machine specs (I/O, CPU, RAM)
• Add missing features
  • Everything is based on “hearsay”, aka research papers and presentations
  • Add what is necessary or for the sake of it?

Page 51: Hadoop is dead - long live Hadoop | BiDaTA 2013 Genoa

Further Extend or Invent?

• YARN is a good example of what can be done
• Look at every component and evaluate
• Work with research and universities, companies to drive new development
• What else can be done with all that data?

Page 52: Hadoop is dead - long live Hadoop | BiDaTA 2013 Genoa

[Quote slide; only the attribution survives: Jim Gray, Computer Scientist]

Page 53: Hadoop is dead - long live Hadoop | BiDaTA 2013 Genoa

From Framework to Platform to Commodity

• Hadoop distributions are already a commodity
• Move up the stack to reach commercial space
• Simplify data processing
  • Continuuity
  • WibiData (Kiji)
  • Cloudera CDK
• Pure Hadoop Solutions
  • Datameer
  • Platfora

Page 54: Hadoop is dead - long live Hadoop | BiDaTA 2013 Genoa

Hadoop… live long and prosper!

Page 55: Hadoop is dead - long live Hadoop | BiDaTA 2013 Genoa

Lars George, EMEA Chief Architect, Cloudera @larsgeorge

Thank you!