applications on hadoop

24
1 Headline Goes Here Speaker Name or Subhead Goes Here DO NOT USE PUBLICLY PRIOR TO 10/23/12 Building ApplicaCons on Hadoop Mark Grover SoFware Engineer, Cloudera @mark_grover Jfokus 2014 (February 4 th , 2014) ©2014 Cloudera, Inc. All Rights Reserved.

Upload: markgrover

Post on 27-Jan-2015

110 views

Category:

Technology


1 download

DESCRIPTION

Applications on Hadoop talk by Mark Grover at JFokus 2014 on February 4, 2014 in Stockholm, Sweden.

TRANSCRIPT

Page 1: Applications on Hadoop

1

Headline  Goes  Here  Speaker  Name  or  Subhead  Goes  Here  

DO  NOT  USE  PUBLICLY  PRIOR  TO  10/23/12  Building  ApplicaCons  on  Hadoop  

Mark  Grover  SoFware  Engineer,  Cloudera  @mark_grover  Jfokus  2014  (February  4th,  2014)    

©2014 Cloudera, Inc. All Rights Reserved.

Page 2: Applications on Hadoop

Agenda  

•  Brief  intro  to  Hadoop  and  the  ecosystem  • Developing  apps  on  Hadoop  

• What’s  the  current  problem?  •  How  are  we  fixing  it?  

2 ©2014 Cloudera, Inc. All Rights Reserved.

Page 3: Applications on Hadoop

What  is  Apache  Hadoop?  

3

Has  the  Flexibility  to  Store  and  Mine  Any  Type  of  Data  

 §  Ask  quesCons  across  structured  and  

unstructured  data  that  were  previously  impossible  to  ask  or  solve  

§  Not  bound  by  a  single  schema  

Excels  at  Processing  Complex  Data  

 §  Scale-­‐out  architecture  divides  workloads  

across  mulCple  nodes  

§  Flexible  file  system  eliminates  ETL  bo^lenecks  

Scales  Economically  

 §  Can  be  deployed  on  commodity  

hardware  

§  Open  source  pla_orm  guards  against  vendor  lock  

Hadoop  Distributed  File  System  (HDFS)  

 Self-­‐Healing,  High  

Bandwidth  Clustered  Storage  

   

MapReduce    

Distributed  CompuCng  Framework  

Apache Hadoop  is  an  open  source  pla_orm  for  data  storage  and  processing  that  is…  

ü  Scalable  ü  Fault  tolerant  ü  Distributed  

CORE  HADOOP  SYSTEM  COMPONENTS  

©2014 Cloudera, Inc. All Rights Reserved.

Page 4: Applications on Hadoop

4

Kite  SDK  

Developing  apps  on  Hadoop  

©2014 Cloudera, Inc. All Rights Reserved.

Page 5: Applications on Hadoop

5 2

“[I]t’s  not  enough  to  just  build  a  scalable  and  stable  system;  the  system  also  has  to  be  easy  enough  for  thousands  of  internal  developers  of  all  types  and  all  skill  levels  to  use.”  

h^p://gigaom.com/data/how-­‐disney-­‐built-­‐a-­‐big-­‐data-­‐pla_orm-­‐on-­‐a-­‐startup-­‐budget/  

Page 6: Applications on Hadoop

Hadoop  is  incredibly  powerful  

©2014 Cloudera, Inc. All Rights Reserved. 6

Page 7: Applications on Hadoop

Hadoop  is  incredibly  flexible  

©2014 Cloudera, Inc. All Rights Reserved. 7

Page 8: Applications on Hadoop

Hadoop  is  incredibly  low-­‐level  

©2014 Cloudera, Inc. All Rights Reserved. 8

Page 9: Applications on Hadoop

Hadoop  is  incredibly  complex  

©2014 Cloudera, Inc. All Rights Reserved. 9

Page 10: Applications on Hadoop

A  typical  system  (zoom  100:1)  

©2014 Cloudera, Inc. All Rights Reserved. 10

Page 11: Applications on Hadoop

A  typical  system  (zoom  10:1)  

©2014 Cloudera, Inc. All Rights Reserved. 11

Page 12: Applications on Hadoop

A  typical  system  (zoom  5:1)  

©2014 Cloudera, Inc. All Rights Reserved. 12

Page 13: Applications on Hadoop

What  you  actually  care  about  

• Gelng  data  from  A  to  B  • Using  it  later  

©2014 Cloudera, Inc. All Rights Reserved. 13

Page 14: Applications on Hadoop

Infrastructure  details  

•  SerializaCon,  file  formats,  and  compression  • Metadata  capture  and  maintenance  • Dataset  organizaCon  and  parCConing  • Durability  and  delivery  guarantees  • Well-­‐defined  failure  semanCcs  •  Performance  and  health  instrumentaCon  

©2014 Cloudera, Inc. All Rights Reserved. 14

Page 15: Applications on Hadoop

Kite  SDK  

• Make  Hadoop  accessible  to  the  enterprise  developer  • Address  the  most  common  cases  •  Codify  expert  pa^erns  and  pracCces  for  building  data-­‐oriented  systems  and  applicaCons.  

•  Let  developers  focus  on  business  logic,  not  plumbing  or  infrastructure.  

•  Provide  smart  defaults  for  pla_orm  choices.  •  Support  piecemeal  adopCon  via  loosely-­‐coupled  modules  

©2014 Cloudera, Inc. All Rights Reserved. 15

Page 16: Applications on Hadoop

Kite  SDK  

• An  open  source  set  of  libraries,  guides,  and  examples  for  building  data-­‐oriented  systems  and  applicaCons  

•  Provides  higher  level  APIs  atop  exisCng  components  of  CDH  •  Supports  piecemeal  adopCon  via  loosely  coupled  modules  

©2014 Cloudera, Inc. All Rights Reserved. 16

Page 17: Applications on Hadoop

Kite  SDK  •  Data  –  logical  abstracCons  of  records,  datasets  and  repositories  with  implementaCons  for  HDFS  and  HBase  (upcoming)  

•  APIs  to  drasCcally  simplify  working  with  datasets  in  Hadoop  filesystems.  The  Data  module:  •   handles  automaCc  serializaCon  and  deserializaCon  of  Java  POJOs  as  well  as  Avro  Records.  •  AutomaCc  compression.  •  File  and  directory  layout  and  management.  •  AutomaCc  parCConing  based  on  configurable  funcCons.  •  A  metadata  provider  plugin  interface  to  integrate  with  centralized  metadata  management  systems.    

•  Morphlines  –  declaraCve  ETL  stream  processing  library    •  Maven  Plugin  –  tools  for  working  with  datasets  and  running  jobs  

•  TODO:  Add  more!!!  

©2014 Cloudera, Inc. All Rights Reserved. 17

Page 18: Applications on Hadoop

Co-­‐authoring  O’Reilly  book  

•  Titled  ‘Hadoop  ApplicaCon  Architectures’  • How  to  build  end-­‐to-­‐end  soluCons  using    Apache  Hadoop  and  related  tools  • Updates  on  Twi^er:  @hadooparchbook  •  h^p://www.hadooparchitecturebook.com/  

©2014 Cloudera, Inc. All Rights Reserved. 18

Page 19: Applications on Hadoop

19 15

DatasetRepository repo = new FileSystemDatasetRepository.Builder() .fileSystem(FileSystem.get(new Configuration())) .directory(new Path(“/data”)) .get();Dataset events = repo.create(“events”, new DatasetDescriptor.Builder() .schema(new File(“event.avsc”)) .partitionStrategy( new PartitionStrategy.Builder().hash(“userId”, 53).get() ).get());DatasetWriter<GenericRecord> writer = events.getWriter();writer.open();writer.write( new GenericRecordBuilder(schema) .set(“userId”, 1) .set(“timeStamp”, System.currentTimeMillis()) .build());writer.close();

/data /events /.metadata /schema.avsc /descriptor.properties /userId=0 /10000000.avro /10000001.avro /userId=1 /20000000.avro /userId=2 /30000000.avro

Code  

Data  

Page 20: Applications on Hadoop

20

Kite  SDK  Morphlines  Module  

Pluggable,  configuraCon-­‐driven  data  transform  library  Born  out  of  Cloudera  Search,  but  general  purpose  Configure  record  transform  stages  in  a  container  library  Use  the  library  in  Flume,  MapReduce  jobs,  Storm,  and  other  Java  applicaCons  

14

Page 21: Applications on Hadoop

21

Other  Modules  

Maven  plugin  Package,  deploy,  and  execute  “apps”  Execute  dataset  operaCons  

Examples  POJO,  generic,  and  generated  enCty  ingest  Dataset  administraCve  operaCons  Crunch  and  MR  integraCon  ...  

14

Page 22: Applications on Hadoop

22

Future  

HBase  Extending  data  APIs  to  support  random  access  Same  automaCc  serializaCon,  schema  management,  etc.  

Higher-­‐order  data  management  Common  tasks  Think  background  compacCon,  conversion,  etc.  

IntegraCon  with  exisCng  middleware  frameworks  Give  us  all  your  good  ideas  (and  code)!  

14

Page 23: Applications on Hadoop

Kite  SDK  Resources  

•  Docs  •  h^p://kitesdk.org/docs/current/  

•  Examples  •  h^ps://github.com/kite-­‐sdk/kite-­‐examples  

•  Source  code  •  h^ps://github.com/kite-­‐sdk/  

Binary  arCfacts  available  from  Cloudera’s  Maven  repository  •  Twi^er:  @mark_grover  •  Slides  at    •  LinkedIn:  linkedin.com/in/grovermark  

©2014 Cloudera, Inc. All Rights Reserved. 23

Page 24: Applications on Hadoop

24 18