may 29, 2014 toronto hadoop user group - micro etl

43
MICROETL: AN EMERGING PATTERN OF USE WITH HADOOP AND NEARREALTIME FRAMEWORKS Adam Muise – Principle Architect, Hortonworks

Upload: adam-muise

Post on 11-Aug-2014

357 views

Category:

Data & Analytics


0 download

DESCRIPTION

This is a brief discussion about the various technologies utilized for micro-ETL aka near-realtime ETL

TRANSCRIPT

Page 1: May 29, 2014 Toronto Hadoop User Group - Micro ETL

MICRO-­‐ETL:  AN  EMERGING  PATTERN  OF  USE  WITH  HADOOP  AND  NEAR-­‐REALTIME  FRAMEWORKS  

Adam  Muise  –  Principle  Architect,  Hortonworks  

Page 2: May 29, 2014 Toronto Hadoop User Group - Micro ETL

Who  is                                        ?  

Page 3: May 29, 2014 Toronto Hadoop User Group - Micro ETL

We  do  Hadoop  

The  leaders  of  Hadoop’s  development  

Community  driven,    Enterprise  Focused  

Drive  InnovaEon  in  the  plaForm  –  We  lead  the  roadmap    

100%  Open  Source  –  DemocraEzed  Access  to  Data  

Page 4: May 29, 2014 Toronto Hadoop User Group - Micro ETL

What  is  Micro-­‐ETL?  

Page 5: May 29, 2014 Toronto Hadoop User Group - Micro ETL

I  made  it  up.  

Page 6: May 29, 2014 Toronto Hadoop User Group - Micro ETL

It  turns  out  several  other  people  have  made  it  up  before  so  I  don’t  

feel  like  a  megalomaniac.  Another  terminology  calls  it  Near-­‐

RealEme  ETL.    

hTp://www.researchgate.net/publicaEon/226219087_Near_Real_Time_ETL/file/79e4150b23b3aca5aa.pdf  

Page 7: May 29, 2014 Toronto Hadoop User Group - Micro ETL

Micro-­‐ETL/Near-­‐RealEme  ETL  involves:  1.  An  intra-­‐batch  and/or  near-­‐real3me  ingest  of  analy3cal  data    2.  Processing  on  a  small  batches  of  data  or  event  streams    3.  A  scalable  “ETL”  processing  framework    

Page 8: May 29, 2014 Toronto Hadoop User Group - Micro ETL

Why  Use  Micro-­‐ETL:  1.  Your  analy3cs  require  more  up  to  date  informa3on  than  provided  by  a  regular  batch  process  2.  There  is  opera3onal  risk  (ie  falling  behind  the  data  firehose,  data  loss,  data  fidelity)  in  leaving  the  processing  of  events  un3l  batch  can  process  them  3.  Your  data  ingest  rate  is  inconsistent  and  there  is  value  in  keeping  up-­‐to-­‐date  with  current  events.  

Page 9: May 29, 2014 Toronto Hadoop User Group - Micro ETL

When  not  to  use  Micro-­‐ETL:  1.  Your  data  ingest  rates  are  predictable  and  analyzing  them  intra-­‐batch  provides  liMle  value.  2.  Processing  in  a  large  batch  yields  a  complete  popula3on  of  the  data  required  to  make  a  decision.    3.  You  have  exis3ng  investments  in  tradi3onal  ETL  tools  that  outweigh  any  benefits  in  a  new  tool/framework  (as  tools  evolve,  this  will  be  a  moot  point  however)  

Page 10: May 29, 2014 Toronto Hadoop User Group - Micro ETL

Micro-­‐ETL  involves  regular  processing  tasks  only  run  on  data  

frameworks  that  can  handle  near-­‐realEme  event  streams  

Page 11: May 29, 2014 Toronto Hadoop User Group - Micro ETL

Let’s  refresh  on  some  core  Hadoop  concepts…  

Page 12: May 29, 2014 Toronto Hadoop User Group - Micro ETL

Refresh  on  YARN  

Page 13: May 29, 2014 Toronto Hadoop User Group - Micro ETL

YARN  =  Yet  Another  Resource  NegoEator  

Page 14: May 29, 2014 Toronto Hadoop User Group - Micro ETL

Resource  Manager  +  

Node  Managers  =  YARN  

Resource  Manager  

AppMaster  

Node  Manager  

Scheduler  

AppMaster  

AppMaster  

Node  Manager  

Node  Manager  

Node  Manager  

Container  

Container  

MapReduce  

Container  

Storm  

Container  

Container  

Container  

Pig  

Container  

Container  

Container  

Page 15: May 29, 2014 Toronto Hadoop User Group - Micro ETL

YARN  abstracts  resource  management  so  you  can  run  all  sorts  

of  distributed  applicaEons  

HDFS  

MapReduce  V2  

YARN  MapReduce  V?   STORM  

MPI  Giraph   HBase  Tez   …  and  more  Spark  

Page 16: May 29, 2014 Toronto Hadoop User Group - Micro ETL

The  following  secEon  outlines  frameworks  that  are  emerging  as  data  processing  opEons  to  batch-­‐

driven  MapReduce  

Page 17: May 29, 2014 Toronto Hadoop User Group - Micro ETL

Introducing  Tez  

Page 18: May 29, 2014 Toronto Hadoop User Group - Micro ETL

Three  important  facts  about  Tez:  1.  Tez  is  a  YARN  applicaEon.  2.  Tez  will  eventually  replace  

MapRecue.  3.  Tez  scales  as  well  as  the  rest  of  

Hadoop  scales  (thousands  of  nodes).  

Page 19: May 29, 2014 Toronto Hadoop User Group - Micro ETL

Tez  provides  a  layer  for  abstract  tasks,  these  could  be  mappers,  reducers,  customized  stream  

processes,  in  memory  structures,  etc  

Page 20: May 29, 2014 Toronto Hadoop User Group - Micro ETL

Tez  chains  tasks  together  into  one  job  to  get  jobs  like  Map  –  Reduce  –  Reduce.    

This  is  ideal  for  apps  like  Hive.  

TezMap  

TezMap  

TezMap  

TezMap  

TezMap  

TezReduce  

Data  

Data  Data  

Data  

Data  Data  

Data  

Data  Data  

Data  

Data  Data  

Data  

Data  Data  

Data  

Data  Data  

Data  

Data  Data  

Data  

Data  Data  

TezReduce  

TezReduce  

TezReduce  

TezReduce  

TezReduce  

Group  By  

ProjecEons  

Order  By  

Page 21: May 29, 2014 Toronto Hadoop User Group - Micro ETL

Tez  provides  the  opEon  for  more  complicated  workflows  

Page 22: May 29, 2014 Toronto Hadoop User Group - Micro ETL

Tez  allows  for  more  complicated  workflow  primiEves  than  just  Map  and  Reduce.  A  Tez  task  is  composed  of  a  programmable  Input,  Output,  

and  Processor  

Page 23: May 29, 2014 Toronto Hadoop User Group - Micro ETL

YARN  can  provide  long-­‐running  containers*  for  applicaEons  like  

Storm,  Hbase,  JBoss,  etc    

*  -­‐  With  the  help  of  Apache  Slider:    hTp://wiki.apache.org/incubator/SliderProposal  

Page 24: May 29, 2014 Toronto Hadoop User Group - Micro ETL

Yahoo!,  the  original  author  of  Hadoop  and  the  largest  Hadoop  

user,  is  bepng  on  Apache  Hive/Tez/YARN  as  their  core  data  

architecture.      

hTp://yahoodevelopers.tumblr.com/post/85930551108/yahoo-­‐bepng-­‐on-­‐apache-­‐hive-­‐tez-­‐and-­‐yarn  

Page 25: May 29, 2014 Toronto Hadoop User Group - Micro ETL

Refresh  on  Storm  

Page 26: May 29, 2014 Toronto Hadoop User Group - Micro ETL

Storm  is  a  distributed  execuEon  engine  that  handles  streaming  data  

Page 27: May 29, 2014 Toronto Hadoop User Group - Micro ETL

Storm  processes  streaming  event  data  as  tuples.  Each  event  is  

generated/ingested  through  a  spout  and  processed  in  series  of  bolts.  The  spouts  and  bolts  form  a  topology.  

Page 28: May 29, 2014 Toronto Hadoop User Group - Micro ETL

Refresh  on  Summingbird  

Page 29: May 29, 2014 Toronto Hadoop User Group - Micro ETL

Summingbird  is  a  processing  framework  that  runs  over  Storm.  It  uses  Scala  and  has  MapReduce-­‐like  features.  Technically  it’s  Scalding  

(Cascading  with  Scala).  

Page 30: May 29, 2014 Toronto Hadoop User Group - Micro ETL

Refresh  on  Spark  

Page 31: May 29, 2014 Toronto Hadoop User Group - Micro ETL

Spark  is  a  framework  designed  to  handle  in-­‐memory  compuEng.  If  you  are  using  Hadoop,  you  are  typically  

running  Spark  on  YARN.  

Page 32: May 29, 2014 Toronto Hadoop User Group - Micro ETL

Spark  uses  RDDs  (Resilient  Distributed  Datasets)  as  a  primiEve  to  enable  in-­‐memory  processing.  RDDs  can  be  created  from  all  sorts  

of  data,  like  HDFS  files:  

scala>  val  distFile  =  sc.textFile("hdfs://my.namenode.com:8020/tmp/data.txt")    distFile:  spark.RDD[String]  =  spark.HadoopRDD@1d4cee08  

Page 33: May 29, 2014 Toronto Hadoop User Group - Micro ETL

Spark  Streaming  constructs  DStream  primiEves  from  a  streaming  data  

source.  DStreams  are  actually  made  up  of  many  RDDs  and  use  the  

common  Spark  Engine.    

Page 34: May 29, 2014 Toronto Hadoop User Group - Micro ETL

Spark  Streaming  has  an  expressive  API  to  allow  typical  transformaEons  

on  RDDs  or  DStreams.  This  is  comparable  to  MapReduce.  

Page 35: May 29, 2014 Toronto Hadoop User Group - Micro ETL

Since  Micro-­‐ETL  involves  porEng  tradiEonal  data  processing  tasks  to  different  execuEon  environments,  it  makes  sense  that  data  processing  tools  would  facilitate  execuEon  on  

mulEple  plaForms.    

Page 36: May 29, 2014 Toronto Hadoop User Group - Micro ETL

Some  opEons  in  the  field…  

Page 37: May 29, 2014 Toronto Hadoop User Group - Micro ETL

You  can  write  generic  processing  libraries  (Java)  for  your  data  and  

port  them  from  MapReduce/Tez  to  Storm.  Storm  can  also  run  Summingbird  (Scala).    

Page 38: May 29, 2014 Toronto Hadoop User Group - Micro ETL

Spark  already  provides  a  streaming  and  processing  layer  for  Micro-­‐ETL.  These  can  be  

Scala,  Java,  or  Python.  

Page 39: May 29, 2014 Toronto Hadoop User Group - Micro ETL

Cascading  has  recently  announced  that  it  would  support  Tez  and  Storm  immediately  in  version  3.0.  They  

also  have  plans  for  Spark  

hTp://www.concurrenEnc.com/2014/05/cascading-­‐3-­‐0-­‐adds-­‐mulEple-­‐framework-­‐support-­‐concurrent-­‐driven-­‐manages-­‐big-­‐data-­‐apps/  

Page 40: May 29, 2014 Toronto Hadoop User Group - Micro ETL

Pentaho  KeTle  currently  supports  porEng  their  data  workflows  to  Storm  in  a  beta  

version.  

hTp://wiki.pentaho.com/display/BAD/KeTle+ExecuEon+on+Storm  

Page 41: May 29, 2014 Toronto Hadoop User Group - Micro ETL

Talend  recently  asked  for  $40  million  in  funding  to  help  push  

further  into  Big  Data.  That  includes  tooling  to  port  Talend  ETL  workflows  

to  Storm  and  Tez.  

hTp://techcrunch.com/2013/12/11/talend-­‐raises-­‐40m-­‐to-­‐more-­‐aggressively-­‐extend-­‐into-­‐big-­‐data-­‐market-­‐sets-­‐sights-­‐on-­‐ipo/  

Page 42: May 29, 2014 Toronto Hadoop User Group - Micro ETL

Discuss.  

Page 43: May 29, 2014 Toronto Hadoop User Group - Micro ETL

Thanks  THUGs.