May 29, 2014 Toronto Hadoop User Group - Micro-ETL
DESCRIPTION
This is a brief discussion of the various technologies used for micro-ETL, a.k.a. near-realtime ETL.
TRANSCRIPT
MICRO-ETL: AN EMERGING PATTERN OF USE WITH HADOOP AND NEAR-REALTIME FRAMEWORKS
Adam Muise – Principal Architect, Hortonworks
Who is Hortonworks?
We do Hadoop
The leaders of Hadoop’s development
Community driven, Enterprise Focused
Drive innovation in the platform – we lead the roadmap
100% Open Source – Democratized Access to Data
What is Micro-ETL?
I made it up.
It turns out several other people made it up before me, so I don't feel like a megalomaniac. Another term for it is Near-Realtime ETL.
http://www.researchgate.net/publication/226219087_Near_Real_Time_ETL/file/79e4150b23b3aca5aa.pdf
Micro-ETL / Near-Realtime ETL involves:
1. An intra-batch and/or near-realtime ingest of analytical data
2. Processing of small batches of data or event streams
3. A scalable "ETL" processing framework
Why use Micro-ETL:
1. Your analytics require more up-to-date information than a regular batch process provides
2. There is operational risk (i.e. falling behind the data firehose, data loss, data fidelity) in leaving the processing of events until batch can process them
3. Your data ingest rate is inconsistent and there is value in keeping up to date with current events
When not to use Micro-ETL:
1. Your data ingest rates are predictable and analyzing them intra-batch provides little value
2. Processing in a large batch yields a complete population of the data required to make a decision
3. You have existing investments in traditional ETL tools that outweigh any benefits of a new tool/framework (as tools evolve, this will become a moot point, however)
Micro-ETL involves your regular processing tasks, only run on frameworks that can handle near-realtime event streams.
Let’s refresh on some core Hadoop concepts…
Refresh on YARN
YARN = Yet Another Resource Negotiator
Resource Manager + Node Managers = YARN
[Diagram: the Resource Manager and its Scheduler coordinate per-application AppMasters, which run their work in containers on the Node Managers, e.g. MapReduce, Storm, and Pig containers side by side.]
YARN abstracts resource management so you can run all sorts of distributed applications.
[Diagram: HDFS and YARN as the base layers, with MapReduce v2, Storm, MPI, Giraph, HBase, Tez, Spark, and more running on top.]
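To make the Resource Manager / Node Manager split concrete, here is a minimal Scala sketch (not from the talk) that uses the Hadoop 2.x YarnClient API to ask the Resource Manager what nodes and applications it is currently managing. It assumes a yarn-site.xml on the classpath; the object name is illustrative.

import org.apache.hadoop.yarn.client.api.YarnClient
import org.apache.hadoop.yarn.conf.YarnConfiguration
import scala.collection.JavaConverters._

object YarnPeek {
  def main(args: Array[String]): Unit = {
    val client = YarnClient.createYarnClient()   // talks to the Resource Manager
    client.init(new YarnConfiguration())
    client.start()

    // Node Managers registered with the Resource Manager
    client.getNodeReports().asScala.foreach { node =>
      println(s"Node Manager ${node.getNodeId}, capacity ${node.getCapability}")
    }

    // Applications (MapReduce, Tez, Storm-on-YARN, ...) and their current state
    client.getApplications().asScala.foreach { app =>
      println(s"App ${app.getApplicationId} (${app.getApplicationType}): ${app.getYarnApplicationState}")
    }

    client.stop()
  }
}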
The following section outlines frameworks that are emerging as data processing alternatives to batch-driven MapReduce.
Introducing Tez
Three important facts about Tez:
1. Tez is a YARN application.
2. Tez will eventually replace MapReduce.
3. Tez scales as well as the rest of Hadoop scales (thousands of nodes).
Tez provides a layer of abstract tasks; these could be mappers, reducers, customized stream processors, in-memory structures, etc.
Tez chains tasks together into one job to get plans like Map – Reduce – Reduce. This is ideal for apps like Hive.
[Diagram: a single Tez job chaining TezMap tasks into successive TezReduce stages over the data, e.g. Group By, then Projections, then Order By.]
Tez provides the option for more complicated workflows.
Tez allows for more complicated workflow primitives than just Map and Reduce. A Tez task is composed of a programmable Input, Output, and Processor.
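As a conceptual illustration only (these traits are hypothetical, not the real Tez API), the Input/Processor/Output composition can be sketched in Scala like this:

trait Input[A]        { def read(): Iterator[A] }
trait Output[B]       { def write(record: B): Unit }
trait Processor[A, B] { def process(records: Iterator[A]): Iterator[B] }

// A "task" is just the wiring of the three pieces; a Tez DAG chains such tasks
// together so a single job can look like Map -> Reduce -> Reduce.
case class Task[A, B](input: Input[A], processor: Processor[A, B], output: Output[B]) {
  def run(): Unit = processor.process(input.read()).foreach(output.write)
}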
YARN can provide long-running containers* for applications like Storm, HBase, JBoss, etc.
* With the help of Apache Slider: http://wiki.apache.org/incubator/SliderProposal
Yahoo!, the original author of Hadoop and the largest Hadoop user, is betting on Apache Hive/Tez/YARN as their core data architecture.
http://yahoodevelopers.tumblr.com/post/85930551108/yahoo-betting-on-apache-hive-tez-and-yarn
Refresh on Storm
Storm is a distributed execution engine that handles streaming data.
Storm processes streaming event data as tuples. Each event is generated/ingested through a spout and processed in a series of bolts. The spouts and bolts form a topology.
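A minimal spout-and-bolt topology in Scala against Storm's Java API (package names as in Storm 0.9.x; the bolt and topology names are illustrative, and TestWordSpout is Storm's built-in random-word test spout):

import backtype.storm.{Config, LocalCluster}
import backtype.storm.testing.TestWordSpout
import backtype.storm.topology.{BasicOutputCollector, OutputFieldsDeclarer, TopologyBuilder}
import backtype.storm.topology.base.BaseBasicBolt
import backtype.storm.tuple.{Fields, Tuple, Values}

// A bolt that upper-cases each word tuple it receives.
class UpperCaseBolt extends BaseBasicBolt {
  override def execute(input: Tuple, collector: BasicOutputCollector): Unit =
    collector.emit(new Values(input.getStringByField("word").toUpperCase))
  override def declareOutputFields(declarer: OutputFieldsDeclarer): Unit =
    declarer.declare(new Fields("word"))
}

object WordTopology {
  def main(args: Array[String]): Unit = {
    val builder = new TopologyBuilder
    builder.setSpout("words", new TestWordSpout)                         // the spout generates/ingests events
    builder.setBolt("upper", new UpperCaseBolt).shuffleGrouping("words") // bolts process them in series

    // Run in-process for testing; on a real cluster you would use StormSubmitter instead.
    new LocalCluster().submitTopology("micro-etl-sketch", new Config, builder.createTopology())
  }
}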
Refresh on Summingbird
Summingbird is a processing framework that runs over Storm. It uses Scala and has MapReduce-like features. Its batch side is technically Scalding (Cascading with Scala).
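The usual illustration is Summingbird's word count: the same logic can be bound to the Storm platform (online) or to Scalding (batch). This is a paraphrase of the example from the Summingbird README; exact imports and the implicit Algebird semigroups may differ by version.

import com.twitter.summingbird._

def toWords(sentence: String): Seq[String] =
  sentence.toLowerCase.split("\\s+").toSeq

// Platform P can be Storm or Scalding; the MapReduce-like logic stays the same.
def wordCount[P <: Platform[P]](source: Producer[P, String],
                                store: P#Store[String, Long]) =
  source
    .flatMap { sentence => toWords(sentence).map(word => (word, 1L)) }
    .sumByKey(store)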
Refresh on Spark
Spark is a framework designed to handle in-memory computing. If you are using Hadoop, you are typically running Spark on YARN.
Spark uses RDDs (Resilient Distributed Datasets) as a primitive to enable in-memory processing. RDDs can be created from all sorts of data, like HDFS files:
scala> val distFile = sc.textFile("hdfs://my.namenode.com:8020/tmp/data.txt")
distFile: spark.RDD[String] = spark.HadoopRDD@1d4cee08
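Continuing in the spark-shell, a small sketch (not from the talk) of what in-memory processing on that RDD looks like, e.g. a word count kept cached for repeated queries:

scala> val counts = distFile.flatMap(_.split("\\s+")).map(word => (word, 1)).reduceByKey(_ + _)
scala> counts.cache()              // keep the result in memory for repeated queries
scala> counts.take(10).foreach(println)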
Spark Streaming constructs DStream primitives from a streaming data source. DStreams are actually made up of many RDDs and use the common Spark engine.
Spark Streaming has an expressive API that allows typical transformations on RDDs or DStreams. This is comparable to MapReduce.
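A minimal Spark Streaming sketch in Scala (not from the talk): 10-second micro-batches from a socket source, with the same word-count transformations applied to the DStream. The host, port, app name, and batch interval are illustrative; submit it with spark-submit, which supplies the master.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("micro-etl-streaming-sketch")
    val ssc  = new StreamingContext(conf, Seconds(10))   // each DStream batch becomes a new RDD every 10s

    val lines  = ssc.socketTextStream("localhost", 9999) // illustrative streaming source
    val counts = lines.flatMap(_.split("\\s+")).map(word => (word, 1)).reduceByKey(_ + _)
    counts.print()                                       // emit each micro-batch's counts

    ssc.start()
    ssc.awaitTermination()
  }
}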
Since Micro-ETL involves porting traditional data processing tasks to different execution environments, it makes sense that data processing tools would facilitate execution on multiple platforms.
Some options in the field…
You can write generic processing libraries (Java) for your data and port them from MapReduce/Tez to Storm. Storm can also run Summingbird (Scala).
Spark already provides a streaming and processing layer for Micro-ETL. These can be written in Scala, Java, or Python.
Cascading has recently announced that it will support Tez and Storm immediately in version 3.0. They also have plans for Spark.
http://www.concurrentinc.com/2014/05/cascading-3-0-adds-multiple-framework-support-concurrent-driven-manages-big-data-apps/
Pentaho Kettle currently supports porting its data workflows to Storm in a beta version.
http://wiki.pentaho.com/display/BAD/Kettle+Execution+on+Storm
Talend recently raised $40 million in funding to push further into Big Data. That includes tooling to port Talend ETL workflows to Storm and Tez.
http://techcrunch.com/2013/12/11/talend-raises-40m-to-more-aggressively-extend-into-big-data-market-sets-sights-on-ipo/
Discuss.
Thanks THUGs.