May 29, 2014 Toronto Hadoop User Group - Micro-ETL
DESCRIPTION
This is a brief discussion of the various technologies used for micro-ETL, a.k.a. near-realtime ETL.
TRANSCRIPT
MICRO-ETL: AN EMERGING PATTERN OF USE WITH HADOOP AND NEAR-REALTIME FRAMEWORKS
Adam Muise – Principal Architect, Hortonworks
Who is Hortonworks?
We do Hadoop
The leaders of Hadoop’s development
Community driven, Enterprise Focused
Drive innovation in the platform – we lead the roadmap
100% Open Source – Democratized Access to Data
What is Micro-ETL?
I made it up.
It turns out several other people made it up before me, so I don't feel like a megalomaniac. Another term for it is Near-Realtime ETL.
http://www.researchgate.net/publication/226219087_Near_Real_Time_ETL/file/79e4150b23b3aca5aa.pdf
Micro-ETL / Near-Realtime ETL involves:
1. An intra-batch and/or near-realtime ingest of analytical data
2. Processing of small batches of data or event streams
3. A scalable "ETL" processing framework
Why use Micro-ETL:
1. Your analytics require more up-to-date information than a regular batch process provides
2. There is operational risk (i.e. falling behind the data firehose, data loss, data fidelity) in leaving the processing of events until batch can process them
3. Your data ingest rate is inconsistent and there is value in keeping up to date with current events
When not to use Micro-ETL:
1. Your data ingest rates are predictable and analyzing them intra-batch provides little value
2. Processing in a large batch yields a complete population of the data required to make a decision
3. You have existing investments in traditional ETL tools that outweigh any benefits of a new tool/framework (as tools evolve, this will become a moot point, however)
Micro-ETL involves your regular processing tasks, only run on frameworks that can handle near-realtime event streams.
Let’s refresh on some core Hadoop concepts…
Refresh on YARN
YARN = Yet Another Resource Negotiator
Resource Manager + Node Managers = YARN
[Diagram: the Resource Manager and its Scheduler coordinate per-application AppMasters, which run their work in containers on the Node Managers, e.g. MapReduce, Storm, and Pig containers side by side.]
YARN abstracts resource management so you can run all sorts of distributed applications.
[Diagram: HDFS and YARN as the base layers, with MapReduce v2, Storm, MPI, Giraph, HBase, Tez, Spark, and more running on top.]
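To make the Resource Manager / Node Manager split concrete, here is a minimal Scala sketch (not from the talk) that uses the Hadoop 2.x YarnClient API to ask the Resource Manager what nodes and applications it is currently managing. It assumes a yarn-site.xml on the classpath; the object name is illustrative.

import org.apache.hadoop.yarn.client.api.YarnClient
import org.apache.hadoop.yarn.conf.YarnConfiguration
import scala.collection.JavaConverters._

object YarnPeek {
  def main(args: Array[String]): Unit = {
    val client = YarnClient.createYarnClient()   // talks to the Resource Manager
    client.init(new YarnConfiguration())
    client.start()

    // Node Managers registered with the Resource Manager
    client.getNodeReports().asScala.foreach { node =>
      println(s"Node Manager ${node.getNodeId}, capacity ${node.getCapability}")
    }

    // Applications (MapReduce, Tez, Storm-on-YARN, ...) and their current state
    client.getApplications().asScala.foreach { app =>
      println(s"App ${app.getApplicationId} (${app.getApplicationType}): ${app.getYarnApplicationState}")
    }

    client.stop()
  }
}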
The following section outlines frameworks that are emerging as data processing alternatives to batch-driven MapReduce.
Introducing Tez
Three important facts about Tez:
1. Tez is a YARN application.
2. Tez will eventually replace MapReduce.
3. Tez scales as well as the rest of Hadoop scales (thousands of nodes).
Tez provides a layer of abstract tasks; these could be mappers, reducers, customized stream processors, in-memory structures, etc.
Tez chains tasks together into one job to get plans like Map – Reduce – Reduce. This is ideal for apps like Hive.
[Diagram: a single Tez job chaining TezMap tasks into successive TezReduce stages over the data, e.g. Group By, then Projections, then Order By.]
Tez provides the option for more complicated workflows.
Tez allows for more complicated workflow primitives than just Map and Reduce. A Tez task is composed of a programmable Input, Output, and Processor.
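As a conceptual illustration only (these traits are hypothetical, not the real Tez API), the Input/Processor/Output composition can be sketched in Scala like this:

trait Input[A]        { def read(): Iterator[A] }
trait Output[B]       { def write(record: B): Unit }
trait Processor[A, B] { def process(records: Iterator[A]): Iterator[B] }

// A "task" is just the wiring of the three pieces; a Tez DAG chains such tasks
// together so a single job can look like Map -> Reduce -> Reduce.
case class Task[A, B](input: Input[A], processor: Processor[A, B], output: Output[B]) {
  def run(): Unit = processor.process(input.read()).foreach(output.write)
}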
YARN can provide long-running containers* for applications like Storm, HBase, JBoss, etc.
* With the help of Apache Slider: http://wiki.apache.org/incubator/SliderProposal
Yahoo!, the original author of Hadoop and the largest Hadoop user, is betting on Apache Hive/Tez/YARN as their core data architecture.
http://yahoodevelopers.tumblr.com/post/85930551108/yahoo-betting-on-apache-hive-tez-and-yarn
Refresh on Storm
Storm is a distributed execution engine that handles streaming data.
Storm processes streaming event data as tuples. Each event is generated/ingested through a spout and processed in a series of bolts. The spouts and bolts form a topology.
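A minimal spout-and-bolt topology in Scala against Storm's Java API (package names as in Storm 0.9.x; the bolt and topology names are illustrative, and TestWordSpout is Storm's built-in random-word test spout):

import backtype.storm.{Config, LocalCluster}
import backtype.storm.testing.TestWordSpout
import backtype.storm.topology.{BasicOutputCollector, OutputFieldsDeclarer, TopologyBuilder}
import backtype.storm.topology.base.BaseBasicBolt
import backtype.storm.tuple.{Fields, Tuple, Values}

// A bolt that upper-cases each word tuple it receives.
class UpperCaseBolt extends BaseBasicBolt {
  override def execute(input: Tuple, collector: BasicOutputCollector): Unit =
    collector.emit(new Values(input.getStringByField("word").toUpperCase))
  override def declareOutputFields(declarer: OutputFieldsDeclarer): Unit =
    declarer.declare(new Fields("word"))
}

object WordTopology {
  def main(args: Array[String]): Unit = {
    val builder = new TopologyBuilder
    builder.setSpout("words", new TestWordSpout)                         // the spout generates/ingests events
    builder.setBolt("upper", new UpperCaseBolt).shuffleGrouping("words") // bolts process them in series

    // Run in-process for testing; on a real cluster you would use StormSubmitter instead.
    new LocalCluster().submitTopology("micro-etl-sketch", new Config, builder.createTopology())
  }
}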
Refresh on Summingbird
Summingbird is a processing framework that runs over Storm. It uses Scala and has MapReduce-like features. Its batch side is technically Scalding (Cascading with Scala).
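The usual illustration is Summingbird's word count: the same logic can be bound to the Storm platform (online) or to Scalding (batch). This is a paraphrase of the example from the Summingbird README; exact imports and the implicit Algebird semigroups may differ by version.

import com.twitter.summingbird._

def toWords(sentence: String): Seq[String] =
  sentence.toLowerCase.split("\\s+").toSeq

// Platform P can be Storm or Scalding; the MapReduce-like logic stays the same.
def wordCount[P <: Platform[P]](source: Producer[P, String],
                                store: P#Store[String, Long]) =
  source
    .flatMap { sentence => toWords(sentence).map(word => (word, 1L)) }
    .sumByKey(store)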
Refresh on Spark
Spark is a framework designed to handle in-memory computing. If you are using Hadoop, you are typically running Spark on YARN.
Spark uses RDDs (Resilient Distributed Datasets) as a primitive to enable in-memory processing. RDDs can be created from all sorts of data, like HDFS files:
scala> val distFile = sc.textFile("hdfs://my.namenode.com:8020/tmp/data.txt")
distFile: spark.RDD[String] = spark.HadoopRDD@1d4cee08
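Continuing in the spark-shell, a small sketch (not from the talk) of what in-memory processing on that RDD looks like, e.g. a word count kept cached for repeated queries:

scala> val counts = distFile.flatMap(_.split("\\s+")).map(word => (word, 1)).reduceByKey(_ + _)
scala> counts.cache()              // keep the result in memory for repeated queries
scala> counts.take(10).foreach(println)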
Spark Streaming constructs DStream primitives from a streaming data source. DStreams are actually made up of many RDDs and use the common Spark engine.
Spark Streaming has an expressive API that allows typical transformations on RDDs or DStreams. This is comparable to MapReduce.
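A minimal Spark Streaming sketch in Scala (not from the talk): 10-second micro-batches from a socket source, with the same word-count transformations applied to the DStream. The host, port, app name, and batch interval are illustrative; submit it with spark-submit, which supplies the master.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("micro-etl-streaming-sketch")
    val ssc  = new StreamingContext(conf, Seconds(10))   // each DStream batch becomes a new RDD every 10s

    val lines  = ssc.socketTextStream("localhost", 9999) // illustrative streaming source
    val counts = lines.flatMap(_.split("\\s+")).map(word => (word, 1)).reduceByKey(_ + _)
    counts.print()                                       // emit each micro-batch's counts

    ssc.start()
    ssc.awaitTermination()
  }
}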
Since Micro-ETL involves porting traditional data processing tasks to different execution environments, it makes sense that data processing tools would facilitate execution on multiple platforms.
Some options in the field…
You can write generic processing libraries (Java) for your data and port them from MapReduce/Tez to Storm. Storm can also run Summingbird (Scala).
Spark already provides a streaming and processing layer for Micro-ETL. These can be written in Scala, Java, or Python.
Cascading has recently announced that it will support Tez and Storm immediately in version 3.0. They also have plans for Spark.
http://www.concurrentinc.com/2014/05/cascading-3-0-adds-multiple-framework-support-concurrent-driven-manages-big-data-apps/
Pentaho Kettle currently supports porting its data workflows to Storm in a beta version.
http://wiki.pentaho.com/display/BAD/Kettle+Execution+on+Storm
Talend recently raised $40 million in funding to push further into Big Data. That includes tooling to port Talend ETL workflows to Storm and Tez.
http://techcrunch.com/2013/12/11/talend-raises-40m-to-more-aggressively-extend-into-big-data-market-sets-sights-on-ipo/
Discuss.
Thanks THUGs.