Hadoop Data Transformation
Hadoop “Data Transformation”
[about Big Data] “Size matters not. Look at me.”
[about this presentation] “Become a little bit more knowledgeable, you will. Fear not, your mind just open…” (could be) Yoda – Star Wars
“The dark side clouds everything. Impossible to see the future is.” (actual quote) Yoda – Star Wars
Constantin Ciureanu, Software Architect ([email protected]), Project “Hadoop Data Transformation”
Hadoop - outro
The Hadoop MapReduce framework is:
• Simple (but not trivial), yet powerful
• Used to process large amounts of data in parallel
• Highly efficient, scalable, fault-tolerant
• Incredibly useful (it actually solves a lot of problems)
• Proven to be an invaluable tool for many companies over the past few years
Well, … surprise! Plain MapReduce is not sufficient for Optymyze!
Why? Because the abstraction level of Optymyze “Data Transformation” could bring any Hadoop developer to their knees (or worse, see the next slide)!
“May the Force be with you!”
Cascading – to the rescue
Cascading framework allows for easy development of complex Hadoop MapReduce jobs
• A Flow is a sequence of manipulations on pipes through which tuple streams flow
• Any flow is reusable and can be used to compose bigger flows
• At runtime, each flow is translated into one or more MapReduce jobs (executed in dependency order)
Cascading
[Diagram: layered stack – Optymyze Data Transformation on top of Cascading, on top of Hadoop]
Cascading – details
• Cascading represents data as Tuples (a single row of data to be processed)
(“A”, 100)
(“B”, 140)
• Inputs and Outputs are called Taps, using one of the following schemes: TextLine, TextDelimited, SequenceFile, WritableSequenceFile
• Each Tap produces or receives a pipe of tuples with the same format
• If needed, there is support for multiple inputs / multiple outputs
• Cascading has integrations with Avro, HBase, JDBC, Memcached, SimpleDB etc.
• Cascading is very flexible, offers a lot of functionality and can be easily extended
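To make the Tuple and Tap concepts concrete, here is a minimal Python sketch (illustrative only – Cascading itself is a Java API): a tuple pairs an ordered list of field names with one row of values, and a TextDelimited-style tap parses each input line into such a tuple. The function name `parse_delimited` is invented for this example.

```python
def parse_delimited(line, fields, delimiter="\t"):
    """Turn one delimited text line into a 'tuple': a dict of field -> value,
    roughly what a TextDelimited scheme produces for each input line."""
    values = line.rstrip("\n").split(delimiter)
    return dict(zip(fields, values))

row = parse_delimited("A\t100", ["letter", "occurrences"])
# row == {"letter": "A", "occurrences": "100"}
```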
Cascading - example
[“letter”,”occurrences”]
(“A”, 1)
(“B”, 5)
(“A”, 2)
(“B”, 6)
(“A”, 3)
Applying a GroupBy(“letter”) produces:
(“A”, [1,2,3])
(“B”, [5,6])
A subsequent Sum(“occurrences”) produces:
[“letter”, “sum”]
(“A”, 6)
(“B”, 11)
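The GroupBy + Sum example above can be sketched in a few lines of Python (the function names are illustrative, not Cascading API; Cascading expresses the same flow in Java):

```python
from itertools import groupby

def group_by(stream, key_field, value_field):
    """GroupBy: collect the values per distinct key,
    e.g. ("A", 1), ("A", 2) -> ("A", [1, 2])."""
    stream = sorted(stream, key=lambda t: t[key_field])  # the shuffle/sort phase
    return {k: [t[value_field] for t in g]
            for k, g in groupby(stream, key=lambda t: t[key_field])}

def sum_aggregator(groups):
    """Every + Sum: reduce each group's list of values to its sum."""
    return {k: sum(vs) for k, vs in groups.items()}

stream = [{"letter": "A", "occurrences": 1}, {"letter": "B", "occurrences": 5},
          {"letter": "A", "occurrences": 2}, {"letter": "B", "occurrences": 6},
          {"letter": "A", "occurrences": 3}]
grouped = group_by(stream, "letter", "occurrences")  # {"A": [1, 2, 3], "B": [5, 6]}
summed = sum_aggregator(grouped)                     # {"A": 6, "B": 11}
```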
Cascading - features
• Each – executes some processing for each tuple in the input stream
• GroupBy – (similar to SQL) groups the input tuples using a set of grouping fields (the task is achieved in the Reduce phase, while the Map phase just decorates the output with some grouping flags)
• CoGroup – joins together several tuple streams
• Every – executes some processing (aggregator, buffer) for a group of tuples in the input (the group being generated by a CoGroup or GroupBy)
• HashJoin – allows two or more tuple streams to be joined into a single stream via a Joiner, when all but one tuple stream are considered small enough to fit into memory
• Split – takes a single stream and sends it down multiple paths
• Merge – allows multiple streams with the same fields to be merged back into a single stream
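HashJoin's trick – the small side lives in an in-memory hash map, so the large stream is joined in a single map-side pass with no reduce phase – can be sketched as follows (illustrative Python; names are invented, not Cascading API):

```python
def hash_join(large_stream, small_stream, key):
    """Map-side join: build a lookup from the small stream, stream the large one."""
    lookup = {t[key]: t for t in small_stream}   # small side fits in memory
    for row in large_stream:
        match = lookup.get(row[key])
        if match is not None:                    # inner-join semantics
            merged = dict(row)
            merged.update({k: v for k, v in match.items() if k != key})
            yield merged

sales = [{"id": 1, "amount": 100}, {"id": 2, "amount": 140}]
names = [{"id": 1, "name": "A"}, {"id": 2, "name": "B"}]
joined = list(hash_join(sales, names, "id"))
# [{"id": 1, "amount": 100, "name": "A"}, {"id": 2, "amount": 140, "name": "B"}]
```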
Hadoop Data Transformation
• A job = List<Process>
  – DataLoad – a process
  – DataTransformation process = List<Operation>:
    • Aggregate Records
    • Post Records
    • Rank Records
    • Calculate Fields
    • Calculate Running Fields
    • Merge Fields
    • Merge Records
    • Roll Up Using Hierarchy
  – DataExport – a process
Data Load
• Copies the file to HDFS
• Executes a Cascading job that checks for duplicates
• Stores the result on HDFS
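The duplicate check can be pictured as a pass that keeps only the first tuple per key (a hypothetical sketch of the semantics; the real step is a Cascading flow, and the key fields depend on the data set):

```python
def deduplicate(rows, key_fields):
    """Keep the first occurrence of each key; drop subsequent duplicates."""
    seen = set()
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        if key not in seen:
            seen.add(key)
            yield row

rows = [{"id": 1, "v": "x"}, {"id": 1, "v": "y"}, {"id": 2, "v": "z"}]
unique = list(deduplicate(rows, ["id"]))  # keeps id=1 ("x") and id=2 ("z")
```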
Hadoop Data Transformation
[Diagram: File(s) → Hadoop HDFS (folder of part-files) → DT Job (DT Process → DT Process → DT Process) → folder of part-files → PostRecords → file(s) to be loaded to Oracle → Oracle]
Data Transformation - Operation
• Input Tap – according to the operation input type:
  – TABLE – use a JDBC Tap
  – OPERATION – either:
    • an output pipe from a previous operation
    • an HDFS folder
• (optional) Input Filter
• The actual processing block for this operation
• (optional) Output Filter
• Output Tap:
  – Operation – the output is stored on HDFS as a folder
  – PostOperation – an Oracle destination table, with 2 ways to get data there:
    • JDBC output tap (batch-inserts rows into the destination table)
    • HDFS output tap (stores the output as an HDFS folder; at the end the destination table is loaded using an “external” table)
[Diagram: (optional) Filter → Operation → (optional) Filter]
Aggregate Records
• Input pipe (with optional input filter)
• GroupBy – on the grouping fields from the AggregateRecords definition
• As many Every steps as needed (one for each AggregateCalculation), each of them iterating over the GroupBy results and adding an Aggregator (see below)
• Output pipe (with optional output filter)
• Aggregators (applied to all the values in the relevant column of the current group):
  – SumAggregator
  – AverageAggregator
  – Last(Character)Aggregator
  – Max(Character)Aggregator
  – Min(Character)Aggregator
  – CountAggregator
  – CountDistinctAggregator
  – MedianAggregator
  – StdDevAggregator (with sample and population options)
[Diagram: (optional) Filter → GroupBy → Every (Aggregator ×3) → (optional) Filter]
Calculate Fields
• Input pipe (with optional input filter)
• Each – with a CustomCalculateFields Function including all calculation definitions
• A calculation definition might:
  – create a new field in the input tuple (which can later be reused in subsequent formulas)
  – overwrite an existing field with a new value
• Each calculated field is a “math tree” – the calculation formula is defined recursively (existing input fields & values are used in functions or by other calculated fields)
• Output pipe (with optional output filter)
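Such a “math tree” can be evaluated with a small recursive walk. The sketch below is hypothetical – the node layout, field names, and operator set are invented for illustration – but it shows how a calculated field can reuse input fields and feed later formulas:

```python
def evaluate(node, row):
    """Recursively evaluate a calculation tree against one input tuple."""
    if node["type"] == "const":
        return node["value"]
    if node["type"] == "field":      # reference an existing or calculated field
        return row[node["name"]]
    args = [evaluate(child, row) for child in node["args"]]
    ops = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}
    return ops[node["op"]](*args)

# commission = amount * rate
row = {"amount": 200.0, "rate": 0.05}
tree = {"type": "op", "op": "mul",
        "args": [{"type": "field", "name": "amount"},
                 {"type": "field", "name": "rate"}]}
row["commission"] = evaluate(tree, row)  # stored back, reusable in later formulas
```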
[Diagram: (optional) Filter → Each (Function: Calculation ×3) → (optional) Filter]
Merge Fields
• Input pipe(s) (with optional input filters)
• A CoGroup – using the MatchingCondition from the definition, placing together all the left- and right-hand-side columns
• A Discard pipe – removes the duplicated columns (the ones used in the join)
• Output pipe (with optional output filter)
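The CoGroup-then-Discard combination can be sketched as an inner join that drops the right-hand copies of the join columns (illustrative Python with invented names; the real steps are Cascading pipes):

```python
def merge_fields(left, right, join_keys):
    """Join two tuple streams on join_keys, discarding duplicated join columns."""
    index = {}
    for r in right:
        index.setdefault(tuple(r[k] for k in join_keys), []).append(r)
    for l in left:
        for r in index.get(tuple(l[k] for k in join_keys), []):
            merged = dict(l)
            # the Discard step: keep only one copy of each join column
            merged.update({k: v for k, v in r.items() if k not in join_keys})
            yield merged

left = [{"id": 1, "amount": 100}]
right = [{"id": 1, "region": "EU"}]
merged_rows = list(merge_fields(left, right, ["id"]))
# [{"id": 1, "amount": 100, "region": "EU"}] – no duplicated "id" column
```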
[Diagram: two filtered input streams → CoGroup → Discard → (optional) Filter]
Rank Records
• Input pipe (with optional input filter)
• For each RankingCalculation:
  – GroupBy – over the grouping fields (or none, when ranking over all records)
  – An Every with a custom RankingBuffer
• Output pipe (with optional output filter)
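The GroupBy + RankingBuffer pair can be sketched like this (illustrative Python; the grouping and ranking fields are invented examples):

```python
from collections import defaultdict

def rank_records(rows, group_field, rank_field):
    """Group the tuples, then have a 'buffer' walk each group in ranking
    order and emit a rank (1..n) per tuple."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_field]].append(row)
    out = []
    for group in groups.values():
        ordered = sorted(group, key=lambda r: r[rank_field], reverse=True)
        for rank, row in enumerate(ordered, start=1):
            out.append({**row, "rank": rank})
    return out

rows = [{"team": "A", "score": 10}, {"team": "A", "score": 30},
        {"team": "B", "score": 20}]
ranked = rank_records(rows, "team", "score")
# team A: score 30 -> rank 1, score 10 -> rank 2; team B: score 20 -> rank 1
```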
[Diagram: (optional) Filter → (GroupBy → Every (RankingBuffer)) ×N → (optional) Filter]
Post Records
• Input pipe (with optional input filter)
• Each – with an Identity operation meant just to select the output fields, in the required order
• Output pipe (with optional output filter)
[Diagram: (optional) Filter → Each (Identity) → (optional) Filter]
Benchmarks
Questions? I … DARE you!
… we’re still only at the tip of the iceberg – I promise more in my future presentations!
Until then:
•Follow me on Yammer and ask me questions at any time!
• Interesting links:
  – https://github.com/Cascading/cascading.samples/tree/master/wordcount
  – http://docs.cascading.org/impatient/
Known not-yet-implemented parts
• periods
• dates
• entities
• logging & messages
• Pre / Post / Per-line validations (except what can be accomplished using the Filter functionality that’s already implemented)
• Future extensions, ideas:
  – Add source Data Connectors
    • for maximum throughput (parallel download from external sources)
    • currently supported: files (from local file systems / HDFS), Oracle
  – Logi integration