Hadoop Data Transformation
Hadoop “Data Transformation”
[about Big Data] “Size matters not. Look at me.”
[about this presentation] “Become a little bit more knowledgeable, you will. Fear not, your mind just open…” (could be) Yoda – Star Wars
“The dark side clouds everything. Impossible to see the future is.” (actual quote) Yoda – Star Wars
Constantin Ciureanu, Software Architect ([email protected]), Project “Hadoop Data Transformation”
Hadoop - outro
The Hadoop MapReduce framework is:
• Simple (but not trivial), yet powerful
• Used to process large amounts of data in parallel
• Highly efficient, scalable, fault-tolerant
• Incredibly useful (it actually solves a lot of problems)
• Proven to be an invaluable tool for many companies over the past few years
Well, … surprise! Plain MapReduce is not sufficient for Optymyze!
Why? Because the abstraction level of Optymyze “Data Transformation” could bring any Hadoop developer to their knees (or worse, see the next slide)!
“May the Force be with you!”
Cascading – to the rescue
Cascading framework allows for easy development of complex Hadoop MapReduce jobs
• A Flow is a sequence of manipulations on pipes through which tuple streams flow
• Any flow is reusable and can be used to compose bigger flows
• At runtime, each flow is translated into one or more MapReduce jobs (executed in dependency order)
Cascading
[Diagram: layered stack – Optymyze Data Transformation on top of Cascading, on top of Hadoop]
Cascading – details
• Cascading represents data as Tuples (a single row of data to be processed)
(“A”, 100)
(“B”, 140)
• Inputs and Outputs are called Taps, using one of the following schemes: TextLine, TextDelimited, SequenceFile, WritableSequenceFile
• Each Tap produces or receives a pipe of tuples with the same format
• If needed, there is support for multiple inputs / multiple outputs
• Cascading has integrations with Avro, HBase, JDBC, Memcached, SimpleDB etc.
• Cascading is very flexible, offers a lot of functionality and can be easily extended
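To make the Tuple and Tap concepts concrete, here is a minimal Python sketch (illustrative only – Cascading itself is a Java API): a tuple pairs an ordered list of field names with one row of values, and a TextDelimited-style tap parses each input line into such a tuple. The function name `parse_delimited` is invented for this example.

```python
def parse_delimited(line, fields, delimiter="\t"):
    """Turn one delimited text line into a 'tuple': a dict of field -> value,
    roughly what a TextDelimited scheme produces for each input line."""
    values = line.rstrip("\n").split(delimiter)
    return dict(zip(fields, values))

row = parse_delimited("A\t100", ["letter", "occurrences"])
# row == {"letter": "A", "occurrences": "100"}
```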
Cascading - example
[“letter”,”occurrences”]
(“A”, 1)
(“B”, 5)
(“A”, 2)
(“B”, 6)
(“A”, 3)
Applying a GroupBy(“letter”) produces:
(“A”, [1,2,3])
(“B”, [5,6])
A subsequent Sum(“occurrences”) produces:
[“letter”, “sum”]
(“A”, 6)
(“B”, 11)
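The GroupBy + Sum example above can be sketched in a few lines of Python (the function names are illustrative, not Cascading API; Cascading expresses the same flow in Java):

```python
from itertools import groupby

def group_by(stream, key_field, value_field):
    """GroupBy: collect the values per distinct key,
    e.g. ("A", 1), ("A", 2) -> ("A", [1, 2])."""
    stream = sorted(stream, key=lambda t: t[key_field])  # the shuffle/sort phase
    return {k: [t[value_field] for t in g]
            for k, g in groupby(stream, key=lambda t: t[key_field])}

def sum_aggregator(groups):
    """Every + Sum: reduce each group's list of values to its sum."""
    return {k: sum(vs) for k, vs in groups.items()}

stream = [{"letter": "A", "occurrences": 1}, {"letter": "B", "occurrences": 5},
          {"letter": "A", "occurrences": 2}, {"letter": "B", "occurrences": 6},
          {"letter": "A", "occurrences": 3}]
grouped = group_by(stream, "letter", "occurrences")  # {"A": [1, 2, 3], "B": [5, 6]}
summed = sum_aggregator(grouped)                     # {"A": 6, "B": 11}
```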
Cascading - features
• Each – executes some processing for each tuple in the input stream
• GroupBy – (similar to SQL) groups the input tuples using a set of grouping fields (the task is achieved in the Reduce phase, while the Map phase just decorates the output with some grouping flags)
• CoGroup – joins together several tuple streams
• Every – executes some processing (aggregator, buffer) for a group of tuples in the input (the group being generated by a CoGroup or GroupBy)
• HashJoin – allows two or more tuple streams to be joined into a single stream via a Joiner, when all but one tuple stream are considered small enough to fit into memory
• Split – takes a single stream and sends it down multiple paths
• Merge – allows multiple streams with the same fields to be merged back into a single stream
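HashJoin's trick – the small side lives in an in-memory hash map, so the large stream is joined in a single map-side pass with no reduce phase – can be sketched as follows (illustrative Python; names are invented, not Cascading API):

```python
def hash_join(large_stream, small_stream, key):
    """Map-side join: build a lookup from the small stream, stream the large one."""
    lookup = {t[key]: t for t in small_stream}   # small side fits in memory
    for row in large_stream:
        match = lookup.get(row[key])
        if match is not None:                    # inner-join semantics
            merged = dict(row)
            merged.update({k: v for k, v in match.items() if k != key})
            yield merged

sales = [{"id": 1, "amount": 100}, {"id": 2, "amount": 140}]
names = [{"id": 1, "name": "A"}, {"id": 2, "name": "B"}]
joined = list(hash_join(sales, names, "id"))
# [{"id": 1, "amount": 100, "name": "A"}, {"id": 2, "amount": 140, "name": "B"}]
```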
Hadoop Data Transformation
• A job = List<Process>
  – DataLoad – a process
  – DataTransformation process = List<Operation>:
    • Aggregate Records
    • Post Records
    • Rank Records
    • Calculate Fields
    • Calculate Running Fields
    • Merge Fields
    • Merge Records
    • Roll Up Using Hierarchy
  – DataExport – a process
Data Load
• Copies the file to HDFS
• Executes a Cascading job that checks for duplicates
• Stores the result on HDFS
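The duplicate check can be pictured as a pass that keeps only the first tuple per key (a hypothetical sketch of the semantics; the real step is a Cascading flow, and the key fields depend on the data set):

```python
def deduplicate(rows, key_fields):
    """Keep the first occurrence of each key; drop subsequent duplicates."""
    seen = set()
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        if key not in seen:
            seen.add(key)
            yield row

rows = [{"id": 1, "v": "x"}, {"id": 1, "v": "y"}, {"id": 2, "v": "z"}]
unique = list(deduplicate(rows, ["id"]))  # keeps id=1 ("x") and id=2 ("z")
```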
Hadoop Data Transformation
[Diagram: File(s) → Hadoop HDFS (folder of part-files) → DT Job (DT Process → DT Process → DT Process) → folder of part-files → PostRecords → file(s) to be loaded to Oracle → Oracle]
Data Transformation - Operation
• Input Tap – according to the operation input type:
  – TABLE – use a JDBC Tap
  – OPERATION – either:
    • an output pipe from a previous operation
    • an HDFS folder
• (optional) Input Filter
• The actual processing block for this operation
• (optional) Output Filter
• Output Tap:
  – Operation – the output is stored on HDFS as a folder
  – PostOperation – an Oracle destination table, with 2 ways to get data there:
    • JDBC output tap (batch-inserts rows into the destination table)
    • HDFS output tap (stores the output as an HDFS folder; at the end the destination table is loaded using an “external” table)
[Diagram: (optional) Filter → Operation → (optional) Filter]
Aggregate Records
• Input pipe (with optional input filter)
• GroupBy – on the grouping fields from the AggregateRecords definition
• As many Every steps as needed (one for each AggregateCalculation), each of them iterating over the GroupBy results and adding an Aggregator (see below)
• Output pipe (with optional output filter)
• Aggregators (applied to all the values in the relevant column of the current group):
  – SumAggregator
  – AverageAggregator
  – Last(Character)Aggregator
  – Max(Character)Aggregator
  – Min(Character)Aggregator
  – CountAggregator
  – CountDistinctAggregator
  – MedianAggregator
  – StdDevAggregator (with sample and population options)
[Diagram: (optional) Filter → GroupBy → Every (Aggregator ×3) → (optional) Filter]
Calculate Fields
• Input pipe (with optional input filter)
• Each – with a CustomCalculateFields Function including all calculation definitions
• A calculation definition might:
  – create a new field in the input tuple (which can later be reused in subsequent formulas)
  – overwrite an existing field with a new value
• Each calculated field is a “math tree” – the calculation formula is defined recursively (existing input fields & values are used in functions or by other calculated fields)
• Output pipe (with optional output filter)
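Such a “math tree” can be evaluated with a small recursive walk. The sketch below is hypothetical – the node layout, field names, and operator set are invented for illustration – but it shows how a calculated field can reuse input fields and feed later formulas:

```python
def evaluate(node, row):
    """Recursively evaluate a calculation tree against one input tuple."""
    if node["type"] == "const":
        return node["value"]
    if node["type"] == "field":      # reference an existing or calculated field
        return row[node["name"]]
    args = [evaluate(child, row) for child in node["args"]]
    ops = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}
    return ops[node["op"]](*args)

# commission = amount * rate
row = {"amount": 200.0, "rate": 0.05}
tree = {"type": "op", "op": "mul",
        "args": [{"type": "field", "name": "amount"},
                 {"type": "field", "name": "rate"}]}
row["commission"] = evaluate(tree, row)  # stored back, reusable in later formulas
```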
[Diagram: (optional) Filter → Each (Function: Calculation ×3) → (optional) Filter]
Merge Fields
• Input pipe(s) (with optional input filters)
• A CoGroup – using the MatchingCondition from the definition, placing together all the left- and right-hand-side columns
• A Discard pipe – removes the duplicated columns (the ones used in the join)
• Output pipe (with optional output filter)
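The CoGroup-then-Discard combination can be sketched as an inner join that drops the right-hand copies of the join columns (illustrative Python with invented names; the real steps are Cascading pipes):

```python
def merge_fields(left, right, join_keys):
    """Join two tuple streams on join_keys, discarding duplicated join columns."""
    index = {}
    for r in right:
        index.setdefault(tuple(r[k] for k in join_keys), []).append(r)
    for l in left:
        for r in index.get(tuple(l[k] for k in join_keys), []):
            merged = dict(l)
            # the Discard step: keep only one copy of each join column
            merged.update({k: v for k, v in r.items() if k not in join_keys})
            yield merged

left = [{"id": 1, "amount": 100}]
right = [{"id": 1, "region": "EU"}]
merged_rows = list(merge_fields(left, right, ["id"]))
# [{"id": 1, "amount": 100, "region": "EU"}] – no duplicated "id" column
```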
[Diagram: two filtered input streams → CoGroup → Discard → (optional) Filter]
Rank Records
• Input pipe (with optional input filter)
• For each RankingCalculation:
  – GroupBy – over the grouping fields (or none, when ranking over all records)
  – An Every with a custom RankingBuffer
• Output pipe (with optional output filter)
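The GroupBy + RankingBuffer pair can be sketched like this (illustrative Python; the grouping and ranking fields are invented examples):

```python
from collections import defaultdict

def rank_records(rows, group_field, rank_field):
    """Group the tuples, then have a 'buffer' walk each group in ranking
    order and emit a rank (1..n) per tuple."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_field]].append(row)
    out = []
    for group in groups.values():
        ordered = sorted(group, key=lambda r: r[rank_field], reverse=True)
        for rank, row in enumerate(ordered, start=1):
            out.append({**row, "rank": rank})
    return out

rows = [{"team": "A", "score": 10}, {"team": "A", "score": 30},
        {"team": "B", "score": 20}]
ranked = rank_records(rows, "team", "score")
# team A: score 30 -> rank 1, score 10 -> rank 2; team B: score 20 -> rank 1
```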
[Diagram: (optional) Filter → (GroupBy → Every (RankingBuffer)) ×N → (optional) Filter]
Post Records
• Input pipe (with optional input filter)
• Each – with an Identity operation meant just to select the output fields, in the required order
• Output pipe (with optional output filter)
[Diagram: (optional) Filter → Each (Identity) → (optional) Filter]
Benchmarks
Questions? I … DARE you!
… we’re still only at the tip of the iceberg – I promise more in my future presentations!
Until then:
•Follow me on Yammer and ask me questions at any time!
• Interesting links:
  – https://github.com/Cascading/cascading.samples/tree/master/wordcount
  – http://docs.cascading.org/impatient/
Known not-yet-implemented parts
• periods
• dates
• entities
• logging & messages
• Pre / Post / Per-line validations (except what can be accomplished using the Filter functionality that’s already implemented)
• Future extensions, ideas:
  – Add source Data Connectors
    • for maximum throughput (parallel download from external sources)
    • currently supported: files (from local file systems / HDFS), Oracle
  – Logi integration