building complex data workflows with cascading on...

37
Building Complex Data Workflows with Cascading on Hadoop Gagan Agrawal, Sr. Principal Engineer, Snapdeal

Upload: vohuong

Post on 30-Mar-2018

228 views

Category:

Documents


4 download

TRANSCRIPT

Page 1: Building Complex Data Workflows with Cascading on …developermarch.com/developersummit/2015/report/... · Building Complex Data Workflows with Cascading on Hadoop ... • Pipes –

Building Complex Data Workflows with Cascading on Hadoop

Gagan Agrawal, Sr. Principal Engineer, Snapdeal

Page 2: Building Complex Data Workflows with Cascading on …developermarch.com/developersummit/2015/report/... · Building Complex Data Workflows with Cascading on Hadoop ... • Pipes –

Agenda

• What is Cascading• Why use Cascading• Cascading Building Blocks• Example Workflows• Testing• Advantages / Disadvantages• Cascading @ Snapdeal

Page 3: Building Complex Data Workflows with Cascading on …developermarch.com/developersummit/2015/report/... · Building Complex Data Workflows with Cascading on Hadoop ... • Pipes –

What is Cascading ?

Page 4: Building Complex Data Workflows with Cascading on …developermarch.com/developersummit/2015/report/... · Building Complex Data Workflows with Cascading on Hadoop ... • Pipes –

What is Cascading ?

• Abstraction over Map Reduce• API for data-processing workflows• JVM framework and SDK for creating abstracted

data flows• Translates data flows into actual

Hadoop/RDBMS/Local jobs

Page 5: Building Complex Data Workflows with Cascading on …developermarch.com/developersummit/2015/report/... · Building Complex Data Workflows with Cascading on Hadoop ... • Pipes –

Why use Cascading ?

Page 6: Building Complex Data Workflows with Cascading on …developermarch.com/developersummit/2015/report/... · Building Complex Data Workflows with Cascading on Hadoop ... • Pipes –

Why use Cascading ?

• Options in Hadoop

– Map Reduce API– Pig– Hive

Page 7: Building Complex Data Workflows with Cascading on …developermarch.com/developersummit/2015/report/... · Building Complex Data Workflows with Cascading on Hadoop ... • Pipes –

Why use Cascading ?

• Map Reduce is

– Low Level API– Requires lot of boilerplate code– Can be difficult to chain jobs

Page 8: Building Complex Data Workflows with Cascading on …developermarch.com/developersummit/2015/report/... · Building Complex Data Workflows with Cascading on Hadoop ... • Pipes –

Why use Cascading ?

• Pig

– Good for scripting– Can be difficult to manage multi-module

workflows– Can be difficult to Unit Test– UDF written in different language (Java,

Python (via streaming) etc.)

Page 9: Building Complex Data Workflows with Cascading on …developermarch.com/developersummit/2015/report/... · Building Complex Data Workflows with Cascading on Hadoop ... • Pipes –

Why use Cascading ?

• Hive

– Good for simple queries– Complex workflows can be very difficult to

implement

Page 10: Building Complex Data Workflows with Cascading on …developermarch.com/developersummit/2015/report/... · Building Complex Data Workflows with Cascading on Hadoop ... • Pipes –

Why use Cascading ?

• Hive

– Good for simple queries– Complex workflows can be very difficult to

implement

Page 11: Building Complex Data Workflows with Cascading on …developermarch.com/developersummit/2015/report/... · Building Complex Data Workflows with Cascading on Hadoop ... • Pipes –

Why use Cascading ?

• Cascading

– Helps create higher level data processing abstractions

– Sophisticated data pipelines– Re-usable components

Page 12: Building Complex Data Workflows with Cascading on …developermarch.com/developersummit/2015/report/... · Building Complex Data Workflows with Cascading on Hadoop ... • Pipes –

Cascading Building Blocks

Page 13: Building Complex Data Workflows with Cascading on …developermarch.com/developersummit/2015/report/... · Building Complex Data Workflows with Cascading on Hadoop ... • Pipes –

Cascading Building Blocks

• Tap– Source– Sink

• Pipes– Each– Every– Merge– GroupBy– CoGroup– HashJoin

Page 14: Building Complex Data Workflows with Cascading on …developermarch.com/developersummit/2015/report/... · Building Complex Data Workflows with Cascading on Hadoop ... • Pipes –

Cascading Building Blocks

• Tap– Source– Sink

• Pipes– Each– Every– Merge– GroupBy– CoGroup– HashJoin

Page 15: Building Complex Data Workflows with Cascading on …developermarch.com/developersummit/2015/report/... · Building Complex Data Workflows with Cascading on Hadoop ... • Pipes –

Cascading Building Blocks

• Tap– Source– Sink

• Pipes– Each– Every– Merge– GroupBy– CoGroup– HashJoin

Page 16: Building Complex Data Workflows with Cascading on …developermarch.com/developersummit/2015/report/... · Building Complex Data Workflows with Cascading on Hadoop ... • Pipes –

Cascading Building Blocks

• Functions• Filter• Aggregator• Buffer• Sub Assemblies• Flows• Cascades•

Page 17: Building Complex Data Workflows with Cascading on …developermarch.com/developersummit/2015/report/... · Building Complex Data Workflows with Cascading on Hadoop ... • Pipes –

Cascading Terminologies

• Flow – A path for data with some number of inputs,

some operations, and some outputs• Cascade

– A series of connected flows• Operation

– A function applied to data, yielding new data•

Page 18: Building Complex Data Workflows with Cascading on …developermarch.com/developersummit/2015/report/... · Building Complex Data Workflows with Cascading on Hadoop ... • Pipes –

Cascading Terminologies

• Pipe– Moves data from some place to some other

place• Tap

– Feeds data from outside the flow into it and writes data from inside the flow out of it.

Page 19: Building Complex Data Workflows with Cascading on …developermarch.com/developersummit/2015/report/... · Building Complex Data Workflows with Cascading on Hadoop ... • Pipes –

Pipe Assemblies

• Define work against a Tuple Stream• May have multiple sources and sinks

– Splits– Merges– Joins

Page 20: Building Complex Data Workflows with Cascading on …developermarch.com/developersummit/2015/report/... · Building Complex Data Workflows with Cascading on Hadoop ... • Pipes –

Pipe

Page 21: Building Complex Data Workflows with Cascading on …developermarch.com/developersummit/2015/report/... · Building Complex Data Workflows with Cascading on Hadoop ... • Pipes –

Example Workflows

Page 22: Building Complex Data Workflows with Cascading on …developermarch.com/developersummit/2015/report/... · Building Complex Data Workflows with Cascading on Hadoop ... • Pipes –

Distributed Copy

Page 23: Building Complex Data Workflows with Cascading on …developermarch.com/developersummit/2015/report/... · Building Complex Data Workflows with Cascading on Hadoop ... • Pipes –

Word Count

Page 24: Building Complex Data Workflows with Cascading on …developermarch.com/developersummit/2015/report/... · Building Complex Data Workflows with Cascading on Hadoop ... • Pipes –

Custom Function

• A function expects a stream of individual tuples• Returns zero or more tuples• Used with Each pipe• To create custom function

– Sublcass cascading.operation.BaseOperation– Implement cascading.operation.Function

Page 25: Building Complex Data Workflows with Cascading on …developermarch.com/developersummit/2015/report/... · Building Complex Data Workflows with Cascading on Hadoop ... • Pipes –

Custom Function

Page 26: Building Complex Data Workflows with Cascading on …developermarch.com/developersummit/2015/report/... · Building Complex Data Workflows with Cascading on Hadoop ... • Pipes –

Joins

• CoGroup• HashJoin

Page 27: Building Complex Data Workflows with Cascading on …developermarch.com/developersummit/2015/report/... · Building Complex Data Workflows with Cascading on Hadoop ... • Pipes –

Joins

Page 28: Building Complex Data Workflows with Cascading on …developermarch.com/developersummit/2015/report/... · Building Complex Data Workflows with Cascading on Hadoop ... • Pipes –

TF-IDF Example

Page 29: Building Complex Data Workflows with Cascading on …developermarch.com/developersummit/2015/report/... · Building Complex Data Workflows with Cascading on Hadoop ... • Pipes –

Testing

Page 30: Building Complex Data Workflows with Cascading on …developermarch.com/developersummit/2015/report/... · Building Complex Data Workflows with Cascading on Hadoop ... • Pipes –

Testing

• Create flows entirely in code on a local machine• Write tests for controlled sample data sets• Run tests as a regular old Java without needing

access to actual Hadoop or databases• Local machine and CI testing are easy

Page 31: Building Complex Data Workflows with Cascading on …developermarch.com/developersummit/2015/report/... · Building Complex Data Workflows with Cascading on Hadoop ... • Pipes –

Advantages / Disadvantages

Page 32: Building Complex Data Workflows with Cascading on …developermarch.com/developersummit/2015/report/... · Building Complex Data Workflows with Cascading on Hadoop ... • Pipes –

Advantages

• Re-usability– Pipe assemblies are designed for reuse– Once created and tested, use them in other

flows– Write logic to do something only once

• Simple Stack– Cascading creates DAG of dependent jobs

for us– Removes most of the need for oozie

Page 33: Building Complex Data Workflows with Cascading on …developermarch.com/developersummit/2015/report/... · Building Complex Data Workflows with Cascading on Hadoop ... • Pipes –

Advantages

• Simpler Stack– Keeps track of where a flow fails and can

rerun from that point on failure• Common Code Base

– Since everything is written in java, everybody can use same terms and same tech

Page 34: Building Complex Data Workflows with Cascading on …developermarch.com/developersummit/2015/report/... · Building Complex Data Workflows with Cascading on Hadoop ... • Pipes –

Disadvantages

• JVM Based– Java / Scala / Clojure

• Doesn't have job scheduler– Can figure out dependency graph for jobs,

but nothing to run them on regular interval– We still need scheduler like quartz

Page 35: Building Complex Data Workflows with Cascading on …developermarch.com/developersummit/2015/report/... · Building Complex Data Workflows with Cascading on Hadoop ... • Pipes –

Disadvantages

• No real built-in monitoring• Easy to have a flow report what it has done;

hard to watch it in progress

Page 36: Building Complex Data Workflows with Cascading on …developermarch.com/developersummit/2015/report/... · Building Complex Data Workflows with Cascading on Hadoop ... • Pipes –

Cascading @ Snapdeal

Page 37: Building Complex Data Workflows with Cascading on …developermarch.com/developersummit/2015/report/... · Building Complex Data Workflows with Cascading on Hadoop ... • Pipes –