Data Processing with Cascading Java API on Apache Hadoop


Page 1: Data Processing with Cascading Java API on Apache Hadoop

“Enterprise Data Workflows with Cascading”

• Cascading is an application framework that enables developers, data analysts, and data scientists to simply develop robust data analytics and data management applications on Apache Hadoop.

• Cascading is a query API and query planner used for defining, sharing, and executing data processing workflows on a distributed data grid or cluster based upon Apache Hadoop.

• Cascading greatly simplifies the complexities of Hadoop application development, job creation, and job scheduling.

• Cascading was developed to allow organizations to rapidly develop complex data processing applications.

Page 2: Data Processing with Cascading Java API on Apache Hadoop

Cascading API

Tap
• Lfs
• Dfs
• Hfs
• MultiSourceTap
• MultiSinkTap
• TemplateTap
• GlobHfs
• S3fs (deprecated)
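
For instance (a sketch added here, not from the original slides; the path patterns are assumptions), GlobHfs and MultiSourceTap make it easy to read many files through a single source Tap, reusing a Scheme such as the TextDelimited one created below:

// match every part file across the matching date directories
GlobHfs dailyParts = new GlobHfs( scheme, "logs/2014-06-*/part-*" );
// combine the globbed files with another directory into one logical source
Tap source = new MultiSourceTap( dailyParts, new Hfs( scheme, "logs/archive" ) );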

Scheme
• TextLine
• TextDelimited
• SequenceFile
• WritableSequenceFile

Tap types
• SinkMode.KEEP
• SinkMode.REPLACE
• SinkMode.UPDATE

Tuple
• A single ‘row’ of data being processed
• Each column is named
• Data can be accessed by name or position
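
As a quick illustration (a sketch added here, not on the original slide), a TupleEntry pairs a Fields declaration with a Tuple of values and supports both access styles:

TupleEntry entry = new TupleEntry( new Fields( "user", "score" ), new Tuple( "alice", 42 ) );
String byName = entry.getString( "user" ); // access by field name
String byPosition = entry.getString( 0 );  // access by position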

How to create a Tap?

Fields fields = new Fields( "field_1", "field_2", /* ... */ "field_n" );
TextDelimited scheme = new TextDelimited( fields, "\t" );
Hfs input = new Hfs( scheme, "path/to/hdfs_file", SinkMode.REPLACE );

Page 3: Data Processing with Cascading Java API on Apache Hadoop

Pipe Assemblies
• Pipe assemblies define what work should be done against a tuple stream; at runtime, tuple streams are read from Tap sources and written to Tap sinks.
• Pipe assemblies may have multiple sources and multiple sinks, and they can define splits, merges, and joins to manipulate how the tuple streams interact.

Assemblies
• Pipe
• Each
• GroupBy
• CoGroup
• Every
• SubAssembly
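
To make the chaining concrete, here is a minimal sketch (the pipe name and the "line" and "host" fields are illustrative assumptions):

Pipe assembly = new Pipe( "events" );
// Each applies a function or filter to every tuple in the stream
assembly = new Each( assembly, new Fields( "line" ), new RegexFilter( "ERROR" ) );
// GroupBy groups the stream on a key; Every aggregates within each group
assembly = new GroupBy( assembly, new Fields( "host" ) );
assembly = new Every( assembly, Fields.ALL, new Count(), Fields.ALL );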

Function [applicable to the Each operation]

• Identity Function

• Debug Function

• Sample and Limit Functions

• Insert Function

• Text Functions

• Regular Expression Operations

• Java Expression Operations

• "first-name" is a valid field name for use with Cascading, but this expression, first-name.trim(), will fail.

• Custom Functions
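
As an illustration of the last item, a custom Function subclasses BaseOperation; the following is a minimal sketch (the class name and behavior are assumptions, not from the slides):

public class Uppercase extends BaseOperation implements Function
{
  public Uppercase( Fields fieldDeclaration )
  {
    // expect one argument, declare the given output fields
    super( 1, fieldDeclaration );
  }

  public void operate( FlowProcess flowProcess, FunctionCall functionCall )
  {
    // read the single argument value and emit its uppercased form
    String value = functionCall.getArguments().getString( 0 );
    functionCall.getOutputCollector().add( new Tuple( value.toUpperCase() ) );
  }
}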


Page 4: Data Processing with Cascading Java API on Apache Hadoop

Filter [applicable to the Each operation]

• And

• Or

• Not

• Xor

• NotNull

• Null

• RegexFilter

• Custom Filter
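
For example (a sketch; the "email" field and the pattern are assumptions), a filter is applied through an Each pipe, which discards tuples the filter rejects:

// keep only tuples whose "email" field looks like an address
assembly = new Each( assembly, new Fields( "email" ), new RegexFilter( "^[^@]+@[^@]+$" ) );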

Aggregator [applicable to the Every operation]

• Average

• Count

• First

• Last

• Max

• Min

• Sum

• Custom Aggregator
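
A typical use (a sketch; the "customer" and "amount" fields are assumptions) pairs a GroupBy with an Every that applies the aggregator to each group:

Pipe totals = new GroupBy( assembly, new Fields( "customer" ) );
// sum the "amount" values within each group into a new "total" field
totals = new Every( totals, new Fields( "amount" ), new Sum( new Fields( "total" ) ), Fields.ALL );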

Buffer
• It is very similar to the typical Reducer interface
• It is very useful when header or footer values need to be inserted into a grouping, or if values need to be inserted into the middle of the group values
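
A minimal Buffer sketch (class name, declared field, and marker values are assumptions) showing how tuples can be emitted before and after the grouped values:

public class InsertMarkers extends BaseOperation implements Buffer
{
  public InsertMarkers( Fields fieldDeclaration )
  {
    super( fieldDeclaration );
  }

  public void operate( FlowProcess flowProcess, BufferCall bufferCall )
  {
    // emit a header tuple before the group values
    bufferCall.getOutputCollector().add( new Tuple( "HEADER" ) );

    // pass each grouped value through
    Iterator<TupleEntry> arguments = bufferCall.getArgumentsIterator();
    while( arguments.hasNext() )
      bufferCall.getOutputCollector().add( new Tuple( arguments.next().getString( 0 ) ) );

    // emit a footer tuple after the group values
    bufferCall.getOutputCollector().add( new Tuple( "FOOTER" ) );
  }
}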


Page 5: Data Processing with Cascading Java API on Apache Hadoop

CoGroup
• InnerJoin
• OuterJoin
• LeftJoin
• RightJoin
• MixedJoin

Flow
• To create a Flow, it must be planned through the FlowConnector object.
• The connect() method is used to create new Flow instances based on a set of source Taps, sink Taps, and a pipe assembly.

Cascades
• Groups of Flows are called Cascades
• Custom MapReduce jobs can participate in a Cascade

Flow flow = new FlowConnector( new Properties() ).connect( "flow-name", source, sink, pipe );
flow.complete();

CascadeConnector cascadeConnector = new CascadeConnector();
Cascade cascade = cascadeConnector.connect( flow1, flow2, flow3 );
cascade.complete();

LHS = [0,a] [1,b] [2,c]
RHS = [0,A] [2,C] [3,D]

InnerJoin: [0,a,0,A] [2,c,2,C]
OuterJoin: [0,a,0,A] [1,b,null,null] [2,c,2,C] [null,null,3,D]
LeftJoin: [0,a,0,A] [1,b,null,null] [2,c,2,C]
RightJoin: [0,a,0,A] [2,c,2,C] [null,null,3,D]
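
In code, the example above looks roughly like this sketch (pipe and field names are assumptions; giving the two streams distinct field names keeps the joined output fields unique):

// lhs tuples have fields ("id", "lower"); rhs tuples have fields ("id2", "upper")
Pipe joined = new CoGroup( lhsPipe, new Fields( "id" ), rhsPipe, new Fields( "id2" ), new InnerJoin() );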

Page 6: Data Processing with Cascading Java API on Apache Hadoop

Flow Connector
• HadoopFlowConnector
• LocalFlowConnector

Monitoring
Implement the FlowListener interface:
• onStarting
• onStopping
• onCompleted
• onThrowable
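
A minimal sketch of such a listener (the class name and logging body are illustrative):

public class LoggingFlowListener implements FlowListener
{
  public void onStarting( Flow flow ) { System.out.println( "starting: " + flow.getName() ); }
  public void onStopping( Flow flow ) { System.out.println( "stopping: " + flow.getName() ); }
  public void onCompleted( Flow flow ) { System.out.println( "completed: " + flow.getName() ); }
  public boolean onThrowable( Flow flow, Throwable throwable )
  {
    throwable.printStackTrace();
    return false; // not handled, so the exception is rethrown and the Flow fails
  }
}

// register the listener before running the flow
flow.addListener( new LoggingFlowListener() );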

Testing
• Use ClusterTestCase if you want to launch an embedded Hadoop cluster inside your TestCase
• A few validation and Hadoop helper functions are provided
• Doesn’t support the Hadoop 0.21 testing library
• Use CascadingTestCase with JUnit
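
For instance, CascadingTestCase can invoke a single operation outside a cluster; this is a hedged sketch (the splitter, fields, and expected count are assumptions):

public class SplitterTest extends CascadingTestCase
{
  public void testSplitter()
  {
    RegexSplitGenerator splitter = new RegexSplitGenerator( new Fields( "token" ), "\\s+" );
    TupleEntry arguments = new TupleEntry( new Fields( "text" ), new Tuple( "hello cascading" ) );

    // run the function on one tuple and collect its output
    TupleListCollector collector = invokeFunction( splitter, arguments, new Fields( "token" ) );

    int count = 0;
    for( Tuple tuple : collector )
      count++;

    assertEquals( 2, count ); // "hello" and "cascading"
  }
}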


Page 7: Data Processing with Cascading Java API on Apache Hadoop

Field Algebra

Field sets are constant values on the Fields class and can be used in many places where a Fields instance is expected. They are:

• Fields.ALL

• Fields.RESULTS

• Fields.REPLACE

• Fields.SWAP

• Fields.ARGS

• Fields.GROUP

• Fields.VALUES

• Fields.UNKNOWN

Pipe assembly = new Each( assembly, Fields.ALL, function, Fields.RESULTS );
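
As another sketch (the "name" field and the expression are assumptions), Fields.REPLACE lets an operation overwrite its argument fields in place; the expression works here because name, unlike first-name, is a valid Java identifier:

// trim the "name" field and write the result back over it
assembly = new Each( assembly, new Fields( "name" ),
  new ExpressionFunction( new Fields( "name" ), "name.trim()", String.class ), Fields.REPLACE );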

Page 8: Data Processing with Cascading Java API on Apache Hadoop

Word Counting the Hadoop Way with Cascading

As “word counting” has become the customary “Hello, World!” application for programmers who are new to Hadoop and the MapReduce paradigm, let’s have a look at an example in Cascading:

Properties properties = new Properties();
HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

// create source and sink taps (docPath and wcPath are the input and output paths)

Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );

Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

// specify a regex operation to split the "document" text lines into a token stream

Fields token = new Fields( "token" );

Fields text = new Fields( "text" );

RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );

// only returns "token"

Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

// determine the word counts

Pipe wcPipe = new Pipe( "wc", docPipe );

wcPipe = new GroupBy( wcPipe, token );

wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

// connect the taps, pipes, etc., into a flow

FlowDef flowDef = FlowDef.flowDef()

.setName( "wc" )

.addSource( docPipe, docTap )

.addTailSink( wcPipe, wcTap );

// write a DOT file and run the flow

Flow wcFlow = flowConnector.connect( flowDef );

wcFlow.writeDOT( "dot/wc.dot" );

wcFlow.complete();

Page 9: Data Processing with Cascading Java API on Apache Hadoop

Just a few notes worth mentioning about the overall logic adopted in the preceding code listing:

• The code creates Pipe objects for the input/output of data, then uses a RegexSplitGenerator (which applies a regex to split the text on word boundaries) inside an Each pipe, which returns another Pipe (the docPipe) that turns the document text into a token stream.

• A GroupBy is defined to count the occurrences of each token, and then the pipes are connected together, as in a “cascade”, via the flowDef object.

• Finally, a DOT file is generated to depict the Cascading flow graphically. The DOT file can be loaded into OmniGraffle or Visio, and it is really helpful for troubleshooting MapReduce workflows in Cascading.

Page 10: Data Processing with Cascading Java API on Apache Hadoop

??

Page 11: Data Processing with Cascading Java API on Apache Hadoop

http://docs.cascading.org/cascading/1.2/userguide/htmlsingle/

Page 12: Data Processing with Cascading Java API on Apache Hadoop

END!!!