Data Processing with Cascading Java API on Apache Hadoop


Page 1: Data Processing with Cascading Java API on Apache Hadoop

“Enterprise Data Workflows with Cascading”

• Cascading is an application framework that enables developers, data analysts, and data scientists to simply develop robust data analytics and data management applications on Apache Hadoop.

• Cascading is a query API and query planner used for defining, sharing, and executing data processing workflows on a distributed data grid or cluster based upon Apache Hadoop.

• Cascading greatly simplifies the complexities of Hadoop application development, job creation, and job scheduling.

• Cascading was developed to allow organizations to rapidly develop complex data processing applications.

Page 2: Data Processing with Cascading Java API on Apache Hadoop

Cascading API

Tap
• Lfs
• Dfs
• Hfs
• MultiSourceTap
• MultiSinkTap
• TemplateTap
• GlobHfs
• S3fs (deprecated)
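
For instance (a sketch added here, not from the original slides; the path patterns are assumptions), GlobHfs and MultiSourceTap make it easy to read many files through a single source Tap, reusing a Scheme such as the TextDelimited one created below:

// match every part file across the matching date directories
GlobHfs dailyParts = new GlobHfs( scheme, "logs/2014-06-*/part-*" );
// combine the globbed files with another directory into one logical source
Tap source = new MultiSourceTap( dailyParts, new Hfs( scheme, "logs/archive" ) );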

Scheme
• TextLine
• TextDelimited
• SequenceFile
• WritableSequenceFile

Tap types
• SinkMode.KEEP
• SinkMode.REPLACE
• SinkMode.UPDATE

Tuple
• A single ‘row’ of data being processed
• Each column is named
• Data can be accessed by name or position
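
As a quick illustration (a sketch added here, not on the original slide), a TupleEntry pairs a Fields declaration with a Tuple of values and supports both access styles:

TupleEntry entry = new TupleEntry( new Fields( "user", "score" ), new Tuple( "alice", 42 ) );
String byName = entry.getString( "user" ); // access by field name
String byPosition = entry.getString( 0 );  // access by position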

How to create a Tap?

Fields fields = new Fields( "field_1", "field_2", /* ... */ "field_n" );
TextDelimited scheme = new TextDelimited( fields, "\t" );
Hfs input = new Hfs( scheme, "path/to/hdfs_file", SinkMode.REPLACE );

Page 3: Data Processing with Cascading Java API on Apache Hadoop

Pipe Assemblies
• Pipe assemblies define what work should be done against a tuple stream; at runtime, tuple streams are read from Tap sources and written to Tap sinks.
• Pipe assemblies may have multiple sources and multiple sinks, and they can define splits, merges, and joins to manipulate how the tuple streams interact.

Assemblies
• Pipe
• Each
• GroupBy
• CoGroup
• Every
• SubAssembly
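
To make the chaining concrete, here is a minimal sketch (the pipe name and the "line" and "host" fields are illustrative assumptions):

Pipe assembly = new Pipe( "events" );
// Each applies a function or filter to every tuple in the stream
assembly = new Each( assembly, new Fields( "line" ), new RegexFilter( "ERROR" ) );
// GroupBy groups the stream on a key; Every aggregates within each group
assembly = new GroupBy( assembly, new Fields( "host" ) );
assembly = new Every( assembly, Fields.ALL, new Count(), Fields.ALL );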

Function [applicable to the Each operation]

• Identity Function

• Debug Function

• Sample and Limit Functions

• Insert Function

• Text Functions

• Regular Expression Operations

• Java Expression Operations

• "first-name" is a valid field name for use with Cascading, but this expression, first-name.trim(), will fail.

• Custom Functions
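
As an illustration of the last item, a custom Function subclasses BaseOperation; the following is a minimal sketch (the class name and behavior are assumptions, not from the slides):

public class Uppercase extends BaseOperation implements Function
{
  public Uppercase( Fields fieldDeclaration )
  {
    // expect one argument, declare the given output fields
    super( 1, fieldDeclaration );
  }

  public void operate( FlowProcess flowProcess, FunctionCall functionCall )
  {
    // read the single argument value and emit its uppercased form
    String value = functionCall.getArguments().getString( 0 );
    functionCall.getOutputCollector().add( new Tuple( value.toUpperCase() ) );
  }
}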


Page 4: Data Processing with Cascading Java API on Apache Hadoop

Filter [applicable to the Each operation]

• And

• Or

• Not

• Xor

• NotNull

• Null

• RegexFilter

• Custom Filter
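
For example (a sketch; the "email" field and the pattern are assumptions), a filter is applied through an Each pipe, which discards tuples the filter rejects:

// keep only tuples whose "email" field looks like an address
assembly = new Each( assembly, new Fields( "email" ), new RegexFilter( "^[^@]+@[^@]+$" ) );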

Aggregator [applicable to the Every operation]

• Average

• Count

• First

• Last

• Max

• Min

• Sum

• Custom Aggregator
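
A typical use (a sketch; the "customer" and "amount" fields are assumptions) pairs a GroupBy with an Every that applies the aggregator to each group:

Pipe totals = new GroupBy( assembly, new Fields( "customer" ) );
// sum the "amount" values within each group into a new "total" field
totals = new Every( totals, new Fields( "amount" ), new Sum( new Fields( "total" ) ), Fields.ALL );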

Buffer
• It is very similar to the typical Reducer interface
• It is very useful when header or footer values need to be inserted into a grouping, or if values need to be inserted into the middle of the group values
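
A minimal Buffer sketch (class name, declared field, and marker values are assumptions) showing how tuples can be emitted before and after the grouped values:

public class InsertMarkers extends BaseOperation implements Buffer
{
  public InsertMarkers( Fields fieldDeclaration )
  {
    super( fieldDeclaration );
  }

  public void operate( FlowProcess flowProcess, BufferCall bufferCall )
  {
    // emit a header tuple before the group values
    bufferCall.getOutputCollector().add( new Tuple( "HEADER" ) );

    // pass each grouped value through
    Iterator<TupleEntry> arguments = bufferCall.getArgumentsIterator();
    while( arguments.hasNext() )
      bufferCall.getOutputCollector().add( new Tuple( arguments.next().getString( 0 ) ) );

    // emit a footer tuple after the group values
    bufferCall.getOutputCollector().add( new Tuple( "FOOTER" ) );
  }
}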


Page 5: Data Processing with Cascading Java API on Apache Hadoop

CoGroup
• InnerJoin
• OuterJoin
• LeftJoin
• RightJoin
• MixedJoin

Flow
• To create a Flow, it must be planned through the FlowConnector object.
• The connect() method is used to create new Flow instances based on a set of source Taps, sink Taps, and a pipe assembly.

Cascades
• Groups of Flows are called Cascades
• Custom MapReduce jobs can participate in a Cascade

Flow flow = new FlowConnector( new Properties() ).connect( "flow-name", source, sink, pipe );
flow.complete();

CascadeConnector cascadeConnector = new CascadeConnector();
Cascade cascade = cascadeConnector.connect( flow1, flow2, flow3 );
cascade.complete();

LHS = [0,a] [1,b] [2,c]
RHS = [0,A] [2,C] [3,D]

InnerJoin: [0,a,0,A] [2,c,2,C]
OuterJoin: [0,a,0,A] [1,b,null,null] [2,c,2,C] [null,null,3,D]
LeftJoin: [0,a,0,A] [1,b,null,null] [2,c,2,C]
RightJoin: [0,a,0,A] [2,c,2,C] [null,null,3,D]
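
In code, the example above looks roughly like this sketch (pipe and field names are assumptions; giving the two streams distinct field names keeps the joined output fields unique):

// lhs tuples have fields ("id", "lower"); rhs tuples have fields ("id2", "upper")
Pipe joined = new CoGroup( lhsPipe, new Fields( "id" ), rhsPipe, new Fields( "id2" ), new InnerJoin() );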

Page 6: Data Processing with Cascading Java API on Apache Hadoop

Flow Connector
• HadoopFlowConnector
• LocalFlowConnector

Monitoring
Implement the FlowListener interface:
• onStarting
• onStopping
• onCompleted
• onThrowable
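
A minimal sketch of such a listener (the class name and logging body are illustrative):

public class LoggingFlowListener implements FlowListener
{
  public void onStarting( Flow flow ) { System.out.println( "starting: " + flow.getName() ); }
  public void onStopping( Flow flow ) { System.out.println( "stopping: " + flow.getName() ); }
  public void onCompleted( Flow flow ) { System.out.println( "completed: " + flow.getName() ); }
  public boolean onThrowable( Flow flow, Throwable throwable )
  {
    throwable.printStackTrace();
    return false; // not handled, so the exception is rethrown and the Flow fails
  }
}

// register the listener before running the flow
flow.addListener( new LoggingFlowListener() );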

Testing
• Use ClusterTestCase if you want to launch an embedded Hadoop cluster inside your TestCase
• A few validation and Hadoop helper functions are provided
• Doesn’t support the Hadoop 0.21 testing library
• Use CascadingTestCase with JUnit
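
For instance, CascadingTestCase can invoke a single operation outside a cluster; this is a hedged sketch (the splitter, fields, and expected count are assumptions):

public class SplitterTest extends CascadingTestCase
{
  public void testSplitter()
  {
    RegexSplitGenerator splitter = new RegexSplitGenerator( new Fields( "token" ), "\\s+" );
    TupleEntry arguments = new TupleEntry( new Fields( "text" ), new Tuple( "hello cascading" ) );

    // run the function on one tuple and collect its output
    TupleListCollector collector = invokeFunction( splitter, arguments, new Fields( "token" ) );

    int count = 0;
    for( Tuple tuple : collector )
      count++;

    assertEquals( 2, count ); // "hello" and "cascading"
  }
}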


Page 7: Data Processing with Cascading Java API on Apache Hadoop

Field Algebra

Field sets are constant values on the Fields class and can be used in many places where a Fields instance is expected. They are:

• Fields.ALL

• Fields.RESULTS

• Fields.REPLACE

• Fields.SWAP

• Fields.ARGS

• Fields.GROUP

• Fields.VALUES

• Fields.UNKNOWN

Pipe assembly = new Each( assembly, Fields.ALL, function, Fields.RESULTS );
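
As another sketch (the "name" field and the expression are assumptions), Fields.REPLACE lets an operation overwrite its argument fields in place; the expression works here because name, unlike first-name, is a valid Java identifier:

// trim the "name" field and write the result back over it
assembly = new Each( assembly, new Fields( "name" ),
  new ExpressionFunction( new Fields( "name" ), "name.trim()", String.class ), Fields.REPLACE );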

Page 8: Data Processing with Cascading Java API on Apache Hadoop

Word Counting the Hadoop Way with Cascading

As “word counting” has become the customary “Hello, World!” application for programmers who are new to Hadoop and the MapReduce paradigm, let’s have a look at an example in Cascading:

Properties properties = new Properties();
HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

// create source and sink taps (docPath and wcPath are the input and output paths)

Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );

Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

// specify a regex operation to split the "document" text lines into a token stream

Fields token = new Fields( "token" );

Fields text = new Fields( "text" );

RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );

// only returns "token"

Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

// determine the word counts

Pipe wcPipe = new Pipe( "wc", docPipe );

wcPipe = new GroupBy( wcPipe, token );

wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

// connect the taps, pipes, etc., into a flow

FlowDef flowDef = FlowDef.flowDef()

.setName( "wc" )

.addSource( docPipe, docTap )

.addTailSink( wcPipe, wcTap );

// write a DOT file and run the flow

Flow wcFlow = flowConnector.connect( flowDef );

wcFlow.writeDOT( "dot/wc.dot" );

wcFlow.complete();

Page 9: Data Processing with Cascading Java API on Apache Hadoop

Just a few notes worth mentioning about the overall logic adopted in the preceding code listing:

• The code creates Pipe objects for the input/output of data, then uses a RegexSplitGenerator (which applies a regex to split the text on word boundaries) inside an Each pipe, which returns another Pipe (the docPipe) that turns the document text into a token stream.

• A GroupBy is defined to count the occurrences of each token, and then the pipes are connected together, as in a “cascade”, via the flowDef object.

• Finally, a DOT file is generated to depict the Cascading flow graphically. The DOT file can be loaded into OmniGraffle or Visio, and it is really helpful for troubleshooting MapReduce workflows in Cascading.

Page 10: Data Processing with Cascading Java API on Apache Hadoop

??

Page 11: Data Processing with Cascading Java API on Apache Hadoop

http://docs.cascading.org/cascading/1.2/userguide/htmlsingle/

Page 12: Data Processing with Cascading Java API on Apache Hadoop

END!!!