intro to cascading
DESCRIPTION
A practical introduction to Chris Wensel's Cascading MapReduce/Hadoop framework.

TRANSCRIPT
Introduction to
Cascading
What is Cascading?
What is Cascading?
• abstraction over MapReduce
What is Cascading?
• abstraction over MapReduce
• API for data-processing workflows
Why use Cascading?
Why use Cascading?
• MapReduce can be:
Why use Cascading?
• MapReduce can be:
• the wrong level of granularity
Why use Cascading?
• MapReduce can be:
• the wrong level of granularity
• cumbersome to chain together
Why use Cascading?
Why use Cascading?
• Cascading helps create
Why use Cascading?
• Cascading helps create
• higher-level data processing abstractions
Why use Cascading?
• Cascading helps create
• higher-level data processing abstractions
• sophisticated data pipelines
Why use Cascading?
• Cascading helps create
• higher-level data processing abstractions
• sophisticated data pipelines
• reusable components
Credits
• Cascading written by Chris Wensel
• based on his user guide
http://bit.ly/cascading
package cascadingtutorial.wordcount;

/**
 * Word-count example built on the Cascading API.
 *
 * <p>Reads text lines from the path in {@code args[0]}, splits each line into
 * whitespace-separated words, counts occurrences per word, and writes the
 * (count, word) result to the path in {@code args[1]}.
 */
public class Main {

  public static void main( String[] args ) {
    String inputPath = args[0];
    String outputPath = args[1];

    // Input tuples arrive as (byte offset, line text); output is plain text lines.
    Scheme inputScheme = new TextLine(new Fields("offset", "line"));
    Scheme outputScheme = new TextLine();

    // Paths carrying a URI scheme (e.g. hdfs://...) are opened as Hfs taps;
    // bare paths fall back to the local file system (Lfs).
    boolean inputHasUriScheme = inputPath.matches("^[^:]+://.*");
    boolean outputHasUriScheme = outputPath.matches("^[^:]+://.*");
    Tap sourceTap = inputHasUriScheme
        ? new Hfs(inputScheme, inputPath)
        : new Lfs(inputScheme, inputPath);
    Tap sinkTap = outputHasUriScheme
        ? new Hfs(outputScheme, outputPath)
        : new Lfs(outputScheme, outputPath);

    // Split every "line" field on whitespace, emitting one "word" tuple per token.
    Pipe wcPipe = new Each(
        "wordcount",
        new Fields("line"),
        new RegexSplitGenerator(new Fields("word"), "\\s+"),
        new Fields("word"));

    // Group identical words together, then count each group.
    wcPipe = new GroupBy(wcPipe, new Fields("word"));
    wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word"));

    Properties properties = new Properties();
    FlowConnector.setApplicationJarClass(properties, Main.class);

    // Wire source tap, sink tap, and pipe assembly into a Flow and run it.
    Flow parsedLogFlow = new FlowConnector(properties)
        .connect(sourceTap, sinkTap, wcPipe);
    parsedLogFlow.start();
    parsedLogFlow.complete();
  }
}
Data Model
Pipes Filters
Data Model
file
Data Model
file
Data Model
file
Data Model
file
Data Model
file
Data Model
file
Data Model
file
file
file
Data Model
file
file
file
Filters
Tuple
Pipe Assembly
Pipe Assembly
Tuple Stream
Taps: sources & sinks
Taps: sources & sinks
Taps
Taps: sources & sinks
Taps
Taps: sources & sinks
Taps
Taps: sources & sinks
Taps
Source
Taps: sources & sinks
Taps
Source Sink
Pipe + Taps = Flow
Flow
Flow
Flow
Flow Flow
Flow
n Flows
Flow
Flow
Flow Flow
Flow
n Flows = Cascade
Flow
Flow
Flow Flow
Flow
n Flows = Cascade
Flow
Flow
Flow Flow
Flow
complex
Example
package cascadingtutorial.wordcount;
/** * Wordcount example in Cascading */public class Main {
public static void main( String[] args ) { String inputPath = args[0]; String outputPath = args[1];
Scheme inputScheme = new TextLine(new Fields("offset", "line")); Scheme outputScheme = new TextLine();
Tap sourceTap = inputPath.matches( "^[^:]+://.*") ? new Hfs(inputScheme, inputPath) : new Lfs(inputScheme, inputPath); Tap sinkTap = outputPath.matches("^[^:]+://.*") ? new Hfs(outputScheme, outputPath) : new Lfs(outputScheme, outputPath);
Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "\\s+"), new Fields("word"));
wcPipe = new GroupBy(wcPipe, new Fields("word")); wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word"));
Properties properties = new Properties(); FlowConnector.setApplicationJarClass(properties, Main.class);
Flow parsedLogFlow = new FlowConnector(properties) .connect(sourceTap, sinkTap, wcPipe); parsedLogFlow.start(); parsedLogFlow.complete(); } }
package cascadingtutorial.wordcount;
/** * Wordcount example in Cascading */public class Main {
public static void main( String[] args ) { String inputPath = args[0]; String outputPath = args[1];
Scheme inputScheme = new TextLine(new Fields("offset", "line")); Scheme outputScheme = new TextLine();
Tap sourceTap = inputPath.matches( "^[^:]+://.*") ? new Hfs(inputScheme, inputPath) : new Lfs(inputScheme, inputPath); Tap sinkTap = outputPath.matches("^[^:]+://.*") ? new Hfs(outputScheme, outputPath) : new Lfs(outputScheme, outputPath);
Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "\\s+"), new Fields("word"));
wcPipe = new GroupBy(wcPipe, new Fields("word")); wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word"));
Properties properties = new Properties(); FlowConnector.setApplicationJarClass(properties, Main.class);
Flow parsedLogFlow = new FlowConnector(properties) .connect(sourceTap, sinkTap, wcPipe); parsedLogFlow.start(); parsedLogFlow.complete(); } }
package cascadingtutorial.wordcount;

import cascading.flow.Flow;
import cascading.flow.FlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexParser;
import cascading.operation.regex.RegexSplitGenerator;
import cascading.operation.regex.RegexSplitter;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.scheme.Scheme;
import cascading.scheme.TextLine;
import cascading.tap.Hfs;
import cascading.tap.Lfs;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tuple.Fields;
import java.util.Properties;

/**
 * Word-count example built on the Cascading API.
 *
 * <p>Reads text lines from the path in {@code args[0]}, splits each line into
 * whitespace-separated words, counts occurrences per word, and writes the
 * (count, word) result to the path in {@code args[1]}.
 */
public class Main {

  public static void main( String[] args ) {
    // Input tuples arrive as (byte offset, line text); output is plain text lines.
    Scheme inputScheme = new TextLine(new Fields("offset", "line"));
    Scheme outputScheme = new TextLine();

    String inputPath = args[0];
    String outputPath = args[1];

    // Paths carrying a URI scheme (e.g. hdfs://...) are opened as Hfs taps;
    // bare paths fall back to the local file system (Lfs).
    boolean inputHasUriScheme = inputPath.matches("^[^:]+://.*");
    boolean outputHasUriScheme = outputPath.matches("^[^:]+://.*");
    Tap sourceTap = inputHasUriScheme
        ? new Hfs(inputScheme, inputPath)
        : new Lfs(inputScheme, inputPath);
    Tap sinkTap = outputHasUriScheme
        ? new Hfs(outputScheme, outputPath)
        : new Lfs(outputScheme, outputPath);

    // Split every "line" field on whitespace, emitting one "word" tuple per token.
    Pipe wcPipe = new Each(
        "wordcount",
        new Fields("line"),
        new RegexSplitGenerator(new Fields("word"), "\\s+"),
        new Fields("word"));

    // Group identical words together, then count each group.
    wcPipe = new GroupBy(wcPipe, new Fields("word"));
    wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word"));

    // sort by count
    // wcPipe = new GroupBy(wcPipe, Fields.ALL, new Fields("count"), true);

    Properties properties = new Properties();
    FlowConnector.setApplicationJarClass(properties, Main.class);

    // Wire source tap, sink tap, and pipe assembly into a Flow and run it.
    Flow parsedLogFlow = new FlowConnector(properties)
        .connect(sourceTap, sinkTap, wcPipe);
    parsedLogFlow.start();
    parsedLogFlow.complete();
  }
}
package cascadingtutorial.wordcount;
import cascading.flow.Flow;import cascading.flow.FlowConnector;import cascading.operation.aggregator.Count;import cascading.operation.regex.RegexParser;import cascading.operation.regex.RegexSplitGenerator;import cascading.operation.regex.RegexSplitter;import cascading.pipe.Each;import cascading.pipe.Every;import cascading.pipe.GroupBy;import cascading.pipe.Pipe;import cascading.scheme.Scheme;import cascading.scheme.TextLine;import cascading.tap.Hfs;import cascading.tap.Lfs;import cascading.tap.SinkMode;import cascading.tap.Tap;import cascading.tuple.Fields;import java.util.Properties;
/** * Wordcount example in Cascading */public class Main {
public static void main( String[] args ) { Scheme inputScheme = new TextLine(new Fields("offset", "line")); Scheme outputScheme = new TextLine();
String inputPath = args[0]; String outputPath = args[1];
Tap sourceTap = inputPath.matches( "^[^:]+://.*") ? new Hfs(inputScheme, inputPath) : new Lfs(inputScheme, inputPath); Tap sinkTap = outputPath.matches("^[^:]+://.*") ? new Hfs(outputScheme, outputPath) : new Lfs(outputScheme, outputPath);
Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "\\s+"), new Fields("word"));
wcPipe = new GroupBy(wcPipe, new Fields("word")); wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word"));
// sort by count // wcPipe = new GroupBy(wcPipe, Fields.ALL, new Fields("count"), true);
Properties properties = new Properties(); FlowConnector.setApplicationJarClass(properties, Main.class);
Flow parsedLogFlow = new FlowConnector(properties) .connect(sourceTap, sinkTap, wcPipe); parsedLogFlow.start(); parsedLogFlow.complete(); } }
package cascadingtutorial.wordcount;
import cascading.flow.Flow;import cascading.flow.FlowConnector;import cascading.operation.aggregator.Count;import cascading.operation.regex.RegexParser;import cascading.operation.regex.RegexSplitGenerator;import cascading.operation.regex.RegexSplitter;import cascading.pipe.Each;import cascading.pipe.Every;import cascading.pipe.GroupBy;import cascading.pipe.Pipe;import cascading.scheme.Scheme;import cascading.scheme.TextLine;import cascading.tap.Hfs;import cascading.tap.Lfs;import cascading.tap.SinkMode;import cascading.tap.Tap;import cascading.tuple.Fields;import java.util.Properties;
/** * Wordcount example in Cascading */public class Main {
public static void main( String[] args ) { Scheme inputScheme = new TextLine(new Fields("offset", "line")); Scheme outputScheme = new TextLine();
String inputPath = args[0]; String outputPath = args[1];
Tap sourceTap = inputPath.matches( "^[^:]+://.*") ? new Hfs(inputScheme, inputPath) : new Lfs(inputScheme, inputPath); Tap sinkTap = outputPath.matches("^[^:]+://.*") ? new Hfs(outputScheme, outputPath) : new Lfs(outputScheme, outputPath);
Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "\\s+"), new Fields("word"));
wcPipe = new GroupBy(wcPipe, new Fields("word")); wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word"));
// sort by count // wcPipe = new GroupBy(wcPipe, Fields.ALL, new Fields("count"), true);
Properties properties = new Properties(); FlowConnector.setApplicationJarClass(properties, Main.class);
Flow parsedLogFlow = new FlowConnector(properties) .connect(sourceTap, sinkTap, wcPipe); parsedLogFlow.start(); parsedLogFlow.complete(); } }
TextLine()
package cascadingtutorial.wordcount;
import cascading.flow.Flow;import cascading.flow.FlowConnector;import cascading.operation.aggregator.Count;import cascading.operation.regex.RegexParser;import cascading.operation.regex.RegexSplitGenerator;import cascading.operation.regex.RegexSplitter;import cascading.pipe.Each;import cascading.pipe.Every;import cascading.pipe.GroupBy;import cascading.pipe.Pipe;import cascading.scheme.Scheme;import cascading.scheme.TextLine;import cascading.tap.Hfs;import cascading.tap.Lfs;import cascading.tap.SinkMode;import cascading.tap.Tap;import cascading.tuple.Fields;import java.util.Properties;
/** * Wordcount example in Cascading */public class Main {
public static void main( String[] args ) { Scheme inputScheme = new TextLine(new Fields("offset", "line")); Scheme outputScheme = new TextLine();
String inputPath = args[0]; String outputPath = args[1];
Tap sourceTap = inputPath.matches( "^[^:]+://.*") ? new Hfs(inputScheme, inputPath) : new Lfs(inputScheme, inputPath); Tap sinkTap = outputPath.matches("^[^:]+://.*") ? new Hfs(outputScheme, outputPath) : new Lfs(outputScheme, outputPath);
Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "\\s+"), new Fields("word"));
wcPipe = new GroupBy(wcPipe, new Fields("word")); wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word"));
// sort by count // wcPipe = new GroupBy(wcPipe, Fields.ALL, new Fields("count"), true);
Properties properties = new Properties(); FlowConnector.setApplicationJarClass(properties, Main.class);
Flow parsedLogFlow = new FlowConnector(properties) .connect(sourceTap, sinkTap, wcPipe); parsedLogFlow.start(); parsedLogFlow.complete(); } }
TextLine() SequenceFile()
package cascadingtutorial.wordcount;
import cascading.flow.Flow;import cascading.flow.FlowConnector;import cascading.operation.aggregator.Count;import cascading.operation.regex.RegexParser;import cascading.operation.regex.RegexSplitGenerator;import cascading.operation.regex.RegexSplitter;import cascading.pipe.Each;import cascading.pipe.Every;import cascading.pipe.GroupBy;import cascading.pipe.Pipe;import cascading.scheme.Scheme;import cascading.scheme.TextLine;import cascading.tap.Hfs;import cascading.tap.Lfs;import cascading.tap.SinkMode;import cascading.tap.Tap;import cascading.tuple.Fields;import java.util.Properties;
/** * Wordcount example in Cascading */public class Main {
public static void main( String[] args ) { Scheme inputScheme = new TextLine(new Fields("offset", "line")); Scheme outputScheme = new TextLine();
String inputPath = args[0]; String outputPath = args[1];
Tap sourceTap = inputPath.matches( "^[^:]+://.*") ? new Hfs(inputScheme, inputPath) : new Lfs(inputScheme, inputPath); Tap sinkTap = outputPath.matches("^[^:]+://.*") ? new Hfs(outputScheme, outputPath) : new Lfs(outputScheme, outputPath);
Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "\\s+"), new Fields("word"));
wcPipe = new GroupBy(wcPipe, new Fields("word")); wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word"));
// sort by count // wcPipe = new GroupBy(wcPipe, Fields.ALL, new Fields("count"), true);
Properties properties = new Properties(); FlowConnector.setApplicationJarClass(properties, Main.class);
Flow parsedLogFlow = new FlowConnector(properties) .connect(sourceTap, sinkTap, wcPipe); parsedLogFlow.start(); parsedLogFlow.complete(); } }
package cascadingtutorial.wordcount;
import cascading.flow.Flow;import cascading.flow.FlowConnector;import cascading.operation.aggregator.Count;import cascading.operation.regex.RegexParser;import cascading.operation.regex.RegexSplitGenerator;import cascading.operation.regex.RegexSplitter;import cascading.pipe.Each;import cascading.pipe.Every;import cascading.pipe.GroupBy;import cascading.pipe.Pipe;import cascading.scheme.Scheme;import cascading.scheme.TextLine;import cascading.tap.Hfs;import cascading.tap.Lfs;import cascading.tap.SinkMode;import cascading.tap.Tap;import cascading.tuple.Fields;import java.util.Properties;
/** * Wordcount example in Cascading */public class Main {
public static void main( String[] args ) { Scheme inputScheme = new TextLine(new Fields("offset", "line")); Scheme outputScheme = new TextLine();
String inputPath = args[0]; String outputPath = args[1];
Tap sourceTap = inputPath.matches( "^[^:]+://.*") ? new Hfs(inputScheme, inputPath) : new Lfs(inputScheme, inputPath); Tap sinkTap = outputPath.matches("^[^:]+://.*") ? new Hfs(outputScheme, outputPath) : new Lfs(outputScheme, outputPath);
Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "\\s+"), new Fields("word"));
wcPipe = new GroupBy(wcPipe, new Fields("word")); wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word"));
// sort by count // wcPipe = new GroupBy(wcPipe, Fields.ALL, new Fields("count"), true);
Properties properties = new Properties(); FlowConnector.setApplicationJarClass(properties, Main.class);
Flow parsedLogFlow = new FlowConnector(properties) .connect(sourceTap, sinkTap, wcPipe); parsedLogFlow.start(); parsedLogFlow.complete(); } }
package cascadingtutorial.wordcount;
import cascading.flow.Flow;import cascading.flow.FlowConnector;import cascading.operation.aggregator.Count;import cascading.operation.regex.RegexParser;import cascading.operation.regex.RegexSplitGenerator;import cascading.operation.regex.RegexSplitter;import cascading.pipe.Each;import cascading.pipe.Every;import cascading.pipe.GroupBy;import cascading.pipe.Pipe;import cascading.scheme.Scheme;import cascading.scheme.TextLine;import cascading.tap.Hfs;import cascading.tap.Lfs;import cascading.tap.SinkMode;import cascading.tap.Tap;import cascading.tuple.Fields;import java.util.Properties;
/** * Wordcount example in Cascading */public class Main {
public static void main( String[] args ) { Scheme inputScheme = new TextLine(new Fields("offset", "line")); Scheme outputScheme = new TextLine();
String inputPath = args[0]; String outputPath = args[1];
Tap sourceTap = inputPath.matches( "^[^:]+://.*") ? new Hfs(inputScheme, inputPath) : new Lfs(inputScheme, inputPath); Tap sinkTap = outputPath.matches("^[^:]+://.*") ? new Hfs(outputScheme, outputPath) : new Lfs(outputScheme, outputPath);
Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "\\s+"), new Fields("word"));
wcPipe = new GroupBy(wcPipe, new Fields("word")); wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word"));
// sort by count // wcPipe = new GroupBy(wcPipe, Fields.ALL, new Fields("count"), true);
Properties properties = new Properties(); FlowConnector.setApplicationJarClass(properties, Main.class);
Flow parsedLogFlow = new FlowConnector(properties) .connect(sourceTap, sinkTap, wcPipe); parsedLogFlow.start(); parsedLogFlow.complete(); } }
hdfs://master0:54310/user/nmurray/data.txt
package cascadingtutorial.wordcount;
import cascading.flow.Flow;import cascading.flow.FlowConnector;import cascading.operation.aggregator.Count;import cascading.operation.regex.RegexParser;import cascading.operation.regex.RegexSplitGenerator;import cascading.operation.regex.RegexSplitter;import cascading.pipe.Each;import cascading.pipe.Every;import cascading.pipe.GroupBy;import cascading.pipe.Pipe;import cascading.scheme.Scheme;import cascading.scheme.TextLine;import cascading.tap.Hfs;import cascading.tap.Lfs;import cascading.tap.SinkMode;import cascading.tap.Tap;import cascading.tuple.Fields;import java.util.Properties;
/** * Wordcount example in Cascading */public class Main {
public static void main( String[] args ) { Scheme inputScheme = new TextLine(new Fields("offset", "line")); Scheme outputScheme = new TextLine();
String inputPath = args[0]; String outputPath = args[1];
Tap sourceTap = inputPath.matches( "^[^:]+://.*") ? new Hfs(inputScheme, inputPath) : new Lfs(inputScheme, inputPath); Tap sinkTap = outputPath.matches("^[^:]+://.*") ? new Hfs(outputScheme, outputPath) : new Lfs(outputScheme, outputPath);
Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "\\s+"), new Fields("word"));
wcPipe = new GroupBy(wcPipe, new Fields("word")); wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word"));
// sort by count // wcPipe = new GroupBy(wcPipe, Fields.ALL, new Fields("count"), true);
Properties properties = new Properties(); FlowConnector.setApplicationJarClass(properties, Main.class);
Flow parsedLogFlow = new FlowConnector(properties) .connect(sourceTap, sinkTap, wcPipe); parsedLogFlow.start(); parsedLogFlow.complete(); } }
hdfs://master0:54310/user/nmurray/data.txt new Hfs()
package cascadingtutorial.wordcount;
import cascading.flow.Flow;import cascading.flow.FlowConnector;import cascading.operation.aggregator.Count;import cascading.operation.regex.RegexParser;import cascading.operation.regex.RegexSplitGenerator;import cascading.operation.regex.RegexSplitter;import cascading.pipe.Each;import cascading.pipe.Every;import cascading.pipe.GroupBy;import cascading.pipe.Pipe;import cascading.scheme.Scheme;import cascading.scheme.TextLine;import cascading.tap.Hfs;import cascading.tap.Lfs;import cascading.tap.SinkMode;import cascading.tap.Tap;import cascading.tuple.Fields;import java.util.Properties;
/** * Wordcount example in Cascading */public class Main {
public static void main( String[] args ) { Scheme inputScheme = new TextLine(new Fields("offset", "line")); Scheme outputScheme = new TextLine();
String inputPath = args[0]; String outputPath = args[1];
Tap sourceTap = inputPath.matches( "^[^:]+://.*") ? new Hfs(inputScheme, inputPath) : new Lfs(inputScheme, inputPath); Tap sinkTap = outputPath.matches("^[^:]+://.*") ? new Hfs(outputScheme, outputPath) : new Lfs(outputScheme, outputPath);
Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "\\s+"), new Fields("word"));
wcPipe = new GroupBy(wcPipe, new Fields("word")); wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word"));
// sort by count // wcPipe = new GroupBy(wcPipe, Fields.ALL, new Fields("count"), true);
Properties properties = new Properties(); FlowConnector.setApplicationJarClass(properties, Main.class);
Flow parsedLogFlow = new FlowConnector(properties) .connect(sourceTap, sinkTap, wcPipe); parsedLogFlow.start(); parsedLogFlow.complete(); } }
hdfs://master0:54310/user/nmurray/data.txt
data/sources/obama-inaugural-address.txt
new Hfs()
package cascadingtutorial.wordcount;
import cascading.flow.Flow;import cascading.flow.FlowConnector;import cascading.operation.aggregator.Count;import cascading.operation.regex.RegexParser;import cascading.operation.regex.RegexSplitGenerator;import cascading.operation.regex.RegexSplitter;import cascading.pipe.Each;import cascading.pipe.Every;import cascading.pipe.GroupBy;import cascading.pipe.Pipe;import cascading.scheme.Scheme;import cascading.scheme.TextLine;import cascading.tap.Hfs;import cascading.tap.Lfs;import cascading.tap.SinkMode;import cascading.tap.Tap;import cascading.tuple.Fields;import java.util.Properties;
/** * Wordcount example in Cascading */public class Main {
public static void main( String[] args ) { Scheme inputScheme = new TextLine(new Fields("offset", "line")); Scheme outputScheme = new TextLine();
String inputPath = args[0]; String outputPath = args[1];
Tap sourceTap = inputPath.matches( "^[^:]+://.*") ? new Hfs(inputScheme, inputPath) : new Lfs(inputScheme, inputPath); Tap sinkTap = outputPath.matches("^[^:]+://.*") ? new Hfs(outputScheme, outputPath) : new Lfs(outputScheme, outputPath);
Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "\\s+"), new Fields("word"));
wcPipe = new GroupBy(wcPipe, new Fields("word")); wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word"));
// sort by count // wcPipe = new GroupBy(wcPipe, Fields.ALL, new Fields("count"), true);
Properties properties = new Properties(); FlowConnector.setApplicationJarClass(properties, Main.class);
Flow parsedLogFlow = new FlowConnector(properties) .connect(sourceTap, sinkTap, wcPipe); parsedLogFlow.start(); parsedLogFlow.complete(); } }
hdfs://master0:54310/user/nmurray/data.txt
data/sources/obama-inaugural-address.txt
new Hfs()
new Lfs()
package cascadingtutorial.wordcount;
import cascading.flow.Flow;import cascading.flow.FlowConnector;import cascading.operation.aggregator.Count;import cascading.operation.regex.RegexParser;import cascading.operation.regex.RegexSplitGenerator;import cascading.operation.regex.RegexSplitter;import cascading.pipe.Each;import cascading.pipe.Every;import cascading.pipe.GroupBy;import cascading.pipe.Pipe;import cascading.scheme.Scheme;import cascading.scheme.TextLine;import cascading.tap.Hfs;import cascading.tap.Lfs;import cascading.tap.SinkMode;import cascading.tap.Tap;import cascading.tuple.Fields;import java.util.Properties;
/** * Wordcount example in Cascading */public class Main {
public static void main( String[] args ) { Scheme inputScheme = new TextLine(new Fields("offset", "line")); Scheme outputScheme = new TextLine();
String inputPath = args[0]; String outputPath = args[1];
Tap sourceTap = inputPath.matches( "^[^:]+://.*") ? new Hfs(inputScheme, inputPath) : new Lfs(inputScheme, inputPath); Tap sinkTap = outputPath.matches("^[^:]+://.*") ? new Hfs(outputScheme, outputPath) : new Lfs(outputScheme, outputPath);
Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "\\s+"), new Fields("word"));
wcPipe = new GroupBy(wcPipe, new Fields("word")); wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word"));
// sort by count // wcPipe = new GroupBy(wcPipe, Fields.ALL, new Fields("count"), true);
Properties properties = new Properties(); FlowConnector.setApplicationJarClass(properties, Main.class);
Flow parsedLogFlow = new FlowConnector(properties) .connect(sourceTap, sinkTap, wcPipe); parsedLogFlow.start(); parsedLogFlow.complete(); } }
hdfs://master0:54310/user/nmurray/data.txt
data/sources/obama-inaugural-address.txt
new Hfs()
new Lfs()
S3fs()
package cascadingtutorial.wordcount;
import cascading.flow.Flow;import cascading.flow.FlowConnector;import cascading.operation.aggregator.Count;import cascading.operation.regex.RegexParser;import cascading.operation.regex.RegexSplitGenerator;import cascading.operation.regex.RegexSplitter;import cascading.pipe.Each;import cascading.pipe.Every;import cascading.pipe.GroupBy;import cascading.pipe.Pipe;import cascading.scheme.Scheme;import cascading.scheme.TextLine;import cascading.tap.Hfs;import cascading.tap.Lfs;import cascading.tap.SinkMode;import cascading.tap.Tap;import cascading.tuple.Fields;import java.util.Properties;
/** * Wordcount example in Cascading */public class Main {
public static void main( String[] args ) { Scheme inputScheme = new TextLine(new Fields("offset", "line")); Scheme outputScheme = new TextLine();
String inputPath = args[0]; String outputPath = args[1];
Tap sourceTap = inputPath.matches( "^[^:]+://.*") ? new Hfs(inputScheme, inputPath) : new Lfs(inputScheme, inputPath); Tap sinkTap = outputPath.matches("^[^:]+://.*") ? new Hfs(outputScheme, outputPath) : new Lfs(outputScheme, outputPath);
Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "\\s+"), new Fields("word"));
wcPipe = new GroupBy(wcPipe, new Fields("word")); wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word"));
// sort by count // wcPipe = new GroupBy(wcPipe, Fields.ALL, new Fields("count"), true);
Properties properties = new Properties(); FlowConnector.setApplicationJarClass(properties, Main.class);
Flow parsedLogFlow = new FlowConnector(properties) .connect(sourceTap, sinkTap, wcPipe); parsedLogFlow.start(); parsedLogFlow.complete(); } }
hdfs://master0:54310/user/nmurray/data.txt
data/sources/obama-inaugural-address.txt
new Hfs()
new Lfs()
S3fs()
GlobHfs()
package cascadingtutorial.wordcount;
import cascading.flow.Flow;import cascading.flow.FlowConnector;import cascading.operation.aggregator.Count;import cascading.operation.regex.RegexParser;import cascading.operation.regex.RegexSplitGenerator;import cascading.operation.regex.RegexSplitter;import cascading.pipe.Each;import cascading.pipe.Every;import cascading.pipe.GroupBy;import cascading.pipe.Pipe;import cascading.scheme.Scheme;import cascading.scheme.TextLine;import cascading.tap.Hfs;import cascading.tap.Lfs;import cascading.tap.SinkMode;import cascading.tap.Tap;import cascading.tuple.Fields;import java.util.Properties;
/** * Wordcount example in Cascading */public class Main {
public static void main( String[] args ) { Scheme inputScheme = new TextLine(new Fields("offset", "line")); Scheme outputScheme = new TextLine();
String inputPath = args[0]; String outputPath = args[1];
Tap sourceTap = inputPath.matches( "^[^:]+://.*") ? new Hfs(inputScheme, inputPath) : new Lfs(inputScheme, inputPath); Tap sinkTap = outputPath.matches("^[^:]+://.*") ? new Hfs(outputScheme, outputPath) : new Lfs(outputScheme, outputPath);
Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "\\s+"), new Fields("word"));
wcPipe = new GroupBy(wcPipe, new Fields("word")); wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word"));
Properties properties = new Properties(); FlowConnector.setApplicationJarClass(properties, Main.class);
Flow parsedLogFlow = new FlowConnector(properties) .connect(sourceTap, sinkTap, wcPipe); parsedLogFlow.start(); parsedLogFlow.complete(); } }
package cascadingtutorial.wordcount;
import cascading.flow.Flow;import cascading.flow.FlowConnector;import cascading.operation.aggregator.Count;import cascading.operation.regex.RegexParser;import cascading.operation.regex.RegexSplitGenerator;import cascading.operation.regex.RegexSplitter;import cascading.pipe.Each;import cascading.pipe.Every;import cascading.pipe.GroupBy;import cascading.pipe.Pipe;import cascading.scheme.Scheme;import cascading.scheme.TextLine;import cascading.tap.Hfs;import cascading.tap.Lfs;import cascading.tap.SinkMode;import cascading.tap.Tap;import cascading.tuple.Fields;import java.util.Properties;
/** * Wordcount example in Cascading */public class Main {
public static void main( String[] args ) { Scheme inputScheme = new TextLine(new Fields("offset", "line")); Scheme outputScheme = new TextLine();
String inputPath = args[0]; String outputPath = args[1];
Tap sourceTap = inputPath.matches( "^[^:]+://.*") ? new Hfs(inputScheme, inputPath) : new Lfs(inputScheme, inputPath); Tap sinkTap = outputPath.matches("^[^:]+://.*") ? new Hfs(outputScheme, outputPath) : new Lfs(outputScheme, outputPath);
Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "\\s+"), new Fields("word"));
wcPipe = new GroupBy(wcPipe, new Fields("word")); wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word"));
Properties properties = new Properties(); FlowConnector.setApplicationJarClass(properties, Main.class);
Flow parsedLogFlow = new FlowConnector(properties) .connect(sourceTap, sinkTap, wcPipe); parsedLogFlow.start(); parsedLogFlow.complete(); } }
package cascadingtutorial.wordcount;
import cascading.flow.Flow;import cascading.flow.FlowConnector;import cascading.operation.aggregator.Count;import cascading.operation.regex.RegexParser;import cascading.operation.regex.RegexSplitGenerator;import cascading.operation.regex.RegexSplitter;import cascading.pipe.Each;import cascading.pipe.Every;import cascading.pipe.GroupBy;import cascading.pipe.Pipe;import cascading.scheme.Scheme;import cascading.scheme.TextLine;import cascading.tap.Hfs;import cascading.tap.Lfs;import cascading.tap.SinkMode;import cascading.tap.Tap;import cascading.tuple.Fields;import java.util.Properties;
/** * Wordcount example in Cascading */public class Main {
public static void main( String[] args ) { Scheme inputScheme = new TextLine(new Fields("offset", "line")); Scheme outputScheme = new TextLine();
String inputPath = args[0]; String outputPath = args[1];
Tap sourceTap = inputPath.matches( "^[^:]+://.*") ? new Hfs(inputScheme, inputPath) : new Lfs(inputScheme, inputPath); Tap sinkTap = outputPath.matches("^[^:]+://.*") ? new Hfs(outputScheme, outputPath) : new Lfs(outputScheme, outputPath);
Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "\\s+"), new Fields("word"));
wcPipe = new GroupBy(wcPipe, new Fields("word")); wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word"));
Properties properties = new Properties(); FlowConnector.setApplicationJarClass(properties, Main.class);
Flow parsedLogFlow = new FlowConnector(properties) .connect(sourceTap, sinkTap, wcPipe); parsedLogFlow.start(); parsedLogFlow.complete(); } }
Pipe Assembly
Pipe Assemblies
Pipe Assemblies
Pipe Assemblies
• Define work against a Tuple Stream
Pipe Assemblies
• Define work against a Tuple Stream
• May have multiple sources and sinks
Pipe Assemblies
• Define work against a Tuple Stream
• May have multiple sources and sinks
• Splits
Pipe Assemblies
• Define work against a Tuple Stream
• May have multiple sources and sinks
• Splits
• Merges
Pipe Assemblies
• Define work against a Tuple Stream
• May have multiple sources and sinks
• Splits
• Merges
• Joins
Pipe Assemblies
• Pipe
Pipe Assemblies
• Pipe
• Each
• GroupBy
• CoGroup
• Every
• SubAssembly
Pipe Assemblies
• Pipe
• Each
• GroupBy
• CoGroup
• Every
• SubAssembly
Pipe Assemblies
• Pipe
• Each
• GroupBy
• CoGroup
• Every
• SubAssembly
Applies a Function or Filter
Operation to each Tuple
Pipe Assemblies
• Pipe
• Each
• GroupBy
• CoGroup
• Every
• SubAssembly
Pipe Assemblies
• Pipe
• Each
• GroupBy
• CoGroup
• Every
• SubAssembly
Pipe Assemblies
• Pipe
• Each
• GroupBy
• CoGroup
• Every
• SubAssembly
Pipe Assemblies
Group(& merge)
• Pipe
• Each
• GroupBy
• CoGroup
• Every
• SubAssembly
Pipe Assemblies
Joins(inner, outer, left, right)
• Pipe
• Each
• GroupBy
• CoGroup
• Every
• SubAssembly
Pipe Assemblies
• Pipe
• Each
• GroupBy
• CoGroup
• Every
• SubAssembly
Pipe Assemblies
Applies an Aggregator (count, sum) to every group of Tuples.
• Pipe
• Each
• GroupBy
• CoGroup
• Every
• SubAssembly
Pipe Assemblies
Nesting
Each vs. Every
Each vs. Every
• Each() is for individual Tuples
Each vs. Every
• Each() is for individual Tuples
• Every() is for groups of Tuples
Each
new Each(previousPipe, argumentSelector, operation, outputSelector)
Each
new Each(previousPipe, argumentSelector, operation, outputSelector)
Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "\\s+"), new Fields("word"));
Each
new Each(previousPipe, argumentSelector, operation, outputSelector)
Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "\\s+"), new Fields("word"));
Each
new Each(previousPipe, argumentSelector, operation, outputSelector)
Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "\\s+"), new Fields("word"));
package cascadingtutorial.wordcount;
import cascading.flow.Flow;import cascading.flow.FlowConnector;import cascading.operation.aggregator.Count;import cascading.operation.regex.RegexParser;import cascading.operation.regex.RegexSplitGenerator;import cascading.operation.regex.RegexSplitter;import cascading.pipe.Each;import cascading.pipe.Every;import cascading.pipe.GroupBy;import cascading.pipe.Pipe;import cascading.scheme.Scheme;import cascading.scheme.TextLine;import cascading.tap.Hfs;import cascading.tap.Lfs;import cascading.tap.SinkMode;import cascading.tap.Tap;import cascading.tuple.Fields;import java.util.Properties;
/** * Wordcount example in Cascading */public class Main {
public static void main( String[] args ) { Scheme inputScheme = new TextLine(new Fields("offset", "line")); Scheme outputScheme = new TextLine();
String inputPath = args[0]; String outputPath = args[1];
Tap sourceTap = inputPath.matches( "^[^:]+://.*") ? new Hfs(inputScheme, inputPath) : new Lfs(inputScheme, inputPath); Tap sinkTap = outputPath.matches("^[^:]+://.*") ? new Hfs(outputScheme, outputPath) : new Lfs(outputScheme, outputPath);
Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "\\s+"), new Fields("word"));
wcPipe = new GroupBy(wcPipe, new Fields("word")); wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word"));
// sort by count // wcPipe = new GroupBy(wcPipe, Fields.ALL, new Fields("count"), true);
Properties properties = new Properties(); FlowConnector.setApplicationJarClass(properties, Main.class);
Flow parsedLogFlow = new FlowConnector(properties) .connect(sourceTap, sinkTap, wcPipe); parsedLogFlow.start(); parsedLogFlow.complete(); } }
Each
new Each(previousPipe, argumentSelector, operation, outputSelector)
Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "\\s+"), new Fields("word"));
Each
new Each(previousPipe, argumentSelector, operation, outputSelector)
Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "\\s+"), new Fields("word"));
Each
new Each(previousPipe, argumentSelector, operation, outputSelector)
Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "\\s+"), new Fields("word"));
Each
new Each(previousPipe, argumentSelector, operation, outputSelector)
Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "\\s+"), new Fields("word"));
Fields()
Each
Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "\\s+"), new Fields("word"));
offset line
0 the lazy brown fox
Each
Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "\\s+"), new Fields("word"));
offset line
0 the lazy brown fox
Each
Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "\\s+"), new Fields("word"));
offset line
0 the lazy brown fox
Each
Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "\\s+"), new Fields("word"));
offset line
0 the lazy brown fox
word
lazy
word
brown
word
fox
word
the
X
Operations: Functions
Operations: Functions
• Insert()
Operations: Functions
• Insert()
• RegexParser() / RegexGenerator() / RegexReplace()
Operations: Functions
• Insert()
• RegexParser() / RegexGenerator() / RegexReplace()
• DateParser() / DateFormatter()
Operations: Functions
• Insert()
• RegexParser() / RegexGenerator() / RegexReplace()
• DateParser() / DateFormatter()
• XPathGenerator()
Operations: Functions
• Insert()
• RegexParser() / RegexGenerator() / RegexReplace()
• DateParser() / DateFormatter()
• XPathGenerator()
• Identity()
Operations: Functions
• Insert()
• RegexParser() / RegexGenerator() / RegexReplace()
• DateParser() / DateFormatter()
• XPathGenerator()
• Identity()
• etc...
Operations: Filters
Operations: Filters
• RegexFilter()
Operations: Filters
• RegexFilter()
• FilterNull()
Operations: Filters
• RegexFilter()
• FilterNull()
• And() / Or() / Not() / Xor()
Operations: Filters
• RegexFilter()
• FilterNull()
• And() / Or() / Not() / Xor()
• ExpressionFilter()
Operations: Filters
• RegexFilter()
• FilterNull()
• And() / Or() / Not() / Xor()
• ExpressionFilter()
• etc...
Each
new Each(previousPipe, argumentSelector, operation, outputSelector)
Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "\\s+"), new Fields("word"));
Each
Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "\\s+"), new Fields("word"));
package cascadingtutorial.wordcount;
import cascading.flow.Flow;import cascading.flow.FlowConnector;import cascading.operation.aggregator.Count;import cascading.operation.regex.RegexParser;import cascading.operation.regex.RegexSplitGenerator;import cascading.operation.regex.RegexSplitter;import cascading.pipe.Each;import cascading.pipe.Every;import cascading.pipe.GroupBy;import cascading.pipe.Pipe;import cascading.scheme.Scheme;import cascading.scheme.TextLine;import cascading.tap.Hfs;import cascading.tap.Lfs;import cascading.tap.SinkMode;import cascading.tap.Tap;import cascading.tuple.Fields;import java.util.Properties;
/** * Wordcount example in Cascading */public class Main {
public static void main( String[] args ) { Scheme inputScheme = new TextLine(new Fields("offset", "line")); Scheme outputScheme = new TextLine();
String inputPath = args[0]; String outputPath = args[1];
Tap sourceTap = inputPath.matches( "^[^:]+://.*") ? new Hfs(inputScheme, inputPath) : new Lfs(inputScheme, inputPath); Tap sinkTap = outputPath.matches("^[^:]+://.*") ? new Hfs(outputScheme, outputPath) : new Lfs(outputScheme, outputPath);
Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "\\s+"), new Fields("word"));
wcPipe = new GroupBy(wcPipe, new Fields("word")); wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word"));
Properties properties = new Properties(); FlowConnector.setApplicationJarClass(properties, Main.class);
Flow parsedLogFlow = new FlowConnector(properties) .connect(sourceTap, sinkTap, wcPipe); parsedLogFlow.start(); parsedLogFlow.complete(); } }
package cascadingtutorial.wordcount;
import cascading.flow.Flow;import cascading.flow.FlowConnector;import cascading.operation.aggregator.Count;import cascading.operation.regex.RegexParser;import cascading.operation.regex.RegexSplitGenerator;import cascading.operation.regex.RegexSplitter;import cascading.pipe.Each;import cascading.pipe.Every;import cascading.pipe.GroupBy;import cascading.pipe.Pipe;import cascading.scheme.Scheme;import cascading.scheme.TextLine;import cascading.tap.Hfs;import cascading.tap.Lfs;import cascading.tap.SinkMode;import cascading.tap.Tap;import cascading.tuple.Fields;import java.util.Properties;
/** * Wordcount example in Cascading */public class Main {
public static void main( String[] args ) { Scheme inputScheme = new TextLine(new Fields("offset", "line")); Scheme outputScheme = new TextLine();
String inputPath = args[0]; String outputPath = args[1];
Tap sourceTap = inputPath.matches( "^[^:]+://.*") ? new Hfs(inputScheme, inputPath) : new Lfs(inputScheme, inputPath); Tap sinkTap = outputPath.matches("^[^:]+://.*") ? new Hfs(outputScheme, outputPath) : new Lfs(outputScheme, outputPath);
Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "\\s+"), new Fields("word"));
wcPipe = new GroupBy(wcPipe, new Fields("word")); wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word"));
Properties properties = new Properties(); FlowConnector.setApplicationJarClass(properties, Main.class);
Flow parsedLogFlow = new FlowConnector(properties) .connect(sourceTap, sinkTap, wcPipe); parsedLogFlow.start(); parsedLogFlow.complete(); } }
package cascadingtutorial.wordcount;
import cascading.flow.Flow;import cascading.flow.FlowConnector;import cascading.operation.aggregator.Count;import cascading.operation.regex.RegexParser;import cascading.operation.regex.RegexSplitGenerator;import cascading.operation.regex.RegexSplitter;import cascading.pipe.Each;import cascading.pipe.Every;import cascading.pipe.GroupBy;import cascading.pipe.Pipe;import cascading.scheme.Scheme;import cascading.scheme.TextLine;import cascading.tap.Hfs;import cascading.tap.Lfs;import cascading.tap.SinkMode;import cascading.tap.Tap;import cascading.tuple.Fields;import java.util.Properties;
/** * Wordcount example in Cascading */public class Main {
public static void main( String[] args ) { Scheme inputScheme = new TextLine(new Fields("offset", "line")); Scheme outputScheme = new TextLine();
String inputPath = args[0]; String outputPath = args[1];
Tap sourceTap = inputPath.matches( "^[^:]+://.*") ? new Hfs(inputScheme, inputPath) : new Lfs(inputScheme, inputPath); Tap sinkTap = outputPath.matches("^[^:]+://.*") ? new Hfs(outputScheme, outputPath) : new Lfs(outputScheme, outputPath);
Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "\\s+"), new Fields("word"));
wcPipe = new GroupBy(wcPipe, new Fields("word")); wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word"));
Properties properties = new Properties(); FlowConnector.setApplicationJarClass(properties, Main.class);
Flow parsedLogFlow = new FlowConnector(properties) .connect(sourceTap, sinkTap, wcPipe); parsedLogFlow.start(); parsedLogFlow.complete(); } }
package cascadingtutorial.wordcount;
import cascading.flow.Flow;import cascading.flow.FlowConnector;import cascading.operation.aggregator.Count;import cascading.operation.regex.RegexParser;import cascading.operation.regex.RegexSplitGenerator;import cascading.operation.regex.RegexSplitter;import cascading.pipe.Each;import cascading.pipe.Every;import cascading.pipe.GroupBy;import cascading.pipe.Pipe;import cascading.scheme.Scheme;import cascading.scheme.TextLine;import cascading.tap.Hfs;import cascading.tap.Lfs;import cascading.tap.SinkMode;import cascading.tap.Tap;import cascading.tuple.Fields;import java.util.Properties;
/** * Wordcount example in Cascading */public class Main {
public static void main( String[] args ) { Scheme inputScheme = new TextLine(new Fields("offset", "line")); Scheme outputScheme = new TextLine();
String inputPath = args[0]; String outputPath = args[1];
Tap sourceTap = inputPath.matches( "^[^:]+://.*") ? new Hfs(inputScheme, inputPath) : new Lfs(inputScheme, inputPath); Tap sinkTap = outputPath.matches("^[^:]+://.*") ? new Hfs(outputScheme, outputPath) : new Lfs(outputScheme, outputPath);
Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "\\s+"), new Fields("word"));
wcPipe = new GroupBy(wcPipe, new Fields("word")); wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word"));
Properties properties = new Properties(); FlowConnector.setApplicationJarClass(properties, Main.class);
Flow parsedLogFlow = new FlowConnector(properties) .connect(sourceTap, sinkTap, wcPipe); parsedLogFlow.start(); parsedLogFlow.complete(); } }
Every
new Every(previousPipe, argumentSelector, operation, outputSelector)
Every
new Every(previousPipe, argumentSelector, operation, outputSelector)
new Every(wcPipe, new Count(), new Fields("count", "word"));
new Every(wcPipe, new Count(), new Fields("count", "word"));
Operations: Aggregator
Operations: Aggregator
• Average()
Operations: Aggregator
• Average()
• Count()
Operations: Aggregator
• Average()
• Count()
• First() / Last()
Operations: Aggregator
• Average()
• Count()
• First() / Last()
• Min() / Max()
Operations: Aggregator
• Average()
• Count()
• First() / Last()
• Min() / Max()
• Sum()
Every
new Every(previousPipe, argumentSelector, operation, outputSelector)
Every
new Every(previousPipe, argumentSelector, operation, outputSelector)
new Every(wcPipe, new Count(), new Fields("count", "word"));
Every
new Every(previousPipe, argumentSelector, operation, outputSelector)
new Every(wcPipe, new Count(), new Fields("count", "word"));
new Every(wcPipe, Fields.ALL, new Count(), new Fields("count", "word"));
Field Selection
Predefined Field Sets
Predefined Field Sets
Fields.ALL all available fields
Fields.GROUP fields used for last grouping
Fields.VALUES fields not used for last grouping
Fields.ARGS fields of argument Tuple (for Operations)
Fields.RESULTS replaces input with Operation result (for Pipes)
Predefined Field Sets
Fields.ALL all available fields
Fields.GROUP fields used for last grouping
Fields.VALUES fields not used for last grouping
Fields.ARGS fields of argument Tuple (for Operations)
Fields.RESULTS replaces input with Operation result (for Pipes)
Predefined Field Sets
Fields.ALL all available fields
Fields.GROUP fields used for last grouping
Fields.VALUES fields not used for last grouping
Fields.ARGS fields of argument Tuple (for Operations)
Fields.RESULTS replaces input with Operation result (for Pipes)
Predefined Field Sets
Fields.ALL all available fields
Fields.GROUP fields used for last grouping
Fields.VALUES fields not used for last grouping
Fields.ARGS fields of argument Tuple (for Operations)
Fields.RESULTS replaces input with Operation result (for Pipes)
Predefined Field Sets
Fields.ALL all available fields
Fields.GROUP fields used for last grouping
Fields.VALUES fields not used for last grouping
Fields.ARGS fields of argument Tuple (for Operations)
Fields.RESULTS replaces input with Operation result (for Pipes)
Field Selection
pipe = new Each(pipe, new Fields("timestamp"), new DateParser("yyyy-MM-dd HH:mm:ss"), new Fields("timestamp"));
new Each(previousPipe, argumentSelector, operation, outputSelector)
Field Selection
new Each(previousPipe, argumentSelector, operation, outputSelector)
Operation Input Tuple =
Original Tuple Argument Selector
Field Selection
pipe = new Each(pipe, new Fields("timestamp"), new DateParser("yyyy-MM-dd HH:mm:ss"), new Fields("ts", "search_term", "visitor_id"));
Field Selection
pipe = new Each(pipe, new Fields("timestamp"), new DateParser("yyyy-MM-dd HH:mm:ss"), new Fields("ts", "search_term", "visitor_id"));
id visitor_id
search_term
timestamp
page_number
Original Tuple
Field Selection
pipe = new Each(pipe, new Fields("timestamp"), new DateParser("yyyy-MM-dd HH:mm:ss"), new Fields("ts", "search_term", "visitor_id"));
id visitor_id
search_term
timestamp
page_number
Original Tuple
timestamp
Argument Selector
Field Selection
pipe = new Each(pipe, new Fields("timestamp"), new DateParser("yyyy-MM-dd HH:mm:ss"), new Fields("ts", "search_term", "visitor_id"));
id visitor_id
search_term
timestamp
page_number
Original Tuple
timestamp
Argument Selector
Field Selection
pipe = new Each(pipe, new Fields("timestamp"), new DateParser("yyyy-MM-dd HH:mm:ss"), new Fields("ts", "search_term", "visitor_id"));
id visitor_id
search_term
timestamp
page_number
Original Tuple
timestamp
Argument Selector
timestamp
Input to DateParser will be:
Field Selection
new Each(previousPipe, argumentSelector, operation, outputSelector)
Output Tuple =
Original Tuple ⊕ Operation Tuple Output Selector
Field Selection
pipe = new Each(pipe, new Fields("timestamp"), new DateParser("yyyy-MM-dd HH:mm:ss"), new Fields("ts", "search_term", "visitor_id"));
Field Selection
pipe = new Each(pipe, new Fields("timestamp"), new DateParser("yyyy-MM-dd HH:mm:ss"), new Fields("ts", "search_term", "visitor_id"));
id visitor_id search_term timestamp page_number
Original Tuple
Field Selection
pipe = new Each(pipe, new Fields("timestamp"), new DateParser("yyyy-MM-dd HH:mm:ss"), new Fields("ts", "search_term", "visitor_id"));
id visitor_id search_term timestamp page_number
Original Tuple
⊕
Field Selection
pipe = new Each(pipe, new Fields("timestamp"), new DateParser("yyyy-MM-dd HH:mm:ss"), new Fields("ts", "search_term", "visitor_id"));
DateParser Output
ts
id visitor_id search_term timestamp page_number
Original Tuple
⊕
Field Selection
pipe = new Each(pipe, new Fields("timestamp"), new DateParser("yyyy-MM-dd HH:mm:ss"), new Fields("ts", "search_term", "visitor_id"));
DateParser Output
ts ts search_term
visitor_ id
Output Selector
idvisitor_
idsearch_
termtime
stamppage_
number
Original Tuple
⊕
Field Selection
pipe = new Each(pipe, new Fields("timestamp"), new DateParser("yyyy-MM-dd HH:mm:ss"), new Fields("ts", "search_term", "visitor_id"));
DateParser Output
ts ts search_term
visitor_ id
Output Selector
idvisitor_
idsearch_
termtime
stamppage_
number
Original Tuple
⊕
Field Selection
pipe = new Each(pipe, new Fields("timestamp"), new DateParser("yyyy-MM-dd HH:mm:ss"), new Fields("ts", "search_term", "visitor_id"));
DateParser Output
ts ts search_term
visitor_ id
Output Selector
idvisitor_
idsearch_
termtime
stamppage_
number
Original Tuple
⊕
Field Selection
pipe = new Each(pipe, new Fields("timestamp"), new DateParser("yyyy-MM-dd HH:mm:ss"), new Fields("ts", "search_term", "visitor_id"));
DateParser Output
ts ts search_term
visitor_ id
Output Selector
idvisitor_
idsearch_
termtime
stamppage_
number
Original Tuple
Output of Each will be:
ts search_term
visitor_ id
⊕
Field Selection
pipe = new Each(pipe, new Fields("timestamp"), new DateParser("yyyy-MM-dd HH:mm:ss"), new Fields("ts", "search_term", "visitor_id"));
DateParser Output
ts ts search_term
visitor_ id
Output Selector
idvisitor_
idsearch_
termtime
stamppage_
number
Original Tuple
Output of Each will be:
ts search_term
visitor_ id
⊕
Field Selection
pipe = new Each(pipe, new Fields("timestamp"), new DateParser("yyyy-MM-dd HH:mm:ss"), new Fields("ts", "search_term", "visitor_id"));
DateParser Output
ts ts search_term
visitor_ id
Output Selector
idvisitor_
idsearch_
termtime
stamppage_
number
Original Tuple
Output of Each will be:
ts search_term
visitor_ id
⊕
Field Selection
pipe = new Each(pipe, new Fields("timestamp"), new DateParser("yyyy-MM-dd HH:mm:ss"), new Fields("ts", "search_term", "visitor_id"));
DateParser Output
ts ts search_term
visitor_ id
Output Selector
idvisitor_
idsearch_
termtime
stamppage_
number
Original Tuple
Output of Each will be:
ts search_term
visitor_ id
⊕
Field Selection
pipe = new Each(pipe, new Fields("timestamp"), new DateParser("yyyy-MM-dd HH:mm:ss"), new Fields("ts", "search_term", "visitor_id"));
DateParser Output
ts ts search_term
visitor_ id
Output Selector
idvisitor_
idsearch_
termtime
stamppage_
number
Original Tuple
Output of Each will be:
ts search_term
visitor_ id
⊕X X X
new Every(wcPipe, new Count(), new Fields("count", "word"));
word
hello
word
world
word
hello
word
hello
word
world
new Every(wcPipe, new Count(), new Fields("count", "word"));
word
hello
word
world
word
hello
word
hello
word
world
count word
2 hello
count word
1 world
package cascadingtutorial.wordcount;
import cascading.flow.Flow;import cascading.flow.FlowConnector;import cascading.operation.aggregator.Count;import cascading.operation.regex.RegexParser;import cascading.operation.regex.RegexSplitGenerator;import cascading.operation.regex.RegexSplitter;import cascading.pipe.Each;import cascading.pipe.Every;import cascading.pipe.GroupBy;import cascading.pipe.Pipe;import cascading.scheme.Scheme;import cascading.scheme.TextLine;import cascading.tap.Hfs;import cascading.tap.Lfs;import cascading.tap.SinkMode;import cascading.tap.Tap;import cascading.tuple.Fields;import java.util.Properties;
/** * Wordcount example in Cascading */public class Main {
public static void main( String[] args ) { Scheme inputScheme = new TextLine(new Fields("offset", "line")); Scheme outputScheme = new TextLine();
String inputPath = args[0]; String outputPath = args[1];
Tap sourceTap = inputPath.matches( "^[^:]+://.*") ? new Hfs(inputScheme, inputPath) : new Lfs(inputScheme, inputPath); Tap sinkTap = outputPath.matches("^[^:]+://.*") ? new Hfs(outputScheme, outputPath) : new Lfs(outputScheme, outputPath);
Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "\\s+"), new Fields("word"));
wcPipe = new GroupBy(wcPipe, new Fields("word")); wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word"));
Properties properties = new Properties(); FlowConnector.setApplicationJarClass(properties, Main.class);
Flow parsedLogFlow = new FlowConnector(properties) .connect(sourceTap, sinkTap, wcPipe); parsedLogFlow.start(); parsedLogFlow.complete(); } }
new Every(wcPipe, new Count(), new Fields("count", "word"));
package cascadingtutorial.wordcount;
import cascading.flow.Flow;import cascading.flow.FlowConnector;import cascading.operation.aggregator.Count;import cascading.operation.regex.RegexParser;import cascading.operation.regex.RegexSplitGenerator;import cascading.operation.regex.RegexSplitter;import cascading.pipe.Each;import cascading.pipe.Every;import cascading.pipe.GroupBy;import cascading.pipe.Pipe;import cascading.scheme.Scheme;import cascading.scheme.TextLine;import cascading.tap.Hfs;import cascading.tap.Lfs;import cascading.tap.SinkMode;import cascading.tap.Tap;import cascading.tuple.Fields;import java.util.Properties;
/** * Wordcount example in Cascading */public class Main {
public static void main( String[] args ) { Scheme inputScheme = new TextLine(new Fields("offset", "line")); Scheme outputScheme = new TextLine();
String inputPath = args[0]; String outputPath = args[1];
Tap sourceTap = inputPath.matches( "^[^:]+://.*") ? new Hfs(inputScheme, inputPath) : new Lfs(inputScheme, inputPath); Tap sinkTap = outputPath.matches("^[^:]+://.*") ? new Hfs(outputScheme, outputPath) : new Lfs(outputScheme, outputPath);
Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "\\s+"), new Fields("word"));
wcPipe = new GroupBy(wcPipe, new Fields("word")); wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word"));
Properties properties = new Properties(); FlowConnector.setApplicationJarClass(properties, Main.class);
Flow parsedLogFlow = new FlowConnector(properties) .connect(sourceTap, sinkTap, wcPipe); parsedLogFlow.start(); parsedLogFlow.complete(); } }
package cascadingtutorial.wordcount;
import cascading.flow.Flow;import cascading.flow.FlowConnector;import cascading.operation.aggregator.Count;import cascading.operation.regex.RegexParser;import cascading.operation.regex.RegexSplitGenerator;import cascading.operation.regex.RegexSplitter;import cascading.pipe.Each;import cascading.pipe.Every;import cascading.pipe.GroupBy;import cascading.pipe.Pipe;import cascading.scheme.Scheme;import cascading.scheme.TextLine;import cascading.tap.Hfs;import cascading.tap.Lfs;import cascading.tap.SinkMode;import cascading.tap.Tap;import cascading.tuple.Fields;import java.util.Properties;
/** * Wordcount example in Cascading */public class Main {
public static void main( String[] args ) { Scheme inputScheme = new TextLine(new Fields("offset", "line")); Scheme outputScheme = new TextLine();
String inputPath = args[0]; String outputPath = args[1];
Tap sourceTap = inputPath.matches( "^[^:]+://.*") ? new Hfs(inputScheme, inputPath) : new Lfs(inputScheme, inputPath); Tap sinkTap = outputPath.matches("^[^:]+://.*") ? new Hfs(outputScheme, outputPath) : new Lfs(outputScheme, outputPath);
Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "\\s+"), new Fields("word"));
wcPipe = new GroupBy(wcPipe, new Fields("word")); wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word"));
Properties properties = new Properties(); FlowConnector.setApplicationJarClass(properties, Main.class);
Flow parsedLogFlow = new FlowConnector(properties) .connect(sourceTap, sinkTap, wcPipe); parsedLogFlow.start(); parsedLogFlow.complete(); } }
package cascadingtutorial.wordcount;
import cascading.flow.Flow;import cascading.flow.FlowConnector;import cascading.operation.aggregator.Count;import cascading.operation.regex.RegexParser;import cascading.operation.regex.RegexSplitGenerator;import cascading.operation.regex.RegexSplitter;import cascading.pipe.Each;import cascading.pipe.Every;import cascading.pipe.GroupBy;import cascading.pipe.Pipe;import cascading.scheme.Scheme;import cascading.scheme.TextLine;import cascading.tap.Hfs;import cascading.tap.Lfs;import cascading.tap.SinkMode;import cascading.tap.Tap;import cascading.tuple.Fields;import java.util.Properties;
/** * Wordcount example in Cascading */public class Main {
public static void main( String[] args ) { Scheme inputScheme = new TextLine(new Fields("offset", "line")); Scheme outputScheme = new TextLine();
String inputPath = args[0]; String outputPath = args[1];
Tap sourceTap = inputPath.matches( "^[^:]+://.*") ? new Hfs(inputScheme, inputPath) : new Lfs(inputScheme, inputPath); Tap sinkTap = outputPath.matches("^[^:]+://.*") ? new Hfs(outputScheme, outputPath) : new Lfs(outputScheme, outputPath);
Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "\\s+"), new Fields("word"));
wcPipe = new GroupBy(wcPipe, new Fields("word")); wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word"));
Properties properties = new Properties(); FlowConnector.setApplicationJarClass(properties, Main.class);
Flow parsedLogFlow = new FlowConnector(properties) .connect(sourceTap, sinkTap, wcPipe); parsedLogFlow.start(); parsedLogFlow.complete(); } }
package cascadingtutorial.wordcount;
import cascading.flow.Flow;import cascading.flow.FlowConnector;import cascading.operation.aggregator.Count;import cascading.operation.regex.RegexParser;import cascading.operation.regex.RegexSplitGenerator;import cascading.operation.regex.RegexSplitter;import cascading.pipe.Each;import cascading.pipe.Every;import cascading.pipe.GroupBy;import cascading.pipe.Pipe;import cascading.scheme.Scheme;import cascading.scheme.TextLine;import cascading.tap.Hfs;import cascading.tap.Lfs;import cascading.tap.SinkMode;import cascading.tap.Tap;import cascading.tuple.Fields;import java.util.Properties;
/** * Wordcount example in Cascading */public class Main {
public static void main( String[] args ) { Scheme inputScheme = new TextLine(new Fields("offset", "line")); Scheme outputScheme = new TextLine();
String inputPath = args[0]; String outputPath = args[1];
Tap sourceTap = inputPath.matches( "^[^:]+://.*") ? new Hfs(inputScheme, inputPath) : new Lfs(inputScheme, inputPath); Tap sinkTap = outputPath.matches("^[^:]+://.*") ? new Hfs(outputScheme, outputPath) : new Lfs(outputScheme, outputPath);
Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "\\s+"), new Fields("word"));
wcPipe = new GroupBy(wcPipe, new Fields("word")); wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word"));
Properties properties = new Properties(); FlowConnector.setApplicationJarClass(properties, Main.class);
Flow parsedLogFlow = new FlowConnector(properties) .connect(sourceTap, sinkTap, wcPipe); parsedLogFlow.start(); parsedLogFlow.complete(); } }
Run
input:
mary had a little lamb
little lamb
little lamb
mary had a little lamb
whose fleece was white as snow
input:
mary had a little lamb
little lamb
little lamb
mary had a little lamb
whose fleece was white as snow
hadoop jar ./target/cascading-tutorial-1.0.0.jar \ data/sources/misc/mary.txt data/output
command:
output:
2 a
1 as
1 fleece
2 had
4 lamb
4 little
2 mary
1 snow
1 was
1 white
1 whose
What’s next?
apply operations to tuple streams
apply operations to tuple streams
repeat.
Cookbook
Cookbook
// remove all tuples where search_term is '-' pipe = new Each(pipe, new Fields("search_term"), new RegexFilter("^(\\-)$", true));
Cookbook
// convert "timestamp" String field to "ts" long field pipe = new Each(pipe, new Fields("timestamp"), new DateParser("yyyy-MM-dd HH:mm:ss"), Fields.ALL);
Cookbook
// td - timestamp of day-level granularity pipe = new Each(pipe, new Fields("ts"), new ExpressionFunction( new Fields("td"), "ts - (ts % (24 * 60 * 60 * 1000))", long.class), Fields.ALL);
Cookbook // SELECT DISTINCT visitor_id
// GROUP BY visitor_id ORDER BY ts pipe = new GroupBy(pipe, new Fields("visitor_id"), new Fields("ts"));
// take the First() tuple of every grouping pipe = new Every(pipe, Fields.ALL, new First(), Fields.RESULTS);
Custom Operations
Desired Operation
Pipe pipe = new Each("ngram pipe", new Fields("line"), new NGramTokenizer(2), new Fields("ngram"));
Desired Operation
Pipe pipe = new Each("ngram pipe", new Fields("line"), new NGramTokenizer(2), new Fields("ngram"));
line
mary had a little lamb
Desired Operation
Pipe pipe = new Each("ngram pipe", new Fields("line"), new NGramTokenizer(2), new Fields("ngram"));
line
mary had a little lamb
ngram
mary had
Desired Operation
Pipe pipe = new Each("ngram pipe", new Fields("line"), new NGramTokenizer(2), new Fields("ngram"));
line
mary had a little lamb
ngram
mary had
ngram
had a
Desired Operation
Pipe pipe = new Each("ngram pipe", new Fields("line"), new NGramTokenizer(2), new Fields("ngram"));
line
mary had a little lamb
ngram
mary had
ngram
had a
ngram
a little
Desired Operation
Pipe pipe = new Each("ngram pipe", new Fields("line"), new NGramTokenizer(2), new Fields("ngram"));
line
mary had a little lamb
ngram
mary had
ngram
had a
ngram
a little
ngram
little lamb
Custom Operations
Custom Operations
• extend BaseOperation
Custom Operations
• extend BaseOperation
• implement Function (or Filter)
Custom Operations
• extend BaseOperation
• implement Function (or Filter)
• implement operate
public class NGramTokenizer extends BaseOperation implements Function{ private int size;
public NGramTokenizer(int size) { super(new Fields("ngram")); this.size = size; }
public void operate(FlowProcess flowProcess, FunctionCall functionCall) { String token; TupleEntry arguments = functionCall.getArguments(); com.attinteractive.util.NGramTokenizer t = new com.attinteractive.util.NGramTokenizer(arguments.getString(0), this.size);
while((token = t.next()) != null) { TupleEntry result = new TupleEntry(new Fields("ngram"), new Tuple(token)); functionCall.getOutputCollector().add(result); } }}
public class NGramTokenizer extends BaseOperation implements Function{ private int size;
public NGramTokenizer(int size) { super(new Fields("ngram")); this.size = size; }
public void operate(FlowProcess flowProcess, FunctionCall functionCall) { String token; TupleEntry arguments = functionCall.getArguments(); com.attinteractive.util.NGramTokenizer t = new com.attinteractive.util.NGramTokenizer(arguments.getString(0), this.size);
while((token = t.next()) != null) { TupleEntry result = new TupleEntry(new Fields("ngram"), new Tuple(token)); functionCall.getOutputCollector().add(result); } }}
public class NGramTokenizer extends BaseOperation implements Function{ private int size;
public NGramTokenizer(int size) { super(new Fields("ngram")); this.size = size; }
public void operate(FlowProcess flowProcess, FunctionCall functionCall) { String token; TupleEntry arguments = functionCall.getArguments(); com.attinteractive.util.NGramTokenizer t = new com.attinteractive.util.NGramTokenizer(arguments.getString(0), this.size);
while((token = t.next()) != null) { TupleEntry result = new TupleEntry(new Fields("ngram"), new Tuple(token)); functionCall.getOutputCollector().add(result); } }}
public class NGramTokenizer extends BaseOperation implements Function{ private int size;
public NGramTokenizer(int size) { super(new Fields("ngram")); this.size = size; }
public void operate(FlowProcess flowProcess, FunctionCall functionCall) { String token; TupleEntry arguments = functionCall.getArguments(); com.attinteractive.util.NGramTokenizer t = new com.attinteractive.util.NGramTokenizer(arguments.getString(0), this.size);
while((token = t.next()) != null) { TupleEntry result = new TupleEntry(new Fields("ngram"), new Tuple(token)); functionCall.getOutputCollector().add(result); } }}
public class NGramTokenizer extends BaseOperation implements Function{ private int size;
public NGramTokenizer(int size) { super(new Fields("ngram")); this.size = size; }
public void operate(FlowProcess flowProcess, FunctionCall functionCall) { String token; TupleEntry arguments = functionCall.getArguments(); com.attinteractive.util.NGramTokenizer t = new com.attinteractive.util.NGramTokenizer(arguments.getString(0), this.size);
while((token = t.next()) != null) { TupleEntry result = new TupleEntry(new Fields("ngram"), new Tuple(token)); functionCall.getOutputCollector().add(result); } }}
public class NGramTokenizer extends BaseOperation implements Function{ private int size;
public NGramTokenizer(int size) { super(new Fields("ngram")); this.size = size; }
public void operate(FlowProcess flowProcess, FunctionCall functionCall) { String token; TupleEntry arguments = functionCall.getArguments(); com.attinteractive.util.NGramTokenizer t = new com.attinteractive.util.NGramTokenizer(arguments.getString(0), this.size);
while((token = t.next()) != null) { TupleEntry result = new TupleEntry(new Fields("ngram"), new Tuple(token)); functionCall.getOutputCollector().add(result); } }}
public class NGramTokenizer extends BaseOperation implements Function{ private int size;
public NGramTokenizer(int size) { super(new Fields("ngram")); this.size = size; }
public void operate(FlowProcess flowProcess, FunctionCall functionCall) { String token; TupleEntry arguments = functionCall.getArguments(); com.attinteractive.util.NGramTokenizer t = new com.attinteractive.util.NGramTokenizer(arguments.getString(0), this.size);
while((token = t.next()) != null) { TupleEntry result = new TupleEntry(new Fields("ngram"), new Tuple(token)); functionCall.getOutputCollector().add(result); } }}
public class NGramTokenizer extends BaseOperation implements Function{ private int size;
public NGramTokenizer(int size) { super(new Fields("ngram")); this.size = size; }
public void operate(FlowProcess flowProcess, FunctionCall functionCall) { String token; TupleEntry arguments = functionCall.getArguments(); com.attinteractive.util.NGramTokenizer t = new com.attinteractive.util.NGramTokenizer(arguments.getString(0), this.size);
while((token = t.next()) != null) { TupleEntry result = new TupleEntry(new Fields("ngram"), new Tuple(token)); functionCall.getOutputCollector().add(result); } }}
Tuple 123 vis-abc tacos 2009-10-08 03:21:23 1
Tuple 123 vis-abc tacos 2009-10-08 03:21:23 1
zero-indexed, ordered values
Tuple 123 vis-abc tacos 2009-10-08 03:21:23 1
id
visitor_id
search_term
timestamp
page_number

Fields
list of strings representing field names
Tuple 123 vis-abc tacos 2009-10-08 03:21:23 1
id
visitor_id
search_term
timestamp
page_number

Fields
123 vis-abc tacos 2009-10-08 03:21:23 1
id
visitor_id
search_term
timestamp
page_number
TupleEntry
public class NGramTokenizer extends BaseOperation implements Function{ private int size;
public NGramTokenizer(int size) { super(new Fields("ngram")); this.size = size; }
public void operate(FlowProcess flowProcess, FunctionCall functionCall) { String token; TupleEntry arguments = functionCall.getArguments(); com.attinteractive.util.NGramTokenizer t = new com.attinteractive.util.NGramTokenizer(arguments.getString(0), this.size);
while((token = t.next()) != null) { TupleEntry result = new TupleEntry(new Fields("ngram"), new Tuple(token)); functionCall.getOutputCollector().add(result); } }}
public class NGramTokenizer extends BaseOperation implements Function{ private int size;
public NGramTokenizer(int size) { super(new Fields("ngram")); this.size = size; }
public void operate(FlowProcess flowProcess, FunctionCall functionCall) { String token; TupleEntry arguments = functionCall.getArguments(); com.attinteractive.util.NGramTokenizer t = new com.attinteractive.util.NGramTokenizer(arguments.getString(0), this.size);
while((token = t.next()) != null) { TupleEntry result = new TupleEntry(new Fields("ngram"), new Tuple(token)); functionCall.getOutputCollector().add(result); } }}
public class NGramTokenizer extends BaseOperation implements Function{ private int size;
public NGramTokenizer(int size) { super(new Fields("ngram")); this.size = size; }
public void operate(FlowProcess flowProcess, FunctionCall functionCall) { String token; TupleEntry arguments = functionCall.getArguments(); com.attinteractive.util.NGramTokenizer t = new com.attinteractive.util.NGramTokenizer(arguments.getString(0), this.size);
while((token = t.next()) != null) { TupleEntry result = new TupleEntry(new Fields("ngram"), new Tuple(token)); functionCall.getOutputCollector().add(result); } }}
Success
Pipe pipe = new Each("ngram pipe", new Fields("line"), new NGramTokenizer(2), new Fields("ngram"));
line
mary had a little lamb
ngram
mary had
ngram
had a
ngram
a little
ngram
little lamb
code examples
dot
Flow importFlow = flowConnector.connect(sourceTap, tmpIntermediateTap, importPipe);
importFlow.writeDOT("import-flow.dot");
What next?
What next?
What next?
• http://www.cascading.org/
What next?
• http://www.cascading.org/
• http://bit.ly/cascading