Introduction to Cascading

Uploaded by: nate-murray

Posted on 26-Jan-2015


DESCRIPTION

A practical introduction to Chris Wensel's Cascading MapReduce/Hadoop framework.

TRANSCRIPT

Page 1: Intro To Cascading

Introduction to

Cascading

Page 2: Intro To Cascading

What is Cascading?

Page 3: Intro To Cascading

What is Cascading?

• abstraction over MapReduce

Page 4: Intro To Cascading

What is Cascading?

• abstraction over MapReduce

• API for data-processing workflows

Page 5: Intro To Cascading

Why use Cascading?

Page 6: Intro To Cascading

Why use Cascading?

• MapReduce can be:

Page 7: Intro To Cascading

Why use Cascading?

• MapReduce can be:

• the wrong level of granularity

Page 8: Intro To Cascading

Why use Cascading?

• MapReduce can be:

• the wrong level of granularity

• cumbersome to chain together

Page 9: Intro To Cascading

Why use Cascading?

Page 10: Intro To Cascading

Why use Cascading?

• Cascading helps create

Page 11: Intro To Cascading

Why use Cascading?

• Cascading helps create

• higher-level data processing abstractions

Page 12: Intro To Cascading

Why use Cascading?

• Cascading helps create

• higher-level data processing abstractions

• sophisticated data pipelines

Page 13: Intro To Cascading

Why use Cascading?

• Cascading helps create

• higher-level data processing abstractions

• sophisticated data pipelines

• reusable components

Page 14: Intro To Cascading

Credits

• Cascading written by Chris Wensel

• based on his User Guide

http://bit.ly/cascading

Page 15: Intro To Cascading

package cascadingtutorial.wordcount;

/** Wordcount example in Cascading */
public class Main {
  public static void main(String[] args) {
    String inputPath  = args[0];
    String outputPath = args[1];

    Scheme inputScheme  = new TextLine(new Fields("offset", "line"));
    Scheme outputScheme = new TextLine();

    Tap sourceTap = inputPath.matches("^[^:]+://.*")
        ? new Hfs(inputScheme, inputPath) : new Lfs(inputScheme, inputPath);
    Tap sinkTap = outputPath.matches("^[^:]+://.*")
        ? new Hfs(outputScheme, outputPath) : new Lfs(outputScheme, outputPath);

    Pipe wcPipe = new Each("wordcount", new Fields("line"),
        new RegexSplitGenerator(new Fields("word"), "\\s+"), new Fields("word"));
    wcPipe = new GroupBy(wcPipe, new Fields("word"));
    wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word"));

    Properties properties = new Properties();
    FlowConnector.setApplicationJarClass(properties, Main.class);

    Flow parsedLogFlow = new FlowConnector(properties)
        .connect(sourceTap, sinkTap, wcPipe);
    parsedLogFlow.start();
    parsedLogFlow.complete();
  }
}

Page 16: Intro To Cascading

Data Model

(diagram: data in files flowing through Pipes and Filters)

Page 25: Intro To Cascading

Tuple
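
As a rough sketch (using the cascading.tuple classes that appear in the imports later in this deck; exact method names vary between Cascading versions), a Tuple is an ordered list of values and Fields names those positions:

import cascading.tuple.Fields;
import cascading.tuple.Tuple;
import cascading.tuple.TupleEntry;

public class TupleSketch {
  public static void main(String[] args) {
    Fields fields = new Fields("word", "count");      // names for the positions
    Tuple tuple = new Tuple("hello", 2);              // the ordered values
    TupleEntry entry = new TupleEntry(fields, tuple); // pairs names with values
    System.out.println(entry.getString("word"));      // -> hello
    System.out.println(entry.getInteger("count"));    // -> 2
  }
}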

Page 26: Intro To Cascading

Pipe Assembly

Page 27: Intro To Cascading

Pipe Assembly

Tuple Stream

Page 28: Intro To Cascading

Taps: sources & sinks

Page 29: Intro To Cascading

Taps: sources & sinks

Taps

Page 32: Intro To Cascading

Taps: sources & sinks

Taps

Source

Page 33: Intro To Cascading

Taps: sources & sinks

Taps

Source Sink
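
A minimal sketch of declaring taps, based on the Hfs/Lfs taps used in the wordcount listing below (the output path here is made up for illustration):

import cascading.scheme.Scheme;
import cascading.scheme.TextLine;
import cascading.tap.Hfs;
import cascading.tap.Lfs;
import cascading.tap.Tap;
import cascading.tuple.Fields;

public class TapSketch {
  public static void main(String[] args) {
    Scheme scheme = new TextLine(new Fields("offset", "line"));
    // source tap: where tuples are read from (here HDFS)
    Tap source = new Hfs(scheme, "hdfs://master0:54310/user/nmurray/data.txt");
    // sink tap: where result tuples are written (here the local file system)
    Tap sink = new Lfs(new TextLine(), "output/wordcount");
  }
}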

Page 35: Intro To Cascading

Pipe + Taps = Flow

Flow
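
Connecting the two looks roughly like this (a sketch mirroring the FlowConnector call in the wordcount listing, not the full program):

import cascading.flow.Flow;
import cascading.flow.FlowConnector;
import cascading.pipe.Pipe;
import cascading.tap.Tap;

public class FlowSketch {
  public static void run(Tap source, Tap sink, Pipe assembly) {
    // a Flow binds one pipe assembly to its source and sink taps
    // and plans the underlying MapReduce jobs
    Flow flow = new FlowConnector().connect(source, sink, assembly);
    flow.complete(); // start() is non-blocking; complete() runs the flow and waits
  }
}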

Page 36: Intro To Cascading

n Flows = Cascade

(diagram: several Flows chained together into a complex Cascade)
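
A sketch of wiring several flows into a cascade, assuming the CascadeConnector class from the cascading.cascade package (the flow names are illustrative):

import cascading.cascade.Cascade;
import cascading.cascade.CascadeConnector;
import cascading.flow.Flow;

public class CascadeSketch {
  // flows are built elsewhere with FlowConnector.connect(...);
  // CascadeConnector links them through their shared taps and
  // runs them in dependency order
  public static void runAsCascade(Flow importFlow, Flow parseFlow, Flow countFlow) {
    Cascade cascade = new CascadeConnector().connect(importFlow, parseFlow, countFlow);
    cascade.complete(); // blocks until every flow in the cascade has finished
  }
}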

Page 40: Intro To Cascading

Example

Page 43: Intro To Cascading

package cascadingtutorial.wordcount;

import cascading.flow.Flow;
import cascading.flow.FlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexParser;
import cascading.operation.regex.RegexSplitGenerator;
import cascading.operation.regex.RegexSplitter;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.scheme.Scheme;
import cascading.scheme.TextLine;
import cascading.tap.Hfs;
import cascading.tap.Lfs;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tuple.Fields;
import java.util.Properties;

/** Wordcount example in Cascading */
public class Main {
  public static void main(String[] args) {
    Scheme inputScheme  = new TextLine(new Fields("offset", "line"));
    Scheme outputScheme = new TextLine();

    String inputPath  = args[0];
    String outputPath = args[1];

    Tap sourceTap = inputPath.matches("^[^:]+://.*")
        ? new Hfs(inputScheme, inputPath) : new Lfs(inputScheme, inputPath);
    Tap sinkTap = outputPath.matches("^[^:]+://.*")
        ? new Hfs(outputScheme, outputPath) : new Lfs(outputScheme, outputPath);

    Pipe wcPipe = new Each("wordcount", new Fields("line"),
        new RegexSplitGenerator(new Fields("word"), "\\s+"), new Fields("word"));
    wcPipe = new GroupBy(wcPipe, new Fields("word"));
    wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word"));

    // sort by count
    // wcPipe = new GroupBy(wcPipe, Fields.ALL, new Fields("count"), true);

    Properties properties = new Properties();
    FlowConnector.setApplicationJarClass(properties, Main.class);

    Flow parsedLogFlow = new FlowConnector(properties)
        .connect(sourceTap, sinkTap, wcPipe);
    parsedLogFlow.start();
    parsedLogFlow.complete();
  }
}

Page 46: Intro To Cascading

Schemes: TextLine(), SequenceFile()

Page 54: Intro To Cascading

Taps in the wordcount example:

hdfs://master0:54310/user/nmurray/data.txt -> new Hfs()
data/sources/obama-inaugural-address.txt -> new Lfs()

other Taps: S3fs(), GlobHfs()

Page 57: Intro To Cascading

Pipe Assembly

Page 58: Intro To Cascading

Pipe Assemblies

Page 60: Intro To Cascading

Pipe Assemblies

• Define work against a Tuple Stream

Page 61: Intro To Cascading

Pipe Assemblies

• Define work against a Tuple Stream

• May have multiple sources and sinks

Page 62: Intro To Cascading

Pipe Assemblies

• Define work against a Tuple Stream

• May have multiple sources and sinks

• Splits

Page 63: Intro To Cascading

Pipe Assemblies

• Define work against a Tuple Stream

• May have multiple sources and sinks

• Splits

• Merges

Page 64: Intro To Cascading

Pipe Assemblies

• Define work against a Tuple Stream

• May have multiple sources and sinks

• Splits

• Merges

• Joins

Page 65: Intro To Cascading

Pipe Assemblies

Page 66: Intro To Cascading

• Pipe

Pipe Assemblies

Page 67: Intro To Cascading

• Pipe

• Each

• GroupBy

• CoGroup

• Every

• SubAssembly

Pipe Assemblies

Page 69: Intro To Cascading

• Pipe

• Each

• GroupBy

• CoGroup

• Every

• SubAssembly

Applies a Function or Filter Operation to each Tuple

Pipe Assemblies

Page 72: Intro To Cascading

• Pipe

• Each

• GroupBy

• CoGroup

• Every

• SubAssembly

Pipe Assemblies

Group (& merge)

Page 73: Intro To Cascading

• Pipe

• Each

• GroupBy

• CoGroup

• Every

• SubAssembly

Pipe Assemblies

Joins (inner, outer, left, right)
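
A sketch of a CoGroup join (pipe and field names are made up for illustration; the InnerJoin joiner is assumed from the cascading.pipe.cogroup package):

import cascading.pipe.CoGroup;
import cascading.pipe.Pipe;
import cascading.pipe.cogroup.InnerJoin;
import cascading.tuple.Fields;

public class JoinSketch {
  public static Pipe joinUsersWithVisits() {
    Pipe users  = new Pipe("users");   // tuples: user_id, name
    Pipe visits = new Pipe("visits");  // tuples: user_id, url
    // inner join the two tuple streams on user_id; the declared output
    // fields must be unique, so the right-hand key is renamed
    return new CoGroup(users, new Fields("user_id"),
                       visits, new Fields("user_id"),
                       new Fields("user_id", "name", "visit_user_id", "url"),
                       new InnerJoin());
  }
}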

Page 75: Intro To Cascading

• Pipe

• Each

• GroupBy

• CoGroup

• Every

• SubAssembly

Pipe Assemblies

Applies an Aggregator (count, sum) to every group of Tuples.

Page 76: Intro To Cascading

• Pipe

• Each

• GroupBy

• CoGroup

• Every

• SubAssembly

Pipe Assemblies

Nesting

Page 77: Intro To Cascading

Each vs. Every

Page 78: Intro To Cascading

Each vs. Every

• Each() is for individual Tuples

Page 79: Intro To Cascading

Each vs. Every

• Each() is for individual Tuples

• Every() is for groups of Tuples

Page 80: Intro To Cascading

Each

new Each(previousPipe, argumentSelector, operation, outputSelector)

Page 81: Intro To Cascading

Each

new Each(previousPipe, argumentSelector, operation, outputSelector)

Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "\\s+"), new Fields("word"));

Page 88: Intro To Cascading

Each

new Each(previousPipe, argumentSelector, operation, outputSelector)

Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "\\s+"), new Fields("word"));

Fields()

Page 89: Intro To Cascading

Each

Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "\\s+"), new Fields("word"));

Input tuple:

offset  line
0       the lazy brown fox

Output tuples after the split (the original "offset"/"line" fields are dropped by the output selector):

word
lazy
brown
fox
the

Page 93: Intro To Cascading

Operations: Functions

Page 94: Intro To Cascading

Operations: Functions

• Insert()

Page 95: Intro To Cascading

Operations: Functions

• Insert()

• RegexParser() / RegexGenerator() / RegexReplace()

Page 96: Intro To Cascading

Operations: Functions

• Insert()

• RegexParser() / RegexGenerator() / RegexReplace()

• DateParser() / DateFormatter()

Page 97: Intro To Cascading

Operations: Functions

• Insert()

• RegexParser() / RegexGenerator() / RegexReplace()

• DateParser() / DateFormatter()

• XPathGenerator()

Page 98: Intro To Cascading

Operations: Functions

• Insert()

• RegexParser() / RegexGenerator() / RegexReplace()

• DateParser() / DateFormatter()

• XPathGenerator()

• Identity()

Page 99: Intro To Cascading

Operations: Functions

• Insert()

• RegexParser() / RegexGenerator() / RegexReplace()

• DateParser() / DateFormatter()

• XPathGenerator()

• Identity()

• etc...
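
For example, a function is applied with Each; this sketch uses Insert() to add a constant field (the field name and value are made up for illustration):

import cascading.operation.Insert;
import cascading.pipe.Each;
import cascading.pipe.Pipe;
import cascading.tuple.Fields;

public class FunctionSketch {
  public static Pipe tagSource(Pipe pipe) {
    // append a constant "source" field to every tuple,
    // keeping all existing fields (output selector Fields.ALL)
    return new Each(pipe, new Insert(new Fields("source"), "wordcount"), Fields.ALL);
  }
}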

Page 100: Intro To Cascading

Operations: Filters

Page 101: Intro To Cascading

Operations: Filters

• RegexFilter()

Page 102: Intro To Cascading

Operations: Filters

• RegexFilter()

• FilterNull()

Page 103: Intro To Cascading

Operations: Filters

• RegexFilter()

• FilterNull()

• And() / Or() / Not() / Xor()

Page 104: Intro To Cascading

Operations: Filters

• RegexFilter()

• FilterNull()

• And() / Or() / Not() / Xor()

• ExpressionFilter()

Page 105: Intro To Cascading

Operations: Filters

• RegexFilter()

• FilterNull()

• And() / Or() / Not() / Xor()

• ExpressionFilter()

• etc...
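
For example, a filter is also applied with Each; this sketch keeps only alphabetic words (the "word" field name is taken from the wordcount example):

import cascading.operation.regex.RegexFilter;
import cascading.pipe.Each;
import cascading.pipe.Pipe;
import cascading.tuple.Fields;

public class FilterSketch {
  public static Pipe keepWordsOnly(Pipe pipe) {
    // tuples whose "word" field matches the pattern are kept;
    // all other tuples are removed from the stream
    return new Each(pipe, new Fields("word"), new RegexFilter("^[A-Za-z]+$"));
  }
}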

Page 112: Intro To Cascading

Every

new Every(previousPipe, argumentSelector, operation, outputSelector)

Page 113: Intro To Cascading

Every

new Every(previousPipe, argumentSelector, operation, outputSelector)

new Every(wcPipe, new Count(), new Fields("count", "word"));

Page 114: Intro To Cascading

new Every(wcPipe, new Count(), new Fields("count", "word"));

Page 115: Intro To Cascading

Operations: Aggregator

Page 116: Intro To Cascading

Operations: Aggregator

• Average()

Page 117: Intro To Cascading

Operations: Aggregator

• Average()

• Count()

Page 118: Intro To Cascading

Operations: Aggregator

• Average()

• Count()

• First() / Last()

Page 119: Intro To Cascading

Operations: Aggregator

• Average()

• Count()

• First() / Last()

• Min() / Max()

Page 120: Intro To Cascading

Operations: Aggregator

• Average()

• Count()

• First() / Last()

• Min() / Max()

• Sum()
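
For example, an aggregator runs inside Every after a GroupBy; this sketch sums a hypothetical "amount" field per user (the field names are made up for illustration):

import cascading.operation.aggregator.Sum;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.tuple.Fields;

public class AggregatorSketch {
  public static Pipe totalPerUser(Pipe pipe) {
    // group the stream by user, then sum each group's "amount" into "total"
    pipe = new GroupBy(pipe, new Fields("user_id"));
    return new Every(pipe, new Fields("amount"), new Sum(new Fields("total")));
  }
}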

Page 123: Intro To Cascading

Every

new Every(previousPipe, argumentSelector, operation, outputSelector)

new Every(wcPipe, new Count(), new Fields("count", "word"));

new Every(wcPipe, Fields.ALL, new Count(), new Fields("count", "word"));

Page 124: Intro To Cascading

Field Selection

Page 125: Intro To Cascading

Predefined Field Sets

Page 126: Intro To Cascading

Predefined Field Sets

Fields.ALL      all available fields
Fields.GROUP    fields used for the last grouping
Fields.VALUES   fields not used for the last grouping
Fields.ARGS     fields of the argument Tuple (for Operations)
Fields.RESULTS  replaces input with the Operation result (for Pipes)
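
As a small illustration building on the wordcount Each above, the output selector decides which of these field sets survive (a sketch; behavior as described in the field-set table):

import cascading.operation.regex.RegexSplitGenerator;
import cascading.pipe.Each;
import cascading.pipe.Pipe;
import cascading.tuple.Fields;

public class FieldSetSketch {
  public static Pipe split(Pipe pipe) {
    // Fields.RESULTS keeps only the generator's output ("word");
    // Fields.ALL would keep "offset", "line" and "word" together
    return new Each(pipe, new Fields("line"),
        new RegexSplitGenerator(new Fields("word"), "\\s+"), Fields.RESULTS);
  }
}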

Page 131: Intro To Cascading

Field Selection

pipe = new Each(pipe, new Fields("timestamp"), new DateParser("yyyy-MM-dd HH:mm:ss"), new Fields("timestamp"));

new Each(previousPipe, argumentSelector, operation, outputSelector)

Page 132: Intro To Cascading

Field Selection

new Each(previousPipe, argumentSelector, operation, outputSelector)

Operation Input Tuple = Argument Selector applied to the Original Tuple

Page 133: Intro To Cascading

Field Selection

pipe = new Each(pipe, new Fields("timestamp"), new DateParser("yyyy-MM-dd HH:mm:ss"), new Fields("ts", "search_term", "visitor_id"));

Page 134: Intro To Cascading

Field Selection

pipe = new Each(pipe, new Fields("timestamp"), new DateParser("yyyy-MM-dd HH:mm:ss"), new Fields("ts", "search_term", "visitor_id"));

Original Tuple:     id | visitor_id | search_term | timestamp | page_number
Argument Selector:  timestamp

Input to DateParser will be: timestamp

Page 138: Intro To Cascading

Field Selection

new Each(previousPipe, argumentSelector, operation, outputSelector)

Output Tuple = Output Selector applied to (Original Tuple ⊕ Operation Tuple)

Page 139: Intro To Cascading

Field Selection

pipe = new Each(pipe, new Fields("timestamp"), new DateParser("yyyy-MM-dd HH:mm:ss"), new Fields("ts", "search_term", "visitor_id"));

Original Tuple:     id | visitor_id | search_term | timestamp | page_number
DateParser Output:  ts
Output Selector:    ts | search_term | visitor_id

Output of Each will be: ts | search_term | visitor_id

Page 151: Intro To Cascading

new Every(wcPipe, new Count(), new Fields("count", "word"));

Tuple stream grouped on "word":

"hello" group: [ word: hello ], [ word: hello ]
"world" group: [ word: world ]

Page 152: Intro To Cascading

new Every(wcPipe, new Count(), new Fields("count", "word"));

Tuple stream grouped on "word":

"hello" group: [ word: hello ], [ word: hello ]
"world" group: [ word: world ]

Count output:

[ count: 2 | word: hello ]
[ count: 1 | word: world ]

Page 153: Intro To Cascading
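In code, this is the GroupBy/Every pair from the wordcount example, restated here with comments (nothing new is assumed):

// GroupBy gathers tuples sharing the same "word" value into groups.
wcPipe = new GroupBy(wcPipe, new Fields("word"));

// Every applies the Count aggregator once per group and keeps the
// ("count", "word") fields in the output, e.g. (2, "hello") and (1, "world").
wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word"));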

package cascadingtutorial.wordcount;

import cascading.flow.Flow;
import cascading.flow.FlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexParser;
import cascading.operation.regex.RegexSplitGenerator;
import cascading.operation.regex.RegexSplitter;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.scheme.Scheme;
import cascading.scheme.TextLine;
import cascading.tap.Hfs;
import cascading.tap.Lfs;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tuple.Fields;
import java.util.Properties;

/**
 * Wordcount example in Cascading
 */
public class Main {

  public static void main(String[] args) {
    Scheme inputScheme  = new TextLine(new Fields("offset", "line"));
    Scheme outputScheme = new TextLine();

    String inputPath  = args[0];
    String outputPath = args[1];

    Tap sourceTap = inputPath.matches("^[^:]+://.*")
        ? new Hfs(inputScheme, inputPath)
        : new Lfs(inputScheme, inputPath);
    Tap sinkTap = outputPath.matches("^[^:]+://.*")
        ? new Hfs(outputScheme, outputPath)
        : new Lfs(outputScheme, outputPath);

    Pipe wcPipe = new Each("wordcount", new Fields("line"),
                           new RegexSplitGenerator(new Fields("word"), "\\s+"),
                           new Fields("word"));

    wcPipe = new GroupBy(wcPipe, new Fields("word"));
    wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word"));

    Properties properties = new Properties();
    FlowConnector.setApplicationJarClass(properties, Main.class);

    Flow parsedLogFlow = new FlowConnector(properties)
        .connect(sourceTap, sinkTap, wcPipe);
    parsedLogFlow.start();
    parsedLogFlow.complete();
  }
}

new Every(wcPipe, new Count(), new Fields("count", "word"));

Page 154: Intro To Cascading


Run

Page 159: Intro To Cascading

input:

mary had a little lamb
little lamb
little lamb
mary had a little lamb
whose fleece was white as snow

Page 160: Intro To Cascading

input:

mary had a little lamb
little lamb
little lamb
mary had a little lamb
whose fleece was white as snow

command:

hadoop jar ./target/cascading-tutorial-1.0.0.jar \
  data/sources/misc/mary.txt data/output

Page 161: Intro To Cascading

output:

2   a
1   as
1   fleece
2   had
4   lamb
4   little
2   mary
1   snow
1   was
1   white
1   whose

Page 162: Intro To Cascading

What’s next?

Page 163: Intro To Cascading

apply operations to tuple streams

Page 164: Intro To Cascading

apply operations to tuple streams

repeat.

Page 165: Intro To Cascading

Cookbook

Page 166: Intro To Cascading

Cookbook

// remove all tuples where search_term is '-'
// (the boolean argument tells RegexFilter to remove tuples that match)
pipe = new Each(pipe, new Fields("search_term"),
                new RegexFilter("^(\\-)$", true));

Page 167: Intro To Cascading

Cookbook

// convert "timestamp" String field to "ts" long field pipe = new Each(pipe, new Fields("timestamp"), new DateParser("yyyy-MM-dd HH:mm:ss"), Fields.ALL);

Page 168: Intro To Cascading

Cookbook

// td - timestamp of day-level granularity
pipe = new Each(pipe, new Fields("ts"),
                new ExpressionFunction(new Fields("td"),
                    "ts - (ts % (24 * 60 * 60 * 1000))", long.class),
                Fields.ALL);

Page 169: Intro To Cascading
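To see what the expression computes, here is a small worked example; the epoch-millisecond value below is illustrative, not taken from the slides:

// ts is milliseconds since the epoch; the modulo strips the
// time-of-day part, leaving midnight (UTC) of the same day.
long ts = 1254972083000L;                     // 2009-10-08 03:21:23 UTC
long td = ts - (ts % (24 * 60 * 60 * 1000));  // 1254960000000L = 2009-10-08 00:00:00 UTC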

Cookbook

// SELECT DISTINCT visitor_id
// GROUP BY visitor_id ORDER BY ts
pipe = new GroupBy(pipe, new Fields("visitor_id"), new Fields("ts"));

// take the First() tuple of every grouping
pipe = new Every(pipe, Fields.ALL, new First(), Fields.RESULTS);

Page 170: Intro To Cascading
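Chained together, the cookbook recipes form one assembly. A sketch under the same assumptions as the recipes above: 'pipe' carries the raw log tuple (id, visitor_id, search_term, timestamp, page_number) and the operation classes used below are imported.

// 1. drop tuples whose search_term is '-'
pipe = new Each(pipe, new Fields("search_term"),
                new RegexFilter("^(\\-)$", true));

// 2. parse "timestamp" into a long "ts" field, keeping the other fields
pipe = new Each(pipe, new Fields("timestamp"),
                new DateParser("yyyy-MM-dd HH:mm:ss"), Fields.ALL);

// 3. add a day-granularity "td" field derived from "ts"
pipe = new Each(pipe, new Fields("ts"),
                new ExpressionFunction(new Fields("td"),
                    "ts - (ts % (24 * 60 * 60 * 1000))", long.class),
                Fields.ALL);

// 4. one tuple per visitor_id: group by visitor_id, sort by ts,
//    and keep only the first tuple of each group
pipe = new GroupBy(pipe, new Fields("visitor_id"), new Fields("ts"));
pipe = new Every(pipe, Fields.ALL, new First(), Fields.RESULTS);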

Custom Operations

Page 171: Intro To Cascading

Desired Operation

Pipe pipe = new Each("ngram pipe", new Fields("line"), new NGramTokenizer(2), new Fields("ngram"));

Page 172: Intro To Cascading

Desired Operation

Pipe pipe = new Each("ngram pipe", new Fields("line"), new NGramTokenizer(2), new Fields("ngram"));

line

mary had a little lamb

Page 173: Intro To Cascading

Desired Operation

Pipe pipe = new Each("ngram pipe", new Fields("line"), new NGramTokenizer(2), new Fields("ngram"));

line

mary had a little lamb

ngram

mary had

Page 174: Intro To Cascading

Desired Operation

Pipe pipe = new Each("ngram pipe", new Fields("line"), new NGramTokenizer(2), new Fields("ngram"));

line

mary had a little lamb

ngram

mary had

ngram

had a

Page 175: Intro To Cascading

Desired Operation

Pipe pipe = new Each("ngram pipe", new Fields("line"), new NGramTokenizer(2), new Fields("ngram"));

line

mary had a little lamb

ngram

mary had

ngram

had a

ngram

a little

Page 176: Intro To Cascading

Desired Operation

Pipe pipe = new Each("ngram pipe", new Fields("line"), new NGramTokenizer(2), new Fields("ngram"));

line

mary had a little lamb

ngram

mary had

ngram

had a

ngram

a little

ngram

little lamb

Page 177: Intro To Cascading

Custom Operations

Page 178: Intro To Cascading

Custom Operations

• extend BaseOperation

Page 179: Intro To Cascading

Custom Operations

• extend BaseOperation

• implement Function (or Filter)

Page 180: Intro To Cascading

Custom Operations

• extend BaseOperation

• implement Function (or Filter)

• implement operate

Page 181: Intro To Cascading

public class NGramTokenizer extends BaseOperation implements Function {

  private int size;

  public NGramTokenizer(int size) {
    super(new Fields("ngram"));  // declare the single output field
    this.size = size;
  }

  public void operate(FlowProcess flowProcess, FunctionCall functionCall) {
    String token;
    TupleEntry arguments = functionCall.getArguments();
    com.attinteractive.util.NGramTokenizer t =
        new com.attinteractive.util.NGramTokenizer(arguments.getString(0), this.size);

    // emit one result tuple per n-gram found in the incoming argument
    while ((token = t.next()) != null) {
      TupleEntry result = new TupleEntry(new Fields("ngram"), new Tuple(token));
      functionCall.getOutputCollector().add(result);
    }
  }
}

Page 182: Intro To Cascading


Tuple: [ 123 | vis-abc | tacos | 2009-10-08 03:21:23 | 1 ]

Page 190: Intro To Cascading

Tuple: [ 123 | vis-abc | tacos | 2009-10-08 03:21:23 | 1 ]

zero-indexed, ordered values

Page 191: Intro To Cascading

Tuple:  [ 123 | vis-abc | tacos | 2009-10-08 03:21:23 | 1 ]

Fields: [ id | visitor_id | search_term | timestamp | page_number ]

Fields: a list of strings representing field names

Page 192: Intro To Cascading


TupleEntry

Tuple:  [ 123 | vis-abc | tacos | 2009-10-08 03:21:23 | 1 ]

Fields: [ id | visitor_id | search_term | timestamp | page_number ]

Page 194: Intro To Cascading
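A minimal sketch of how the three classes fit together, using the values from the slide; the by-name accessor on the last line is an assumption about the Cascading tuple API rather than something shown in the deck:

// Fields names the positions; Tuple holds the ordered values;
// TupleEntry binds them so values can be read by field name.
Fields fields = new Fields("id", "visitor_id", "search_term", "timestamp", "page_number");
Tuple  values = new Tuple(123, "vis-abc", "tacos", "2009-10-08 03:21:23", 1);

TupleEntry entry = new TupleEntry(fields, values);
String term = entry.getString("search_term");  // "tacos"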


Success

Pipe pipe = new Each("ngram pipe", new Fields("line"), new NGramTokenizer(2), new Fields("ngram"));

line

mary had a little lamb

ngram

mary had

ngram

had a

ngram

a little

ngram

little lamb

Page 198: Intro To Cascading
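To run the custom Function end to end, it can be dropped into the same tap-and-flow scaffolding as the wordcount example. A sketch with assumed local input and output paths:

// Source/sink taps and flow wiring mirror the wordcount example;
// the paths below are assumed, and Main.class stands in for the
// application's jar class as in that example.
Scheme lineScheme = new TextLine(new Fields("offset", "line"));
Tap source = new Lfs(lineScheme, "data/sources/misc/mary.txt");
Tap sink   = new Lfs(new TextLine(), "data/output/ngrams");

Pipe pipe = new Each("ngram pipe", new Fields("line"),
                     new NGramTokenizer(2), new Fields("ngram"));
pipe = new GroupBy(pipe, new Fields("ngram"));
pipe = new Every(pipe, new Count(), new Fields("count", "ngram"));

Properties properties = new Properties();
FlowConnector.setApplicationJarClass(properties, Main.class);

Flow ngramFlow = new FlowConnector(properties).connect(source, sink, pipe);
ngramFlow.complete();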

code examples

Page 199: Intro To Cascading

dot

Page 200: Intro To Cascading

dot

Flow importFlow = flowConnector.connect(sourceTap, tmpIntermediateTap, importPipe);
importFlow.writeDOT("import-flow.dot");

Page 201: Intro To Cascading
Page 202: Intro To Cascading
Page 203: Intro To Cascading
Page 204: Intro To Cascading
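writeDOT dumps the planned flow as a Graphviz .dot file before the flow runs, which is useful for seeing how the pipe assembly was mapped onto MapReduce jobs. A sketch reusing the wordcount flow's taps and pipe; the output filename is assumed:

// Write the plan, then run the flow; the .dot file can be rendered
// with Graphviz to inspect the job graph.
Flow wcFlow = new FlowConnector(properties).connect(sourceTap, sinkTap, wcPipe);
wcFlow.writeDOT("wordcount-flow.dot");
wcFlow.complete();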


What next?

• http://www.cascading.org/

• http://bit.ly/cascading

Page 209: Intro To Cascading