Creating Map-Reduce Programs Using Hadoop
Presentation Overview
Recall Hadoop
Overview of the map-reduce paradigm
Elaboration on the WordCount example
components of Hadoop that make WordCount possible
Major new example: N-Gram Generator
step-by-step assembly of this map-reduce job
Design questions to ask when creating your own Hadoop jobs
Recall why Hadoop rocks
Hadoop is:
Free and open source
High quality, like all Apache Foundation projects
Cross-platform (pure Java)
Fault-tolerant
Highly scalable
Has bindings for non-Java programming languages
Applicable to many computational problems
Map-Reduce System Overview
JobTracker – Makes scheduling decisions
TaskTracker – Manages tasks for a given node
Task process
Runs an individual map or reduce fragment for a given job
Forks from the TaskTracker
Map-Reduce System Overview
Processes communicate by custom RPC implementation
Easy to change/extend
Defined as Java interfaces
Server objects implement the interface
Client proxy objects automatically created
All messages originate at the client (e.g., Task to TaskTracker)
Prevents cycles and therefore deadlocks
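The proxy mechanism is essentially Java dynamic proxies. A self-contained illustration of the idea (this is not Hadoop's actual RPC code; the interface and handler here are invented):

    import java.lang.reflect.InvocationHandler;
    import java.lang.reflect.Method;
    import java.lang.reflect.Proxy;

    // A protocol is just a Java interface, like Hadoop's real ones.
    interface PingProtocol {
        String ping(String from);
    }

    public class RpcSketch {
        public static void main(String[] args) {
            // The client proxy is generated automatically; each method call
            // becomes a message to the server-side object.
            PingProtocol client = (PingProtocol) Proxy.newProxyInstance(
                PingProtocol.class.getClassLoader(),
                new Class<?>[] { PingProtocol.class },
                new InvocationHandler() {
                    public Object invoke(Object proxy, Method m, Object[] a) {
                        // A real implementation would serialize the method name
                        // and arguments, send them over a socket, and block on
                        // the reply; here we fake the round trip.
                        return "pong for " + a[0];
                    }
                });
            System.out.println(client.ping("task-0001"));  // prints "pong for task-0001"
        }
    }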
Process Flow Diagram
Application Overview
Launching Program
Creates a JobConf to define a job.
Submits JobConf to JobTracker and waits for completion.
Mapper
Is given a stream of (key1, value1) pairs
Generates a stream of (key2, value2) pairs
Reducer
Is given a key2 and a stream of value2's
Generates a stream of (key3, value3) pairs
Job Launch Process: Client
Client program creates a JobConf
Identify classes implementing Mapper and Reducer interfaces
JobConf.setMapperClass(); JobConf.setReducerClass()
Specify input and output formats
JobConf.setInputFormat(TextInputFormat.class);
JobConf.setOutputFormat(TextOutputFormat.class);
Other options too:
JobConf.setNumReduceTasks()
JobConf.setOutputFormat()
Many, many more (Facade pattern)
An onslaught of terminology
We'll explain these terms, each of which plays a role in any non-trivial map/reduce job:
InputFormat, OutputFormat, FileInputFormat, ...
JobClient and JobConf
JobTracker and TaskTracker
TaskRunner, MapTaskRunner, MapRunner, …
InputSplit, RecordReader, LineRecordReader, ...
Writable, WritableComparable, IntWritable, ...
InputFormat and OutputFormat
The application also chooses input and output formats, which define how the persistent data is read and written. These are interfaces and can be defined by the application.
InputFormat
Splits the input to determine the input to each map task.
Defines a RecordReader that reads key, value pairs that are passed to the map task
OutputFormat
Given the key, value pairs and a filename, writes the reduce task output to persistent store.
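For reference, the two interfaces have roughly this shape in the old org.apache.hadoop.mapred API (an abbreviated sketch; consult the Javadoc for the exact signatures):

    public interface InputFormat<K, V> {
        // Splits the input; one map task per InputSplit
        InputSplit[] getSplits(JobConf job, int numSplits) throws IOException;
        // Factory for the RecordReader that feeds (key, value) pairs to the map task
        RecordReader<K, V> getRecordReader(InputSplit split, JobConf job,
                                           Reporter reporter) throws IOException;
    }

    public interface OutputFormat<K, V> {
        // Factory for the RecordWriter that persists the reduce output
        RecordWriter<K, V> getRecordWriter(FileSystem fs, JobConf job,
                                           String name, Progressable progress)
            throws IOException;
        // Checks the output specification (e.g., that the output dir does not already exist)
        void checkOutputSpecs(FileSystem fs, JobConf job) throws IOException;
    }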
Example

public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
}
Job Launch Process: JobClient
Pass JobConf to JobClient.runJob() or JobClient.submitJob()
runJob() blocks – waits until the job finishes
submitJob() does not
Poll for status to make running decisions (a polling sketch follows)
Avoid polling with JobConf.setJobEndNotificationURI()
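A minimal sketch of the non-blocking path, using the RunningJob handle that submitJob() returns (fragment only; the polling interval and print format are arbitrary choices, and Thread.sleep() must be handled by the caller):

    JobClient jc = new JobClient(conf);
    RunningJob rj = jc.submitJob(conf);        // returns immediately
    while (!rj.isComplete()) {                 // poll for status
        System.out.printf("map %.0f%%, reduce %.0f%%%n",
                          rj.mapProgress() * 100, rj.reduceProgress() * 100);
        Thread.sleep(5000);                    // arbitrary 5 s polling interval
    }
    if (!rj.isSuccessful()) {
        System.err.println("Job failed");
    }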
JobClient: Determines proper division of input into InputSplits
Sends job data to master JobTracker server
Job Launch Process: JobTracker
JobTracker: Inserts jar and JobConf (serialized to XML) in shared location
Posts a JobInProgress to its run queue
Job Launch Process: TaskTracker
TaskTrackers running on slave nodes periodically query JobTracker for work
Retrieve job-specific jar and config
Launch task in a separate instance of Java
main() is provided by Hadoop
Job Launch Process: Task
TaskTracker.Child.main():
Sets up the child TaskInProgress attempt
Reads XML configuration
Connects back to necessary MapReduce components via RPC
Uses TaskRunner to launch user process
Job Launch Process: TaskRunner
TaskRunner, MapTaskRunner, MapRunner work in a daisy-chain to launch your Mapper
Task knows ahead of time which InputSplits it should be mapping
Calls Mapper once for each record retrieved from the InputSplit
Running the Reducer is much the same
Creating the Mapper
You provide the instance of Mapper
Should extend MapReduceBase
Implement interface Mapper<K1,V1,K2,V2>
One instance of your Mapper is initialized by the MapTaskRunner for a TaskInProgress
Exists in separate process from all other instances of Mapper – no data sharing!
Mapper
Override function – map()
void map(WritableComparable key,
Writable value,
OutputCollector output,
Reporter reporter)
Emit (k2,v2) with output.collect(k2, v2)
Example
public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}
What is Writable?
Hadoop defines its own “box” classes for strings (Text), integers (IntWritable), etc.
All values are instances of Writable
All keys are instances of WritableComparable
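To see what these box classes do, here is a hypothetical WritableComparable for a (word, position) pair. The class is invented for illustration; Text, IntWritable, etc. follow the same pattern:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.WritableComparable;

    public class WordPosWritable implements WritableComparable<WordPosWritable> {
        private String word = "";
        private int pos = 0;

        public void write(DataOutput out) throws IOException {
            out.writeUTF(word);   // serialize the fields in a fixed order
            out.writeInt(pos);
        }

        public void readFields(DataInput in) throws IOException {
            word = in.readUTF();  // deserialize in the same order
            pos = in.readInt();
        }

        public int compareTo(WordPosWritable o) {
            int c = word.compareTo(o.word);                   // sort by word first,
            if (c != 0) return c;
            return pos < o.pos ? -1 : (pos > o.pos ? 1 : 0);  // then by position
        }
    }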
Reading data
Data sets are specified by InputFormats
Defines input data (e.g., a directory)
Identifies partitions of the data that form an InputSplit
Factory for RecordReader objects to extract (k, v) records from the input source
FileInputFormat and friends
TextInputFormat – Treats each ‘\n’-terminated line of a file as a value
KeyValueTextInputFormat – Maps ‘\n’-terminated text lines of “k SEP v”
SequenceFileInputFormat – Binary file of (k, v) pairs with some add’l metadata
SequenceFileAsTextInputFormat – Same, but maps (k.toString(), v.toString())
Filtering File Inputs
FileInputFormat will read all files out of a specified directory and send them to the mapper
Delegates filtering this file list to a method subclasses may override
e.g., create your own “xyzFileInputFormat” to read *.xyz from the directory list (see the sketch below)
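A sketch of such a filter, assuming the protected listStatus(JobConf) hook of later old-API releases (earlier versions named it listPaths); the class name follows the slide:

    import java.io.IOException;
    import java.util.ArrayList;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.TextInputFormat;

    public class XyzFileInputFormat extends TextInputFormat {
        protected FileStatus[] listStatus(JobConf job) throws IOException {
            FileStatus[] all = super.listStatus(job);   // the unfiltered file list
            ArrayList<FileStatus> kept = new ArrayList<FileStatus>();
            for (FileStatus f : all) {
                if (f.getPath().getName().endsWith(".xyz")) {
                    kept.add(f);                        // keep only *.xyz files
                }
            }
            return kept.toArray(new FileStatus[kept.size()]);
        }
    }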
Record Readers
Without a RecordReader, Hadoop would be forced to divide input on byte boundaries.
Each InputFormat provides its own RecordReader implementation
Provides capability multiplexing
LineRecordReader – Reads a line from a text file
KeyValueRecordReader – Used by KeyValueTextInputFormat
Input Split Size
FileInputFormat will divide large files into chunks
Exact size controlled by mapred.min.split.size
RecordReaders receive file, offset, and length of chunk
Custom InputFormat implementations may override split size – e.g., “NeverChunkFile”
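A “NeverChunkFile” is a one-method sketch: report every file as non-splittable, so each file becomes exactly one InputSplit and a single map task sees it whole (the class name is illustrative; TextInputFormat supplies everything else):

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.TextInputFormat;

    public class NeverChunkFileInputFormat extends TextInputFormat {
        protected boolean isSplitable(FileSystem fs, Path file) {
            return false;   // never divide a file into chunks
        }
    }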
Sending Data To Reducers
Map function receives OutputCollector object
OutputCollector.collect() takes (k, v) elements
Any (WritableComparable, Writable) can be used
WritableComparator
Compares WritableComparable data
Will call WritableComparable.compareTo()
Can provide fast path for serialized data
Explicitly stated in JobConf setup
JobConf.setOutputValueGroupingComparator()
Sending Data To The Client
Reporter object sent to Mapper allows simple asynchronous feedback
incrCounter(Enum key, long amount)
setStatus(String msg)
Allows self-identification of input
InputSplit getInputSplit()
Partitioner
int getPartition(key, val, numPartitions)
Outputs the partition number for a given key
One partition == values sent to one Reduce task
HashPartitioner used by default
Uses key.hashCode() to return partition num
JobConf sets Partitioner implementation
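As an illustration, a hypothetical custom Partitioner (the class name and routing rule are invented): route keys by their first letter, so each reduce task receives an alphabetical range:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    public class FirstLetterPartitioner implements Partitioner<Text, IntWritable> {
        public void configure(JobConf job) { }   // nothing to configure

        public int getPartition(Text key, IntWritable value, int numPartitions) {
            String s = key.toString();
            if (s.length() == 0) return 0;
            char c = Character.toLowerCase(s.charAt(0));
            int bucket = (c < 'a' || c > 'z') ? 0 : (c - 'a');  // non-letters go to 0
            return (bucket * numPartitions) / 26;               // spread a..z evenly
        }
    }

    // Wire it in with: conf.setPartitionerClass(FirstLetterPartitioner.class);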
Reducer
reduce(WritableComparable key,
       Iterator values,
       OutputCollector output,
       Reporter reporter)
Keys & values sent to one partition all go to the same reduce task
Calls are sorted by key – “earlier” keys are reduced and output before “later” keys
Example
public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}
OutputFormat
Analogous to InputFormat
TextOutputFormat – Writes “key val\n” strings to output file
SequenceFileOutputFormat – Uses a binary format to pack (k, v) pairs
NullOutputFormat – Discards output
Presentation Overview
Recall Hadoop
Overview of the map-reduce paradigm
Elaboration on the WordCount example
components of Hadoop that make WordCount possible
Major new example: N-Gram Generator
step-by-step assembly of this map-reduce job
Design questions to ask when creating your own Hadoop jobs
Major example: N-Gram Generation
N-Gram is a common natural language processing technique (used by Google, etc.)
An N-Gram is a subsequence of N items in a given sequence (e.g., a subsequence of words in a given text).
Example 3-grams (from Google) with corresponding occurrences
ceramics collectables collectibles (55)
ceramics collected by (52)
ceramics collectibles cooking (45)
Understanding the process
Someone wise said, “A week of writing code saves an hour of research.”
Before embarking on developing a Hadoop job, walk through the process step by step manually and understand the flow and manipulation of data.
Once you can comfortably (and deterministically!) do it mentally, begin writing code.
Requirements
Input:
a beginning word/phrase
n-gram size (bigram, trigram, n-gram)
the minimum number of occurrences (frequency)
whether letter case matters
Output: all possible n-grams that occur sufficiently frequently.
High-level view of data flow
Given: one or more files containing regular text.
Look for the desired startword. If seen, take the next N-1 words and add the group to the database.
Similarly to word count, find the number of occurrences of each N-gram.
Remove those N-grams that do not occur frequently enough for our liking.
Follow along
The N-grams implementation exists and is ready for your perusal.
Grab it:
if you use Git revision control:
git clone git://git.qnan.org/pmw/hadoop-ngram
to get the files with your browser, go to:
http://www.qnan.org/~pmw/software/hadoop-ngram
We used Project Gutenberg ebooks as input.
Follow along
Start Hadoop
bin/start-all.sh
Grab the NGram code and build it:
Type “ant” and all will be built
Look at the README to see how to run it.
Load some text files into your HDFS
good source: http://www.gutenberg.org
Run it yourself (or see me do it) before we proceed.
Can we just use WordCount?
We have the WordCount example that does a similar thing. But there are differences:
We don't want to count the number of times our startword appears; we want to capture the subsequent words too.
A more subtle problem is that wordcount maps one line at a time. That's a problem if we want 3-grams with startword of “pillows” in the book containing this:
The guests stayed in the guest bedroom; the pillows were
delightfully soft and had a faint scent of mint.
Still, WordCount is a good foundation for our code.
Steps we must perform
Read our text in paragraphs rather than in discrete lines:
RecordReader
InputFormat
Develop the mapper and reducer classes:
first mapper: find startword, get the next N-1 words, and return <N-gram, 1>
first reducer: sum the number of occurrences of each N-gram
second mapper: no action
second reducer: discard N-grams that are too rare
Driver program
A new RecordReader
Ours must implement RecordReader<K, V>
Must contain certain functions: createKey(), createValue(), getPos(), getProgress(), next()
Hadoop offers a LineRecordReader but no support for paragraphs
We'll need a ParagraphRecordReader
Use the delegation pattern instead of extending LineRecordReader; we couldn't extend it because it has private members
Create new next() function
public synchronized boolean next(LongWritable key, Text value) throws IOException {
    Text linevalue = new Text();
    boolean appended, gotsomething;
    boolean retval;
    byte space[] = {' '};

    value.clear();
    gotsomething = false;
    do {
        appended = false;
        retval = lrr.next(key, linevalue);   // delegate to the wrapped LineRecordReader
        if (retval) {
            if (linevalue.toString().length() > 0) {
                byte[] rawline = linevalue.getBytes();
                int rawlinelen = linevalue.getLength();
                value.append(rawline, 0, rawlinelen);   // accumulate this line
                value.append(space, 0, 1);              // re-insert the separating space
                appended = true;
            }
            gotsomething = true;
        }
    } while (appended);   // a blank line (or end of input) ends the paragraph
    return gotsomething;
}
A new InputFormat
Given to the JobTracker during execution
getRecordReader method
This is why we need an InputFormat
Must return our ParagraphRecordReader
public class ParagraphInputFormat extends FileInputFormat<LongWritable, Text> implements JobConfigurable {
    private CompressionCodecFactory compressionCodecs = null;

    public void configure(JobConf conf) {
        compressionCodecs = new CompressionCodecFactory(conf);
    }

    protected boolean isSplitable(FileSystem fs, Path file) {
        return compressionCodecs.getCodec(file) == null;   // compressed files cannot be split
    }

    public RecordReader<LongWritable, Text> getRecordReader(InputSplit genericSplit, JobConf job,
                                                            Reporter reporter) throws IOException {
        reporter.setStatus(genericSplit.toString());
        return new ParagraphRecordReader(job, (FileSplit) genericSplit);
    }
}
First stage: “Find” Mapper
Define the startword at startup
Each time map is called we parse an entire paragraph and output matching N-Grams
Tell Reporter how far done we are to track progress
Output <N-Gram, 1> like WordCount
output.collect(ngram, new IntWritable(1));
This last part is important... next slide explains.
Importance of “output.collect()”
Remember Hadoop's data type model:
map: (K1, V1) → list(K2, V2)
This means that for every single (K1, V1) tuple, the map stage can output zero, one, two, or any other number of tuples, and they don't have to match the input at all.
Example:
output.collect(ngram, new IntWritable(1));
output.collect(new Text("good-ol'-" + ngram), new IntWritable(0));
Find Mapper
Our mapper must have a configure() method
We can pass primitives through JobConf
public void configure(JobConf conf) {
    desiredPhrase = conf.get("mapper.desired-phrase");
    Nvalue = conf.getInt("mapper.N-value", 3);
    caseSensitive = conf.getBoolean("mapper.case-sensitive", false);
}
“Find” Reducer
Like WordCount example
Sum all the numbers matching our N-Gram
Output <N-Gram, # of occurrences>
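The real class is in the repository; a sketch consistent with the class names used in the driver (see the Find JobConf slide) is essentially the WordCount reducer:

    public static class FindJob_ReduceClass extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output,
                           Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();   // total occurrences of this N-gram
            }
            output.collect(key, new IntWritable(sum));
        }
    }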
Second stage: “Prune” Mapper
Parse line from previous output and divide into Key/Value pairs
“Prune” Reducer
This way we can sort our elements by frequency
If this N-Gram occurs fewer times than our minimum, trim it out
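A sketch of that reducer, reading the threshold the driver stores under "reducer.min-freq" (see the Prune JobConf slide); the field name is an assumption:

    public static class PruneJob_ReduceClass extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        private int minFreq;

        public void configure(JobConf conf) {
            minFreq = conf.getInt("reducer.min-freq", 1);   // set by the driver
        }

        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output,
                           Reporter reporter) throws IOException {
            while (values.hasNext()) {
                int count = values.next().get();
                if (count >= minFreq) {                     // drop rare N-grams
                    output.collect(key, new IntWritable(count));
                }
            }
        }
    }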
Piping data between M/R jobs
How does the “Find” map/reduce job pass its results to the “Prune” map/reduce job?
I create a temporary file within HDFS. This temporary file is used as the output of Find and the input of Prune.
At the end, I delete the temporary file.
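The hand-off, sketched with the variable names from the JobConf slides (the temp-dir naming scheme is an assumption):

    // A scratch path in HDFS for the intermediate <N-gram, count> pairs
    Path tempDir = new Path("ngram-temp-" + new Random().nextInt(Integer.MAX_VALUE));

    FileOutputFormat.setOutputPath(ngram_find_conf, tempDir);   // Find writes here
    FileInputFormat.setInputPaths(ngram_prune_conf, tempDir);   // Prune reads it

    JobClient.runJob(ngram_find_conf);
    JobClient.runJob(ngram_prune_conf);

    FileSystem.get(ngram_prune_conf).delete(tempDir, true);     // clean up at the end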
Counters
The N-Gram generator has one programmer-defined counter: the number of partial/incomplete N-grams. These occur when a paragraph ends before we can read N-1 subsequent words.
We can add as many counters as we want.
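A counter is just an enum plus the Reporter.incrCounter() call shown on the Reporter slide; the enum name here is an assumption:

    static enum NGramCounters { PARTIAL_NGRAMS }

    // In the Find mapper, when the paragraph ends before N-1 more words exist:
    reporter.incrCounter(NGramCounters.PARTIAL_NGRAMS, 1);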
JobConf
We need to set everything up
2 jobs executing in series: Find and Prune
User inputs parameters
Starting N-Gram word/phrase
N-Gram size
Minimum frequency for pruning
JobConf ngram_find_conf = new JobConf(getConf(), NGram.class),
        ngram_prune_conf = new JobConf(getConf(), NGram.class);
Find JobConf
Now we can plug everything in:
Also pass input parameters
And point to our input and output files
ngram_find_conf.setJobName("ngram-find");
ngram_find_conf.setInputFormat(ParagraphInputFormat.class);
ngram_find_conf.setOutputKeyClass(Text.class);
ngram_find_conf.setOutputValueClass(IntWritable.class);
ngram_find_conf.setMapperClass(FindJob_MapClass.class);
ngram_find_conf.setReducerClass(FindJob_ReduceClass.class);

ngram_find_conf.set("mapper.desired-phrase", other_args.get(2));
ngram_find_conf.setInt("mapper.N-value", Integer.parseInt(other_args.get(3)));
ngram_find_conf.setBoolean("mapper.case-sensitive", caseSensitive);

FileInputFormat.setInputPaths(ngram_find_conf, other_args.get(0));
FileOutputFormat.setOutputPath(ngram_find_conf, tempDir);
Prune JobConf
Perform set up as before
We need to point our inputs to the outputs of the previous job
ngram_prune_conf.setJobName("ngram-prune");
ngram_prune_conf.setInt("reducer.min-freq", min_freq);
ngram_prune_conf.setOutputKeyClass(Text.class);
ngram_prune_conf.setOutputValueClass(IntWritable.class);
ngram_prune_conf.setMapperClass(PruneJob_MapClass.class);
ngram_prune_conf.setReducerClass(PruneJob_ReduceClass.class);

FileInputFormat.setInputPaths(ngram_prune_conf, tempDir);
FileOutputFormat.setOutputPath(ngram_prune_conf, new Path(other_args.get(1)));
Execute Jobs
Run as blocking process with runJob
Batch processing is done in series:

JobClient.runJob(ngram_find_conf);
JobClient.runJob(ngram_prune_conf);
Design questions to ask
From where will my input come?
InputFormat (e.g., FileInputFormat)
How is my input structured?
RecordReader
(There are already several common InputFormats and RecordReaders. Don't reinvent the wheel.)
Mapper and Reducer classes
Do Key (WritableComparator) and Value (Writable) classes exist?
Design questions to ask
Do I need to count anything while the job is in progress?
Where is my output going?
Executor class
What information do my map/reduce classes need? Must I block, waiting for job completion? Set FileFormat?