Creating Map-Reduce Programs Using Hadoop
Presentation Overview
Recall Hadoop
Overview of the map-reduce paradigm
Elaboration on the WordCount example
components of Hadoop that make WordCount possible
Major new example: N-Gram Generator
step-by-step assembly of this map-reduce job
Design questions to ask when creating your own Hadoop jobs
Recall why Hadoop rocks
Hadoop is:
Free and open source
High quality, like all Apache Foundation projects
Cross-platform (pure Java)
Fault-tolerant
Highly scalable
Has bindings for non-Java programming languages
Applicable to many computational problems
Map-Reduce System Overview
JobTracker – Makes scheduling decisions
TaskTracker – Manages tasks for a given node
Task process
Runs an individual map or reduce fragment for a given job
Forks from the TaskTracker
Map-Reduce System Overview
Processes communicate by custom RPC implementation
Easy to change/extend
Defined as Java interfaces
Server objects implement the interface
Client proxy objects automatically created
All messages originate at the client (e.g., Task to TaskTracker)
Prevents cycles and therefore deadlocks
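The proxy mechanism is essentially Java dynamic proxies. A self-contained illustration of the idea (this is not Hadoop's actual RPC code; the interface and handler here are invented):

    import java.lang.reflect.InvocationHandler;
    import java.lang.reflect.Method;
    import java.lang.reflect.Proxy;

    // A protocol is just a Java interface, like Hadoop's real ones.
    interface PingProtocol {
        String ping(String from);
    }

    public class RpcSketch {
        public static void main(String[] args) {
            // The client proxy is generated automatically; each method call
            // becomes a message to the server-side object.
            PingProtocol client = (PingProtocol) Proxy.newProxyInstance(
                PingProtocol.class.getClassLoader(),
                new Class<?>[] { PingProtocol.class },
                new InvocationHandler() {
                    public Object invoke(Object proxy, Method m, Object[] a) {
                        // A real implementation would serialize the method name
                        // and arguments, send them over a socket, and block on
                        // the reply; here we fake the round trip.
                        return "pong for " + a[0];
                    }
                });
            System.out.println(client.ping("task-0001"));  // prints "pong for task-0001"
        }
    }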
Process Flow Diagram
Application Overview
Launching Program
Creates a JobConf to define a job.
Submits JobConf to JobTracker and waits for completion.
Mapper
Is given a stream of (key1, value1) pairs
Generates a stream of (key2, value2) pairs
Reducer
Is given a key2 and a stream of value2's
Generates a stream of (key3, value3) pairs
Job Launch Process: Client
Client program creates a JobConf
Identify classes implementing Mapper and Reducer interfaces
JobConf.setMapperClass(); JobConf.setReducerClass()
Specify input and output formats
JobConf.setInputFormat(TextInputFormat.class);
JobConf.setOutputFormat(TextOutputFormat.class);
Other options too:
JobConf.setNumReduceTasks()
JobConf.setOutputFormat()
Many, many more (Facade pattern)
An onslaught of terminology
We'll explain these terms, each of which plays a role in any non-trivial map/reduce job:
InputFormat, OutputFormat, FileInputFormat, ...
JobClient and JobConf
JobTracker and TaskTracker
TaskRunner, MapTaskRunner, MapRunner, …
InputSplit, RecordReader, LineRecordReader, ...
Writable, WritableComparable, IntWritable, ...
InputFormat and OutputFormat
The application also chooses input and output formats, which define how the persistent data is read and written. These are interfaces and can be defined by the application.
InputFormat
Splits the input to determine the input to each map task.
Defines a RecordReader that reads key, value pairs that are passed to the map task
OutputFormat
Given the key, value pairs and a filename, writes the reduce task output to persistent store.
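For reference, the two interfaces have roughly this shape in the old org.apache.hadoop.mapred API (an abbreviated sketch; consult the Javadoc for the exact signatures):

    public interface InputFormat<K, V> {
        // Splits the input; one map task per InputSplit
        InputSplit[] getSplits(JobConf job, int numSplits) throws IOException;
        // Factory for the RecordReader that feeds (key, value) pairs to the map task
        RecordReader<K, V> getRecordReader(InputSplit split, JobConf job,
                                           Reporter reporter) throws IOException;
    }

    public interface OutputFormat<K, V> {
        // Factory for the RecordWriter that persists the reduce output
        RecordWriter<K, V> getRecordWriter(FileSystem fs, JobConf job,
                                           String name, Progressable progress)
            throws IOException;
        // Checks the output specification (e.g., that the output dir does not already exist)
        void checkOutputSpecs(FileSystem fs, JobConf job) throws IOException;
    }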
Example

public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
}
Job Launch Process: JobClient
Pass JobConf to JobClient.runJob() or JobClient.submitJob()
runJob() blocks – waits until the job finishes
submitJob() does not
Poll for status to make running decisions (a polling sketch follows)
Avoid polling with JobConf.setJobEndNotificationURI()
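A minimal sketch of the non-blocking path, using the RunningJob handle that submitJob() returns (fragment only; the polling interval and print format are arbitrary choices, and Thread.sleep() must be handled by the caller):

    JobClient jc = new JobClient(conf);
    RunningJob rj = jc.submitJob(conf);        // returns immediately
    while (!rj.isComplete()) {                 // poll for status
        System.out.printf("map %.0f%%, reduce %.0f%%%n",
                          rj.mapProgress() * 100, rj.reduceProgress() * 100);
        Thread.sleep(5000);                    // arbitrary 5 s polling interval
    }
    if (!rj.isSuccessful()) {
        System.err.println("Job failed");
    }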
JobClient: Determines proper division of input into InputSplits
Sends job data to master JobTracker server
Job Launch Process: JobTracker
JobTracker: Inserts jar and JobConf (serialized to XML) in shared location
Posts a JobInProgress to its run queue
Job Launch Process: TaskTracker
TaskTrackers running on slave nodes periodically query JobTracker for work
Retrieve job-specific jar and config
Launch task in a separate instance of Java
main() is provided by Hadoop
Job Launch Process: Task
TaskTracker.Child.main():
Sets up the child TaskInProgress attempt
Reads XML configuration
Connects back to necessary MapReduce components via RPC
Uses TaskRunner to launch user process
Job Launch Process: TaskRunner
TaskRunner, MapTaskRunner, MapRunner work in a daisy-chain to launch your Mapper
Task knows ahead of time which InputSplits it should be mapping
Calls Mapper once for each record retrieved from the InputSplit
Running the Reducer is much the same
Creating the Mapper
You provide the instance of Mapper
Should extend MapReduceBase
Implement interface Mapper<K1,V1,K2,V2>
One instance of your Mapper is initialized by the MapTaskRunner for a TaskInProgress
Exists in separate process from all other instances of Mapper – no data sharing!
Mapper
Override function – map()
void map(WritableComparable key,
Writable value,
OutputCollector output,
Reporter reporter)
Emit (k2,v2) with output.collect(k2, v2)
Example
public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}
What is Writable?
Hadoop defines its own “box” classes for strings (Text), integers (IntWritable), etc.
All values are instances of Writable
All keys are instances of WritableComparable
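To see what these box classes do, here is a hypothetical WritableComparable for a (word, position) pair. The class is invented for illustration; Text, IntWritable, etc. follow the same pattern:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.WritableComparable;

    public class WordPosWritable implements WritableComparable<WordPosWritable> {
        private String word = "";
        private int pos = 0;

        public void write(DataOutput out) throws IOException {
            out.writeUTF(word);   // serialize the fields in a fixed order
            out.writeInt(pos);
        }

        public void readFields(DataInput in) throws IOException {
            word = in.readUTF();  // deserialize in the same order
            pos = in.readInt();
        }

        public int compareTo(WordPosWritable o) {
            int c = word.compareTo(o.word);                   // sort by word first,
            if (c != 0) return c;
            return pos < o.pos ? -1 : (pos > o.pos ? 1 : 0);  // then by position
        }
    }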
Reading data
Data sets are specified by InputFormats
Defines input data (e.g., a directory)
Identifies partitions of the data that form an InputSplit
Factory for RecordReader objects to extract (k, v) records from the input source
FileInputFormat and friends
TextInputFormat – Treats each ‘\n’-terminated line of a file as a value
KeyValueTextInputFormat – Maps ‘\n’-terminated text lines of “k SEP v”
SequenceFileInputFormat – Binary file of (k, v) pairs with some add’l metadata
SequenceFileAsTextInputFormat – Same, but maps (k.toString(), v.toString())
Filtering File Inputs
FileInputFormat will read all files out of a specified directory and send them to the mapper
Delegates filtering this file list to a method subclasses may override
e.g., create your own “xyzFileInputFormat” to read *.xyz from the directory list (see the sketch below)
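A sketch of such a filter, assuming the protected listStatus(JobConf) hook of later old-API releases (earlier versions named it listPaths); the class name follows the slide:

    import java.io.IOException;
    import java.util.ArrayList;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.TextInputFormat;

    public class XyzFileInputFormat extends TextInputFormat {
        protected FileStatus[] listStatus(JobConf job) throws IOException {
            FileStatus[] all = super.listStatus(job);   // the unfiltered file list
            ArrayList<FileStatus> kept = new ArrayList<FileStatus>();
            for (FileStatus f : all) {
                if (f.getPath().getName().endsWith(".xyz")) {
                    kept.add(f);                        // keep only *.xyz files
                }
            }
            return kept.toArray(new FileStatus[kept.size()]);
        }
    }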
Record Readers
Without a RecordReader, Hadoop would be forced to divide input on byte boundaries.
Each InputFormat provides its own RecordReader implementation
Provides capability multiplexing
LineRecordReader – Reads a line from a text file
KeyValueRecordReader – Used by KeyValueTextInputFormat
Input Split Size
FileInputFormat will divide large files into chunks
Exact size controlled by mapred.min.split.size
RecordReaders receive file, offset, and length of chunk
Custom InputFormat implementations may override split size – e.g., “NeverChunkFile”
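A “NeverChunkFile” is a one-method sketch: report every file as non-splittable, so each file becomes exactly one InputSplit and a single map task sees it whole (the class name is illustrative; TextInputFormat supplies everything else):

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.TextInputFormat;

    public class NeverChunkFileInputFormat extends TextInputFormat {
        protected boolean isSplitable(FileSystem fs, Path file) {
            return false;   // never divide a file into chunks
        }
    }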
Sending Data To Reducers
Map function receives OutputCollector object
OutputCollector.collect() takes (k, v) elements
Any (WritableComparable, Writable) can be used
WritableComparator
Compares WritableComparable data
Will call WritableComparable.compareTo()
Can provide fast path for serialized data
Explicitly stated in JobConf setup
JobConf.setOutputValueGroupingComparator()
Sending Data To The Client
Reporter object sent to Mapper allows simple asynchronous feedback
incrCounter(Enum key, long amount)
setStatus(String msg)
Allows self-identification of input
InputSplit getInputSplit()
Partitioner
int getPartition(key, val, numPartitions)
Outputs the partition number for a given key
One partition == values sent to one Reduce task
HashPartitioner used by default
Uses key.hashCode() to return partition num
JobConf sets Partitioner implementation
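As an illustration, a hypothetical custom Partitioner (the class name and routing rule are invented): route keys by their first letter, so each reduce task receives an alphabetical range:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    public class FirstLetterPartitioner implements Partitioner<Text, IntWritable> {
        public void configure(JobConf job) { }   // nothing to configure

        public int getPartition(Text key, IntWritable value, int numPartitions) {
            String s = key.toString();
            if (s.length() == 0) return 0;
            char c = Character.toLowerCase(s.charAt(0));
            int bucket = (c < 'a' || c > 'z') ? 0 : (c - 'a');  // non-letters go to 0
            return (bucket * numPartitions) / 26;               // spread a..z evenly
        }
    }

    // Wire it in with: conf.setPartitionerClass(FirstLetterPartitioner.class);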
Reducer
reduce(WritableComparable key,
       Iterator values,
       OutputCollector output,
       Reporter reporter)
Keys & values sent to one partition all go to the same reduce task
Calls are sorted by key – “earlier” keys are reduced and output before “later” keys
Example
public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}
OutputFormat
Analogous to InputFormat
TextOutputFormat – Writes “key val\n” strings to output file
SequenceFileOutputFormat – Uses a binary format to pack (k, v) pairs
NullOutputFormat – Discards output
Presentation Overview
Recall Hadoop
Overview of the map-reduce paradigm
Elaboration on the WordCount example
components of Hadoop that make WordCount possible
Major new example: N-Gram Generator
step-by-step assembly of this map-reduce job
Design questions to ask when creating your own Hadoop jobs
Major example: N-Gram Generation
N-Gram is a common natural language processing technique (used by Google, etc.)
An N-Gram is a subsequence of N items in a given sequence (e.g., a subsequence of words in a given text).
Example 3-grams (from Google) with corresponding occurrences
ceramics collectables collectibles (55)
ceramics collected by (52)
ceramics collectibles cooking (45)
Understanding the process
Someone wise said, “A week of writing code saves an hour of research.”
Before embarking on developing a Hadoop job, walk through the process step by step manually and understand the flow and manipulation of data.
Once you can comfortably (and deterministically!) do it mentally, begin writing code.
Requirements
Input:
a beginning word/phrase
n-gram size (bigram, trigram, n-gram)
the minimum number of occurrences (frequency)
whether letter case matters
Output: all possible n-grams that occur sufficiently frequently.
High-level view of data flow
Given: one or more files containing regular text.
Look for the desired startword. If seen, take the next N-1 words and add the group to the database.
Similarly to word count, find the number of occurrences of each N-gram.
Remove those N-grams that do not occur frequently enough for our liking.
Follow along
The N-grams implementation exists and is ready for your perusal.
Grab it:
if you use Git revision control:
git clone git://git.qnan.org/pmw/hadoop-ngram
to get the files with your browser, go to:
http://www.qnan.org/~pmw/software/hadoop-ngram
We used Project Gutenberg ebooks as input.
Follow along
Start Hadoop
bin/start-all.sh
Grab the NGram code and build it:
Type “ant” and all will be built
Look at the README to see how to run it.
Load some text files into your HDFS
good source: http://www.gutenberg.org
Run it yourself (or see me do it) before we proceed.
Can we just use WordCount?
We have the WordCount example that does a similar thing. But there are differences:
We don't want to count the number of times our startword appears; we want to capture the subsequent words too.
A more subtle problem is that wordcount maps one line at a time. That's a problem if we want 3-grams with startword of “pillows” in the book containing this:
The guests stayed in the guest bedroom; the pillows were
delightfully soft and had a faint scent of mint.
Still, WordCount is a good foundation for our code.
Steps we must perform
Read our text in paragraphs rather than in discrete lines:
RecordReader
InputFormat
Develop the mapper and reducer classes:
first mapper: find startword, get the next N-1 words, and return <N-gram, 1>
first reducer: sum the number of occurrences of each N-gram
second mapper: no action
second reducer: discard N-grams that are too rare
Driver program
A new RecordReader
Ours must implement RecordReader<K, V>
Must contain certain functions: createKey(), createValue(), getPos(), getProgress(), next()
Hadoop offers a LineRecordReader but no support for paragraphs
We'll need a ParagraphRecordReader
Use the delegation pattern instead of extending LineRecordReader; we couldn't extend it because it has private members
Create new next() function
public synchronized boolean next(LongWritable key, Text value) throws IOException {
    Text linevalue = new Text();
    boolean appended, gotsomething;
    boolean retval;
    byte space[] = {' '};

    value.clear();
    gotsomething = false;
    do {
        appended = false;
        retval = lrr.next(key, linevalue);   // delegate to the wrapped LineRecordReader
        if (retval) {
            if (linevalue.toString().length() > 0) {
                byte[] rawline = linevalue.getBytes();
                int rawlinelen = linevalue.getLength();
                value.append(rawline, 0, rawlinelen);   // accumulate this line
                value.append(space, 0, 1);              // re-insert the separating space
                appended = true;
            }
            gotsomething = true;
        }
    } while (appended);   // a blank line (or end of input) ends the paragraph
    return gotsomething;
}
A new InputFormat
Given to the JobTracker during execution
getRecordReader method
This is why we need an InputFormat
Must return our ParagraphRecordReader
public class ParagraphInputFormat extends FileInputFormat<LongWritable, Text> implements JobConfigurable {
    private CompressionCodecFactory compressionCodecs = null;

    public void configure(JobConf conf) {
        compressionCodecs = new CompressionCodecFactory(conf);
    }

    protected boolean isSplitable(FileSystem fs, Path file) {
        return compressionCodecs.getCodec(file) == null;   // compressed files cannot be split
    }

    public RecordReader<LongWritable, Text> getRecordReader(InputSplit genericSplit, JobConf job,
                                                            Reporter reporter) throws IOException {
        reporter.setStatus(genericSplit.toString());
        return new ParagraphRecordReader(job, (FileSplit) genericSplit);
    }
}
First stage: “Find” Mapper
Define the startword at startup
Each time map is called we parse an entire paragraph and output matching N-Grams
Tell Reporter how far done we are to track progress
Output <N-Gram, 1> like WordCount
output.collect(ngram, new IntWritable(1));
This last part is important... next slide explains.
Importance of “output.collect()”
Remember Hadoop's data type model:
map: (K1, V1) → list(K2, V2)
This means that for every single (K1, V1) tuple, the map stage can output zero, one, two, or any other number of tuples, and they don't have to match the input at all.
Example:
output.collect(ngram, new IntWritable(1));
output.collect(new Text("good-ol'-" + ngram), new IntWritable(0));
Find Mapper
Our mapper must have a configure() method
We can pass primitives through JobConf
public void configure(JobConf conf) {
    desiredPhrase = conf.get("mapper.desired-phrase");
    Nvalue = conf.getInt("mapper.N-value", 3);
    caseSensitive = conf.getBoolean("mapper.case-sensitive", false);
}
“Find” Reducer
Like WordCount example
Sum all the numbers matching our N-Gram
Output <N-Gram, # of occurrences>
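The real class is in the repository; a sketch consistent with the class names used in the driver (see the Find JobConf slide) is essentially the WordCount reducer:

    public static class FindJob_ReduceClass extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output,
                           Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();   // total occurrences of this N-gram
            }
            output.collect(key, new IntWritable(sum));
        }
    }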
Second stage: “Prune” Mapper
Parse line from previous output and divide into Key/Value pairs
“Prune” Reducer
This way we can sort our elements by frequency
If this N-Gram occurs fewer times than our minimum, trim it out
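A sketch of that reducer, reading the threshold the driver stores under "reducer.min-freq" (see the Prune JobConf slide); the field name is an assumption:

    public static class PruneJob_ReduceClass extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        private int minFreq;

        public void configure(JobConf conf) {
            minFreq = conf.getInt("reducer.min-freq", 1);   // set by the driver
        }

        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output,
                           Reporter reporter) throws IOException {
            while (values.hasNext()) {
                int count = values.next().get();
                if (count >= minFreq) {                     // drop rare N-grams
                    output.collect(key, new IntWritable(count));
                }
            }
        }
    }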
Piping data between M/R jobs
How does the “Find” map/reduce job pass its results to the “Prune” map/reduce job?
I create a temporary file within HDFS. This temporary file is used as the output of Find and the input of Prune.
At the end, I delete the temporary file.
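The hand-off, sketched with the variable names from the JobConf slides (the temp-dir naming scheme is an assumption):

    // A scratch path in HDFS for the intermediate <N-gram, count> pairs
    Path tempDir = new Path("ngram-temp-" + new Random().nextInt(Integer.MAX_VALUE));

    FileOutputFormat.setOutputPath(ngram_find_conf, tempDir);   // Find writes here
    FileInputFormat.setInputPaths(ngram_prune_conf, tempDir);   // Prune reads it

    JobClient.runJob(ngram_find_conf);
    JobClient.runJob(ngram_prune_conf);

    FileSystem.get(ngram_prune_conf).delete(tempDir, true);     // clean up at the end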
Counters
The N-Gram generator has one programmer-defined counter: the number of partial/incomplete N-grams. These occur when a paragraph ends before we can read N-1 subsequent words.
We can add as many counters as we want.
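A counter is just an enum plus the Reporter.incrCounter() call shown on the Reporter slide; the enum name here is an assumption:

    static enum NGramCounters { PARTIAL_NGRAMS }

    // In the Find mapper, when the paragraph ends before N-1 more words exist:
    reporter.incrCounter(NGramCounters.PARTIAL_NGRAMS, 1);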
JobConf
We need to set everything up
2 jobs executing in series: Find and Prune
User inputs parameters
Starting N-Gram word/phrase
N-Gram size
Minimum frequency for pruning
JobConf ngram_find_conf = new JobConf(getConf(), NGram.class),
        ngram_prune_conf = new JobConf(getConf(), NGram.class);
Find JobConf
Now we can plug everything in:
Also pass input parameters
And point to our input and output files
ngram_find_conf.setJobName("ngram-find");
ngram_find_conf.setInputFormat(ParagraphInputFormat.class);
ngram_find_conf.setOutputKeyClass(Text.class);
ngram_find_conf.setOutputValueClass(IntWritable.class);
ngram_find_conf.setMapperClass(FindJob_MapClass.class);
ngram_find_conf.setReducerClass(FindJob_ReduceClass.class);

ngram_find_conf.set("mapper.desired-phrase", other_args.get(2));
ngram_find_conf.setInt("mapper.N-value", Integer.parseInt(other_args.get(3)));
ngram_find_conf.setBoolean("mapper.case-sensitive", caseSensitive);

FileInputFormat.setInputPaths(ngram_find_conf, other_args.get(0));
FileOutputFormat.setOutputPath(ngram_find_conf, tempDir);
Prune JobConf
Perform set up as before
We need to point our inputs to the outputs of the previous job
ngram_prune_conf.setJobName("ngram-prune");
ngram_prune_conf.setInt("reducer.min-freq", min_freq);
ngram_prune_conf.setOutputKeyClass(Text.class);
ngram_prune_conf.setOutputValueClass(IntWritable.class);
ngram_prune_conf.setMapperClass(PruneJob_MapClass.class);
ngram_prune_conf.setReducerClass(PruneJob_ReduceClass.class);

FileInputFormat.setInputPaths(ngram_prune_conf, tempDir);
FileOutputFormat.setOutputPath(ngram_prune_conf, new Path(other_args.get(1)));
Execute Jobs
Run as blocking process with runJob
Batch processing is done in series:

JobClient.runJob(ngram_find_conf);
JobClient.runJob(ngram_prune_conf);
Design questions to ask
From where will my input come?
InputFormat (e.g., FileInputFormat)
How is my input structured?
RecordReader
(There are already several common InputFormats and RecordReaders. Don't reinvent the wheel.)
Mapper and Reducer classes
Do Key (WritableComparator) and Value (Writable) classes exist?
Design questions to ask
Do I need to count anything while the job is in progress?
Where is my output going?
Executor class
What information do my map/reduce classes need? Must I block, waiting for job completion? Set FileFormat?