
MapReduce programming with Apache Hadoop
Process massive data sets in parallel on large clusters

By Ravi Shankar and Govindu Narendra, JavaWorld.com, 09/23/08

Google and its MapReduce framework may rule the roost when it comes to massive-scale data processing, but there's still plenty of that goodness to go around. This article gets you started with Hadoop, the open source MapReduce implementation for processing large data sets. Authors Ravi Shankar and Govindu Narendra first demonstrate the powerful combination of map and reduce in a simple Java program, then walk you through a more complex data-processing application based on Hadoop. Finally, they show you how to install and deploy your application in both standalone mode and clustering mode.

Are you amazed by the fast response you get while searching the Web with Google or Yahoo? Have you ever wondered how these services manage to search millions of pages and return your results in milliseconds or less? The algorithms that drive both of these major-league search services originated with Google's MapReduce framework. While MapReduce is proprietary technology, the Apache Foundation has implemented its own open source map-reduce framework, called Hadoop. Hadoop is used by Yahoo and many other services whose success is based on processing massive amounts of data. In this article we'll help you discover whether it might also be a good solution for your distributed data processing needs.

We'll start with an overview of MapReduce, followed by a couple of Java programs that demonstrate the simplicity and power of the framework. We'll then introduce you to Hadoop's MapReduce implementation and walk through a complex application that searches a huge log file for a specific string. Finally, we'll show you how to install Hadoop in a Microsoft Windows environment and deploy the application -- first as a standalone application and then in clustering mode.

You won't be an expert in all things Hadoop when you're done reading this article, but you will have enough material to explore and possibly implement Hadoop for your own large-scale data-processing requirements.

About MapReduce

MapReduce is a programming model specifically implemented for processing large data sets. The model was developed by Jeffrey Dean and Sanjay Ghemawat at Google (see "MapReduce: Simplified data processing on large clusters"). At its core, MapReduce is a combination of two functions -- map() and reduce(), as its name would suggest.


A quick look at a sample Java program will help you get your bearings in MapReduce. This application implements a very simple version of the MapReduce framework, but isn't built on Hadoop. The simple, abstracted program will illustrate the core parts of the MapReduce framework and the terminology associated with it. The application creates some strings, counts the number of characters in each string, and finally sums them up to show the total number of characters altogether. Listing 1 contains the program's Main class.

Listing 1. Main class for a simple MapReduce Java app

public class Main {

    public static void main(String[] args) {
        MyMapReduce my = new MyMapReduce();
        my.init();
    }
}

Listing 1 just instantiates a class called MyMapReduce, which is shown in Listing 2.

Listing 2. MyMapReduce.java

import java.util.*;

public class MyMapReduce

...

Download complete Listing 2

As you see, the crux of the class lies in just four functions:

The init() method creates some dummy data (just 30 strings). This data serves as the input data for the program. Note that in the real world, this input could be gigabytes, terabytes, or petabytes of data!

The step1ConvertIntoBuckets() method segments the input data. In this example, the data is divided into five smaller chunks and put inside an ArrayList named buckets. You can see that the method takes a list containing all of the input data and an int value, numberOfBuckets. This value has been hardcoded to five; if you divide 30 strings into five buckets, each bucket will hold six strings. Each bucket in turn is represented as an ArrayList. These array lists are finally put into another list and returned. So, at the end of the function, you have an array list with five buckets (array lists) of six strings each.


These buckets can be put in memory (as in this case), saved to disk, or put onto different nodes in a cluster!

step2RunMapFunctionForAllBuckets() is the next method invoked from init(). This method internally creates five threads (because there are five buckets -- the idea is to start one thread per bucket). The class responsible for threading is StartThread, which is implemented as an inner class. Each thread processes its bucket and puts the individual result in another array list named intermediateresults. All the computation and threading take place within the same JVM, and the whole process runs on a single machine.

If the buckets were on different machines, a master would have to monitor them to know when the computation is over, whether any of the nodes has failed, and so on. Ideally, the master would have the computation performed on the nodes where the data resides, rather than bringing the data from all nodes to the master itself and executing it there.

The step3RunReduceFunctionForAllBuckets() method collates the results from intermediateresults, sums them up, and gives you the final output.

Note that intermediateresults needs to combine the results from the parallel processing explained in the previous bullet point. The exciting part is that this process also can happen concurrently!
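
The full MyMapReduce class is only available as a download, so here is a minimal sketch of the shape described above. The class and method names follow the article; the bodies are our illustration, not the downloadable listing, and the article's StartThread inner class is replaced by an anonymous Thread for brevity:

import java.util.ArrayList;
import java.util.List;

public class MyMapReduce {

    private final List<List<String>> intermediateresults = new ArrayList<List<String>>();

    public void init() {
        List<String> all = new ArrayList<String>();
        for (int i = 0; i < 30; i++) {                 // 30 dummy input strings
            all.add("dummy string " + i);
        }
        List<List<String>> buckets = step1ConvertIntoBuckets(all, 5);
        step2RunMapFunctionForAllBuckets(buckets);
        step3RunReduceFunctionForAllBuckets();
    }

    // Split the input into numberOfBuckets smaller lists.
    private List<List<String>> step1ConvertIntoBuckets(List<String> list, int numberOfBuckets) {
        int bucketSize = list.size() / numberOfBuckets;   // 30 / 5 = 6 strings per bucket
        List<List<String>> buckets = new ArrayList<List<String>>();
        for (int i = 0; i < numberOfBuckets; i++) {
            buckets.add(new ArrayList<String>(list.subList(i * bucketSize, (i + 1) * bucketSize)));
        }
        return buckets;
    }

    // One thread per bucket: "map" each string to its character count.
    private void step2RunMapFunctionForAllBuckets(List<List<String>> buckets) {
        List<Thread> threads = new ArrayList<Thread>();
        for (final List<String> bucket : buckets) {
            final List<String> result = new ArrayList<String>();
            intermediateresults.add(result);
            Thread t = new Thread() {
                public void run() {
                    for (String s : bucket) {
                        result.add(String.valueOf(s.length()));
                    }
                }
            };
            threads.add(t);
            t.start();
        }
        for (Thread t : threads) {
            try { t.join(); } catch (InterruptedException ignored) { }
        }
    }

    // "Reduce": collate the intermediate results into a single total.
    private void step3RunReduceFunctionForAllBuckets() {
        int total = 0;
        for (List<String> result : intermediateresults) {
            for (String count : result) {
                total += Integer.parseInt(count);
            }
        }
        System.out.println("Total number of characters: " + total);
    }
}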

A more complicated scenario

Processing 30 input elements doesn't really make for an interesting scenario. Imagine instead that there are 100,000 elements of data to be processed. The task at hand is to search for the total number of occurrences of the word JavaWorld. The data may be structured or unstructured. Here's how you'd approach it:

Assume that, in some way, the data is divided into smaller chunks and is inserted into buckets. You have a total of 10 buckets now, with 10,000 elements of data within each of them. (Don't bother worrying about who exactly does the dividing at the moment.)

Apply a function named map(), which in turn executes your search algorithm on a single bucket and repeats it concurrently for all the buckets in parallel, storing the result (of processing of each bucket) in another set of buckets, called result buckets. Note that there may be more than one result bucket.

Apply a function named reduce() on each of these result buckets. This function iterates through the result buckets, takes in each value, and then performs some kind of processing, if needed. The processing may either aggregate the individual values or apply some kind of business logic on the aggregated or individual values. This functionality once again takes place concurrently.

Finally, you will get the result you expected.
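
To make those four steps concrete, here is a hedged sketch in plain Java, using an ExecutorService to stand in for the parallel map phase. All names and the dummy data are illustrative only; a real MapReduce framework would distribute the buckets across machines rather than across threads:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class JavaWorldCounter {

    public static void main(String[] args) throws Exception {
        // Step 1: assume the 100,000 elements have already been divided into 10 buckets.
        List<List<String>> buckets = createBuckets();

        // Step 2: map() -- search each bucket in parallel; each Future is a "result bucket".
        ExecutorService pool = Executors.newFixedThreadPool(buckets.size());
        List<Future<Integer>> resultBuckets = new ArrayList<Future<Integer>>();
        for (final List<String> bucket : buckets) {
            resultBuckets.add(pool.submit(new Callable<Integer>() {
                public Integer call() {
                    int count = 0;
                    for (String element : bucket) {
                        if (element.contains("JavaWorld")) {
                            count++;
                        }
                    }
                    return count;
                }
            }));
        }

        // Step 3: reduce() -- aggregate the individual result buckets.
        int total = 0;
        for (Future<Integer> result : resultBuckets) {
            total += result.get();
        }
        pool.shutdown();

        // Step 4: the final result.
        System.out.println("Occurrences of JavaWorld: " + total);
    }

    private static List<List<String>> createBuckets() {
        List<List<String>> buckets = new ArrayList<List<String>>();
        for (int b = 0; b < 10; b++) {
            List<String> bucket = new ArrayList<String>();
            for (int i = 0; i < 10000; i++) {
                bucket.add(i % 100 == 0 ? "read JavaWorld" : "some other data");
            }
            buckets.add(bucket);
        }
        return buckets;
    }
}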


These four steps are very simple but there is so much power in them! Let's look at the details.

Dividing the data

In Step 1, note that the buckets created for you may be on a single machine or on multiple machines (though in the latter case they must be on the same cluster). In practice, that means that in large enterprise projects, multiple terabytes or petabytes of data can be segmented into thousands of buckets on different machines in the cluster, and processing can be performed in parallel, giving the user an extremely fast response. Google uses this concept to index every Web page it crawls. The results are even more impressive if you take advantage of the underlying filesystem used to store the data on the individual machines of the cluster; Google uses its proprietary Google File System (GFS) for this.

The map() function

In Step 2, the map() function understands exactly where it should go to process the data. The source of data may be memory, or disk, or another node in the cluster. Please note that bringing data to the place where the map() function resides is more costly and time-consuming than letting the function execute at the place where the data resides. If you write a C++ or Java program to process data on multiple threads, then the program fetches data from a data source (typically a remote database server) and is usually executed on the machine where your application is running. In MapReduce implementations, the computation happens on the distributed nodes.

The reduce() function

In Step 3, the reduce() function operates on one or more lists of intermediate results by fetching each of them from memory, disk, or a network transfer and performing a function on each element of each list. The final result of the complete operation is performed by collating and interpreting the results from all processes running reduce() operations.

In Step 4, you get the final output, which may be empty or may contain one or more data elements.

With this simple Java program under your belt, you're ready to understand the more complex MapReduce implementation in Apache Hadoop.

Apache Hadoop

Hadoop is an open source implementation of the MapReduce programming model. Hadoop relies not on Google File System (GFS), but on its own Hadoop Distributed File System (HDFS). HDFS replicates data blocks in a reliable manner and places them on different nodes; computation is then performed by Hadoop on these nodes. HDFS is similar to other filesystems, but is designed to be highly fault tolerant. This distributed filesystem does not require any high-end hardware, but can run on commodity computers and software; it is also scalable, which is one of the primary design goals for the implementation. HDFS is independent of any specific hardware or software platform, and is hence easily portable across heterogeneous systems.

If you've worked with clustered Java EE applications, you're probably familiar with the concepts of a master instance that manages other instances of the application server (called slaves) in a network deployment architecture. These master instances may be called deployment managers (if you're using WebSphere), manager servers (with WebLogic) or admin servers (with Tomcat). It is the responsibility of the master server instance to delegate various responsibilities to slave application server instances, to listen for handshaking signals from each instance so as to decide which are alive and which are dead, to do IP multicasting whenever required for synchronization of serializable sessions and data, and other similar tasks. The master stores the metadata and relevant port information of the slaves and works in a collaborative manner so that the end user feels as if there is only one instance.

HDFS works in much the same way. In the HDFS architecture, the master is called a NameNode and the slaves are called DataNodes. There is only a single NameNode in HDFS, whereas there are many DataNodes across the cluster, usually one per node. HDFS allocates a namespace (similar to a package in Java, a tablespace in Oracle, or a namespace in C++) for storing user data. A file might be split into one or more data blocks, and these data blocks are kept on a set of DataNodes. The NameNode holds the metadata describing how the blocks map to files and which blocks are stored on which DataNodes. Note that not every request needs to pass through the NameNode: client requests for reading and writing file data are served directly by the DataNodes, whereas namespace operations like opening, closing, and renaming directories are handled by the NameNode. The NameNode is also responsible for issuing instructions to DataNodes for data block creation, replication, and deletion.

A typical deployment of HDFS has a dedicated machine that runs only the NameNode. Each of the other machines in the cluster typically runs one instance of the DataNode software, though the architecture does allow you to run multiple DataNodes on the same machine. The NameNode manages the metadata repository and control operations, but never handles user data itself. The NameNode uses a special kind of log, named EditLog, for the persistence of metadata.
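
Client code never addresses NameNodes or DataNodes directly; it goes through the org.apache.hadoop.fs.FileSystem abstraction, which consults the NameNode for metadata and streams blocks to and from DataNodes behind the scenes. The following is a minimal sketch of ours (the path and contents are illustrative; with the stock configuration it simply runs against the local filesystem):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
    public static void main(String[] args) throws Exception {
        // Picks up fs.default.name from the configuration files on the classpath;
        // with the default ("file:///") this runs against the local filesystem.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/tmp/hdfs-hello.txt");   // illustrative path

        // Write: the NameNode records the metadata, the DataNodes store the blocks.
        FSDataOutputStream out = fs.create(file, true);
        out.writeUTF("Hello HDFS");
        out.close();

        // Read the data back.
        FSDataInputStream in = fs.open(file);
        System.out.println(in.readUTF());
        in.close();
    }
}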

Deploying Hadoop

Though Hadoop is a pure Java implementation, you can use it in two different ways. You can either take advantage of a streaming API provided with it or use Hadoop pipes. The latter option allows you to build Hadoop apps with C++; this article will focus on the former.


Hadoop's main design goal is to provide storage and communication on lots of homogeneous commodity machines. The implementers selected Linux as their initial platform for development and testing; hence, if you're working with Hadoop on Windows, you will have to install separate software to mimic the shell environment.

Hadoop can run in three different ways, depending on how the processes are distributed:

Standalone mode: This is the default mode provided with Hadoop. Everything is run as a single Java process.

Pseudo-distributed mode: Here, Hadoop is configured to run on a single machine, with different Hadoop daemons run as different Java processes.

Fully distributed or cluster mode: Typically, one machine in the cluster is designated as the NameNode and another machine as the JobTracker. There is exactly one NameNode in each cluster, which manages the namespace, filesystem metadata, and access control. You can also set up an optional SecondaryNameNode, used for periodic handshaking with NameNode for fault tolerance. The rest of the machines within the cluster act as both DataNodes and TaskTrackers. The DataNode holds the system data; each data node manages its own locally scoped storage, or its local hard disk. The TaskTrackers carry out map and reduce operations.
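
Which mode you end up in is ultimately a matter of configuration. As a rough illustration (our sketch, using the same property names that appear later in Listing 9; the defaults shown are those of a stock installation), you can inspect the two key settings from Java:

import org.apache.hadoop.conf.Configuration;

public class ShowMode {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // "file:///" means the local filesystem (standalone mode); a host:port
        // value points at a NameNode.
        System.out.println("fs.default.name    = " + conf.get("fs.default.name", "file:///"));
        // "local" means map and reduce tasks run in-process; a host:port value
        // points at a separate JobTracker (pseudo-distributed or cluster mode).
        System.out.println("mapred.job.tracker = " + conf.get("mapred.job.tracker", "local"));
    }
}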

Writing a Hadoop MapReduce application

The surest way to understand how Hadoop works is to walk through the process of writing a Hadoop MapReduce application. For the remainder of this article, we'll be working with EchoOhce, a simple MapReduce application that can reverse many strings. The input strings to be reversed represent the large amount of data that MapReduce applications typically work with. The example divides the data into different nodes, performs the reversal operations, combines the result strings, and then outputs the results. This application provides an opportunity to examine all of the main concepts of Hadoop. After you understand how it works, you'll see how it can be deployed in different modes.

First, take a look at the package declaration and imports in Listing 3. The EchoOhce class is in the com.javaworld.mapreduce package.

Listing 3. Package declarations for EchoOhce

package com.javaworld.mapreduce;

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.StringTokenizer;
import java.io.*;
import java.net.*;
import java.util.regex.MatchResult;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

The first set of imports is for the standard Java classes, and the second set is for the MapReduce implementation.

The EchoOhce class begins by extending org.apache.hadoop.conf.Configured and implementing the interface org.apache.hadoop.util.Tool, as you can see in Listing 4.

Listing 4. Extending Configured, implementing Tool

public class EchoOhce extends Configured implements Tool {
    // ...your code goes here
}

The Configured class is responsible for delivering the configuration parameters specified in certain XML files. This is done when the programmer invokes the getConf() method of this class. This method returns an instance of org.apache.hadoop.conf.Configuration, which is basically a holder for the resources specified as name-value pairs in XML data. Each resource is named by either a String or by an org.apache.hadoop.fs.Path instance.

By default, the two resources loaded in order from the classpath are:

hadoop-default.xml: This file contains read-only defaults for Hadoop, like global properties, logging properties, I/O properties, filesystem properties, and the like. If you want to use your own values for any of these properties, you can override them in hadoop-site.xml.

hadoop-site.xml: Here you can override the values in hadoop-default.xml that do not meet your specific objectives.

Please note that applications may add as many additional resources as you want; these are loaded in order from the classpath. You can find out more in the Hadoop API documentation for the addResource() and addFinalResource() methods; addFinalResource() lets you declare a resource final, so that subsequently loaded resources cannot alter its values.
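
As a hedged sketch of working with Configuration directly (the extra resource file and property name below are purely hypothetical examples of ours):

import org.apache.hadoop.conf.Configuration;

public class ConfigDemo {
    public static void main(String[] args) {
        // hadoop-default.xml and hadoop-site.xml are loaded automatically from the classpath.
        Configuration conf = new Configuration();

        // Additional resources are loaded in the order they are added.
        conf.addResource("my-app-site.xml");                        // hypothetical extra resource

        // Read a value, falling back to a default if it is defined nowhere.
        String greeting = conf.get("my.app.greeting", "hello");    // hypothetical property
        System.out.println("my.app.greeting = " + greeting);

        // Values set programmatically override values from the loaded resources.
        conf.set("my.app.greeting", "hello again");
        System.out.println("my.app.greeting = " + conf.get("my.app.greeting"));
    }
}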

You might have noticed that the code implements an interface named Tool. This interface supports the handling of generic command-line options. The interface requires the programmer to write a method, run(), that takes a String array as its parameter and returns an int; the returned integer indicates whether the execution was successful. Once you've implemented the run() method in your class, you can write your main() method, as in Listing 5.

Listing 5. main() method

public static void main(String[] args) throws Exception {
    int res = ToolRunner.run(new Configuration(), new EchoOhce(), args);
    System.exit(res);
}

The org.apache.hadoop.util.ToolRunner class invokes the run() method implemented in the EchoOhce class. The ToolRunner utility helps to run classes that implement the Tool interface. With this facility, developers can avoid writing a custom handler to process various input options.
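
As a small illustration of that division of labor (this toy class is ours, not part of the article's application), any class wired up through ToolRunner has generic options such as -conf and -D applied to its configuration before run() ever sees the remaining arguments:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class PrintArgsTool extends Configured implements Tool {

    // run() receives only the application-specific arguments; generic
    // command-line options have already been folded into getConf().
    public int run(String[] args) {
        for (String arg : args) {
            System.out.println("application argument: " + arg);
        }
        return 0;                       // 0 signals success
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new PrintArgsTool(), args));
    }
}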

Map and reduce

Now you can jump into the actual MapReduce implementation. You're going to write two inner classes within the EchoOhce class. They are:

Map: Includes functionality for processing input key-value pairs to generate output key-value pairs.

Reduce: Includes functionality for collecting output from parallel map processing and outputting that collected data.

Figure 1 illustrates how the sample app will work.

Figure 1. Map and Reduce in action

First, take a look at the Map class in Listing 6.

Listing 6. Map class


public static class Map extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    private Text inputText = new Text();
    private Text reverseText = new Text();

    public void map(LongWritable key, Text inputs,
                    OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {

        String inputString = inputs.toString();
        int length = inputString.length();
        StringBuffer reverse = new StringBuffer();
        for (int i = length - 1; i >= 0; i--) {
            reverse.append(inputString.charAt(i));
        }
        inputText.set(inputString);
        reverseText.set(reverse.toString());
        output.collect(inputText, reverseText);
    }
}

As mentioned earlier, the EchoOhce application must take an input string, reverse it, and return a key-value pair with input and reverse strings together. First, it gets the parameters for the map() function -- namely, the inputs and the output. From the inputs, it gets the input String. The application uses the simple Java API to find the reverse of this String, then creates a key-value pair by setting the input String and the reverse String. You end up with an OutputCollector instance, which contains the result of this processing. Assume that this is one result obtained from one execution of the map() function on one of the nodes.

Obviously, you'll need to combine all such outputs. This is exactly what the reduce() method of the Reduce class, shown in Listing 7, will do.

Listing 7. Reduce.reduce()

public static class Reduce extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {

    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        while (values.hasNext()) {
            output.collect(key, values.next());
        }
    }
}

The MapReduce framework knows how many OutputCollectors there are and which are to be combined for the final result. The reduce() method actually does the grunt work.


Finally, to complete EchoOhce's Main class, you need to set the values for your configuration. Basically, these values inform the MapReduce framework about the types of the output keys and values, the names of the Map and Reduce classes, and so on. The complete run() method is shown in Listing 8.

Listing 8. run()

public int run(String[] args) throws Exception {
    JobConf conf = new JobConf(getConf(), EchoOhce.class);
    conf.setJobName("EchoOhce");

...

Download complete Listing 8.

As you can see in the listing, you must first create a JobConf instance; org.apache.hadoop.mapred.JobConf extends the Configuration class. JobConf has the primary responsibility of sending your map and reduce implementations to the Hadoop framework for execution. Once the JobConf instance has been given the appropriate values for your MapReduce implementation, you invoke the most important method, runJob(), on the org.apache.hadoop.mapred.JobClient class, passing in the JobConf instance. JobClient internally communicates with the org.apache.hadoop.mapred.JobTracker class, and provides facilities for submitting jobs, tracking their progress, accessing logs, and getting cluster status.
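
The complete run() method is only available as a download. As a hedged sketch, the remainder of such a method typically looks like the following against the old org.apache.hadoop.mapred API (this is our reconstruction, not the downloadable listing; it slots into the EchoOhce class and also needs imports for org.apache.hadoop.mapred.FileInputFormat and org.apache.hadoop.mapred.FileOutputFormat):

public int run(String[] args) throws Exception {
    JobConf conf = new JobConf(getConf(), EchoOhce.class);
    conf.setJobName("EchoOhce");

    // Types of the keys and values emitted by Map and Reduce.
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);

    // The inner classes that implement the map and reduce steps.
    conf.setMapperClass(Map.class);
    conf.setReducerClass(Reduce.class);

    // First command-line argument: input directory; second: output directory.
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    // Hand the configured job to the framework and wait for it to finish.
    JobClient.runJob(conf);
    return 0;
}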

That should give you a good sense of how EchoOhce, a sample MapReduce application, works. We'll conclude with instructions for installing the relevant software and running the application.

Installing a MapReduce application in standalone mode

Unlike a Java EE application that can easily be deployed onto an app server, a MapReduce application using Hadoop requires some extra steps for deployment. First, you should understand the default, out-of-the-box way that Hadoop operates: standalone mode. The following steps describe how to set up the application on a Windows XP Professional system; the process would be almost identical for any other environment, with an important exception that you'll learn more about in a moment.

1. Ensure that version 5.0 or above of Java is installed on your machine.

2. Download the latest version of Hadoop. At the time that this article was published, the latest distribution was version 0.18.0. Save the downloads into a directory -- this example will use D:\hadoop.

3. Make sure that you're logged in with an OS user name that doesn't contain spaces. For example, a username like "Ravi" should be used rather than "Ravi Shankar". This is to avoid some problems (which will be fixed in later versions) when using SSH communication. Please also make sure that your system uses a username and password to log on at startup. Do not bypass authentication. SSH will synchronize with the Windows login while doing some handshakes.

4. As mentioned earlier, you will need an execution environment for shell scripts. If you're using a Unix-like OS, you already have a command line available to you; but on a Windows machine, you will need to install the Cygwin tools. Download the Cygwin package, making sure that you have selected the openSSH package (under the NET category) before you begin. For the other packages, you can simply use the defaults.

5. In this example, Java has been installed in D:\Tiger. You need to make Hadoop aware of this directory. Go to your Hadoop installation in the D:\hadoop directory, then to the conf subdirectory. Open the file named hadoop-env.sh and change the value of JAVA_HOME (uncommenting, if necessary) to the following:

6. export JAVA_HOME=/cygdrive/d/Tiger

(Note the /cygdrive prefix. This is how Cygwin maps your Windows directories to a Unix-style directory format.)

7. Start Cygwin by choosing Start > All Programs > Cygwin > Cygwin Bash Shell.

8. In Hadoop, communication between different processes across different machines is achieved through SSH, so the next important step is to get sshd running. If you're using SSH for the first time, please note that sshd needs a config file to run, which is generated by the following command:

9. ssh-host-config

When you enter this, you will usually get a prompt asking for the value of CYGWIN. Enter ntsec tty. If you are then asked whether privilege separation should be used, answer no. If asked for your consent for installing SSH as a service, answer yes.

Once this has been set up, start the sshd service by typing:

/usr/sbin/sshd

To make sure that sshd is running, check the process status:

ps | grep sshd

10. If sshd is running, you can try to SSH to localhost:

11. ssh localhost

If you're asked for a passphrase to SSH to the localhost, press Ctrl-C and enter:

ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys


12. Try running the example programs available at the Hadoop site. If all of the above steps have gone as they should, you should get the expected output.

13. Now it's time to create the input data for the EchoOhce application:
14. echo "Hello" >> word1
15. echo "World" >> word2
16. echo "Goodbye" >> word3
17. echo "JavaWorld" >> word4

18. Next, you need to put the files you just created into HDFS after creating a directory. Note that you do not need to create any partitions for HDFS. It comes as part of the Hadoop installation, and all you need to do is execute the following commands:

19. bin/hadoop dfs -mkdir words
20. bin/hadoop dfs -put word1 words/
21. bin/hadoop dfs -put word2 words/
22. bin/hadoop dfs -put word3 words/
23. bin/hadoop dfs -put word4 words/

24. Next, create a JAR file for the sample application. As an easy and extensible approach, create two environment variables in your machine, HADOOP_HOME and HADOOP_VERSION. (For the sample under consideration, the values will be D:\Hadoop and 0.17.1, respectively.) Now you can create EchoOhce.jar with the following commands:

25. mkdir EchoOhce_classes
26. javac -classpath ${HADOOP_HOME}/hadoop-${HADOOP_VERSION}-core.jar -d EchoOhce_classes EchoOhce.java
27. jar -cvf EchoOhce.jar -C EchoOhce_classes/ .

28. Finally, it's time to see the output. Run the application with the following command:

29. bin/hadoop jar EchoOhce.jar com.javaworld.mapreduce.EchoOhce words result

You will see an output screen with details like the following:

08/07/18 11:14:45 INFO streaming.StreamJob: map 0% reduce 0%
08/07/18 11:14:52 INFO streaming.StreamJob: map 40% reduce 0%
08/07/18 11:14:53 INFO streaming.StreamJob: map 80% reduce 0%
08/07/18 11:14:54 INFO streaming.StreamJob: map 100% reduce 0%
08/07/18 11:15:03 INFO streaming.StreamJob: map 100% reduce 100%
08/07/18 11:15:03 INFO streaming.StreamJob: Job complete: job_20080718003_0007
08/07/18 11:15:03 INFO streaming.StreamJob: Output: result

Now go to the result directory and look in the file named result. It should contain the following:

Hello     olleH
World     dlroW
Goodbye   eybdooG
JavaWorld dlroWavaJ

Installing a MapReduce application in real cluster mode


Running the sample application in standalone mode will prove that things are working properly, but it isn't really very exciting. To really demonstrate the power of Hadoop, you'll want to execute it in real cluster mode.

1. Pick six open port numbers that you can use; this example will use 8000 through 8005. (If the details from the netstat command reveal that these are not available, please feel free to use any six of your choice.) You will need four machines, MACH1 to MACH4, all interconnected either through a cable or wireless LAN. In the sample scenario described here, they are connected via a home network.

2. MACH1 will be the NameNode, and MACH2 will be the JobTracker. As mentioned earlier, in a cluster-based environment there will be only one of each.

3. Open the file named hadoop-site.xml under the conf directory of your Hadoop installation. Change the values to match those shown in Listing 9.

Listing 9. hadoop-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>MACH1:8000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>MACH2:8000</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.secondary.info.port</name>
    <value>8001</value>
  </property>
  <property>
    <name>dfs.info.port</name>
    <value>8002</value>
  </property>
  <property>
    <name>mapred.job.tracker.info.port</name>
    <value>8003</value>
  </property>
  <property>
    <name>tasktracker.http.port</name>
    <value>8004</value>
  </property>
</configuration>


4. Open the file named masters under the conf directory. Here you need to add the master NameNode and JobTracker names, as shown in Listing 10. (If there are existing entries, please replace them with those shown in the listing).

Listing 10. Adding NameNode and JobTracker names

MACH1
MACH2

5. Open the file named slaves under the conf directory. This is where you put the names of DataNodes, as shown in Listing 11. (Again, if there are existing entries in this file, please replace them.)

Listing 11. Adding DataNode names

MACH3
MACH4

6. Now you're ready to go, and it's time to start the Hadoop cluster. Log on to each node, accepting the defaults. Log into your NameNode as follows:

7. ssh MACH1

Now go to the Hadoop directory:

cd /hadoop0.17.1/

Execute the start script to launch HDFS:

bin/start-dfs.sh

(Note that you can stop this later with the stop-dfs.sh command.)

8. Start the JobTracker exactly as above, with the following commands:

9. ssh MACH2
10. cd /hadoop0.17.1/
11. bin/start-mapred.sh

(Again, this can be stopped later by the corresponding stop-mapred.sh command.)

You can now execute the EchoOhce application exactly as described in the previous section. The difference is that now the program will be executed across a cluster of DataNodes. You can confirm this by going to the Web interface provided with Hadoop. Point your browser to http://localhost:8002. (The default is actually port 50070; to see why you'd need to use port 8002 here, take a closer look at Listing 9.) You should see a frame similar to the one in Figure 2, showing the details of the NameNode and all jobs managed by it.


Figure 2. Hadoop Web interface, showing the number of nodes and their status

This Web interface provides many details to browse through, showing you the full statistics of your application. Hadoop comes with several different Web interfaces by default; you can see their default URLs in hadoop-default.xml. For example, in this sample application, http://localhost:8003 will show you JobTracker statistics. (The default port is 50030.)

In conclusion

In this article, we've presented the fundamentals of MapReduce programming with the open source Hadoop framework. This excellent framework accelerates the processing of large amounts of data through distributed processes, delivering very fast responses. It can be adopted and customized to meet various development requirements and can be scaled by increasing the number of nodes available for processing. The extensibility and simplicity of the framework are the key differentiators that make it a promising tool for data processing.

About the author

Ravi Shankar is an assistant vice president of technology development, currently working in the financial industry. He is a Sun Certified Programmer and Sun Certified Enterprise Architect with 15 years of industry experience. He has been a presenter at international conferences like JavaOne 2004, IIWAS 2004, and IMECS 2006. Ravi served earlier as a technical member of the OASIS Framework for Web Services Implementation Committee. He spends most of his leisure time exploring new technologies.

Govindu Narendra is a technical architect pursuing development of parallel processing technologies and portal development in data warehousing. He is a Sun Certified Programmer.

All contents copyright 1995-2008 Java World, Inc. http://www.javaworld.com
