Apache Hadoop Install Example
Using Ubuntu 12.04
Java 1.6
Hadoop 1.2.0
Static DNS
3 Machine Cluster
Install Step 1
Install Ubuntu Linux 12.04 on each machine
Assign a hostname and static IP address to each machine
Names used here
hc1nn ( hadoop cluster 1 name node )
hc1r1m1 ( hadoop cluster 1 rack 1 machine 1 )
hc1r1m2 ( hadoop cluster 1 rack 1 machine 2 )
Install ssh daemon on each server
Install the vsftpd ( ftp ) daemon on each server
Update /etc/hosts with all hostnames on each server
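As a sketch, /etc/hosts on each server might then contain ( these IP addresses are hypothetical; use your own static addresses )
192.168.1.10 hc1nn
192.168.1.11 hc1r1m1
192.168.1.12 hc1r1m2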
Install Step 2
Generate ssh keys on each server under the hadoop user
Copy the keys to the hadoop account on every server
Install Java 1.6 ( we used OpenJDK )
Obtain the Hadoop software from
hadoop.apache.org
Unpack Hadoop software to /usr/local
Now consider cluster architecture
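A minimal sketch of these steps on one server ( the Ubuntu package name and the tarball location are assumptions ):
sudo apt-get install openjdk-6-jdk
ssh-keygen -t rsa -P ""        ( run as the hadoop user )
ssh-copy-id hadoop@hc1nn       ( repeat for hc1r1m1 and hc1r1m2 )
sudo tar -xzf hadoop-1.2.0.tar.gz -C /usr/local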
Install Step 3
Start with three single-machine installs
of Hadoop
Then cluster the
Hadoop machines
Install Step 4
Ensure auto ( passwordless ) ssh
From the name node (hc1nn) to both data nodes
From each machine to itself
Create a symbolic link named hadoop pointing to /usr/local/hadoop-1.2.0
In the hadoop user's .bashrc on each machine, set
HADOOP_HOME
JAVA_HOME
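For example ( the OpenJDK path is an assumption; check where Java lives on your machines ):
sudo ln -s /usr/local/hadoop-1.2.0 /usr/local/hadoop
and in the hadoop user's ~/.bashrc
export HADOOP_HOME=/usr/local/hadoop
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64
To test auto ssh, ssh hc1r1m1 hostname should print the hostname without asking for a password.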
Install Step 5
Create Hadoop tmp dir on all servers
sudo mkdir -p /app/hadoop/tmp
sudo chown hadoop:hadoop /app/hadoop/tmp
sudo chmod 750 /app/hadoop/tmp
Set Up conf/core-site.xml
( on all servers )
<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
  <description>A base for other temporary directories.</description>
</property>
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system. A URI whose
  scheme and authority determine the FileSystem implementation. The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class. The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
</property>
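( These <property> elements sit inside the file's top-level <configuration> element; the same layout applies to mapred-site.xml and hdfs-site.xml in the next two steps )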
Install Step 6
Set Up conf/mapred-site.xml
( on all servers )
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at. If "local", then jobs are run in-process as a single map
  and reduce task.</description>
</property>
Install Step 7
Set Up conf/hdfs-site.xml
( on all servers )
<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.</description>
</property>
Install Step 8
Format the Hadoop file system ( on all servers )
hadoop namenode -format
Don't do this on a running HDFS; you will lose all data!
Now start Hadoop ( on all servers )
$HADOOP_HOME/bin/start-all.sh
Check that Hadoop is running with
sudo netstat -plten | grep java
You should see ports like 54310 and 54311 in use.
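Another quick check is jps ( part of the JDK ), which lists the Java daemons; on a single-machine Hadoop 1.x install you should see NameNode, SecondaryNameNode, DataNode, JobTracker and TaskTracker.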
All good? Stop Hadoop on all servers
$HADOOP_HOME/bin/stop-all.sh
Install Step 9
Now set up the cluster; do this on all servers
Set $HADOOP_HOME/conf/masters file to contain
hc1nn
Set $HADOOP_HOME/conf/slaves file to contain
hc1r1m1
hc1r1m2
hc1nn
We will be using the name node as a data node as well
Install Step 10
Change conf/core-site.xml on all machines
fs.default.name = hdfs://hc1nn:54310
Change conf/mapred-site.xml
mapred.job.tracker = hc1nn:54311
Change conf/hdfs-site.xml
dfs.replication = 3
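For example, the core-site.xml entry becomes
<property>
  <name>fs.default.name</name>
  <value>hdfs://hc1nn:54310</value>
</property>
with the same edit pattern for mapred.job.tracker and dfs.replication in the other two files.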
Install Step 11
Now reformat the HDFS on hc1nn
hadoop namenode -format
On the name node, start HDFS
$HADOOP_HOME/bin/start-dfs.sh
On the name node, start Map Reduce
$HADOOP_HOME/bin/start-mapred.sh
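As a sanity check, running jps as the hadoop user on the name node ( which is also a slave here ) should list NameNode, SecondaryNameNode, JobTracker, DataNode and TaskTracker, while each data node should list only DataNode and TaskTracker.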
Install Step 12
Run a test Map Reduce job
I have data in /tmp/gutenberg
Load Data into HDFS
hadoop dfs -copyFromLocal /tmp/gutenberg /usr/hadoop/gutenberg
List Data in HDFS
hadoop dfs -ls /usr/hadoop/gutenberg
Found 18 items
-rw-r--r-- 3 hadoop supergroup 674389 2013-07-30 19:31 /usr/hadoop/gutenberg/pg20417.txt
-rw-r--r-- 3 hadoop supergroup 674389 2013-07-30 19:31 /usr/hadoop/gutenberg/pg20417.txt1
...............
-rw-r--r-- 3 hadoop supergroup 834980 2013-07-30 19:31 /usr/hadoop/gutenberg/pg5000.txt4
-rw-r--r-- 3 hadoop supergroup 834980 2013-07-30 19:31 /usr/hadoop/gutenberg/pg5000.txt5
Install Step 13
Run the Map Reduce job
cd $HADOOP_HOME
hadoop jar hadoop*examples*.jar wordcount /usr/hadoop/gutenberg /usr/hadoop/gutenberg-output
Check the output
13/07/30 19:34:13 INFO input.FileInputFormat: Total input paths to process : 18
13/07/30 19:34:13 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/07/30 19:34:14 INFO mapred.JobClient: Running job: job_201307301931_0001
13/07/30 19:34:15 INFO mapred.JobClient: map 0% reduce 0%
13/07/30 19:34:26 INFO mapred.JobClient: map 11% reduce 0%
13/07/30 19:34:34 INFO mapred.JobClient: map 16% reduce 0%
13/07/30 19:34:35 INFO mapred.JobClient: map 22% reduce 0%
13/07/30 19:34:42 INFO mapred.JobClient: map 33% reduce 0%
13/07/30 19:34:43 INFO mapred.JobClient: map 33% reduce 7%
13/07/30 19:34:48 INFO mapred.JobClient: map 44% reduce 7%
13/07/30 19:34:52 INFO mapred.JobClient: map 44% reduce 14%
13/07/30 19:34:54 INFO mapred.JobClient: map 55% reduce 14%
13/07/30 19:35:01 INFO mapred.JobClient: map 66% reduce 14%
13/07/30 19:35:02 INFO mapred.JobClient: map 66% reduce 18%
13/07/30 19:35:06 INFO mapred.JobClient: map 72% reduce 18%
13/07/30 19:35:07 INFO mapred.JobClient: map 77% reduce 18%
13/07/30 19:35:08 INFO mapred.JobClient: map 77% reduce 25%
13/07/30 19:35:12 INFO mapred.JobClient: map 88% reduce 25%
13/07/30 19:35:17 INFO mapred.JobClient: map 88% reduce 29%
13/07/30 19:35:18 INFO mapred.JobClient: map 100% reduce 29%
13/07/30 19:35:23 INFO mapred.JobClient: map 100% reduce 33%
13/07/30 19:35:27 INFO mapred.JobClient: map 100% reduce 100%
13/07/30 19:35:28 INFO mapred.JobClient: Job complete: job_201307301931_0001
13/07/30 19:35:28 INFO mapred.JobClient: Counters: 29
13/07/30 19:35:28 INFO mapred.JobClient: Job Counters
13/07/30 19:35:28 INFO mapred.JobClient: Launched reduce tasks=1
13/07/30 19:35:28 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=119572
13/07/30 19:35:28 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
13/07/30 19:35:28 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
13/07/30 19:35:28 INFO mapred.JobClient: Launched map tasks=18
13/07/30 19:35:28 INFO mapred.JobClient: Data-local map tasks=18
13/07/30 19:35:28 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=61226
13/07/30 19:35:28 INFO mapred.JobClient: File Output Format Counters
13/07/30 19:35:28 INFO mapred.JobClient: Bytes Written=725257
13/07/30 19:35:28 INFO mapred.JobClient: FileSystemCounters
13/07/30 19:35:28 INFO mapred.JobClient: FILE_BYTES_READ=6977160
13/07/30 19:35:28 INFO mapred.JobClient: HDFS_BYTES_READ=17600721
13/07/30 19:35:28 INFO mapred.JobClient: FILE_BYTES_WRITTEN=14994585
13/07/30 19:35:28 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=725257
13/07/30 19:35:28 INFO mapred.JobClient: File Input Format Counters
13/07/30 19:35:28 INFO mapred.JobClient: Bytes Read=17598630
13/07/30 19:35:28 INFO mapred.JobClient: Map-Reduce Framework
Install Step 14
Check the job output
hadoop dfs -ls /usr/hadoop/gutenberg-output
Found 3 items
-rw-r--r-- 3 hadoop supergroup 0 2013-07-30 19:35 /usr/hadoop/gutenberg-output/_SUCCESS
drwxr-xr-x - hadoop supergroup 0 2013-07-30 19:34 /usr/hadoop/gutenberg-output/_logs
-rw-r--r-- 3 hadoop supergroup 725257 2013-07-30 19:35 /usr/hadoop/gutenberg-output/part-r-00000
Now get results out of HDFS
hadoop dfs -cat /usr/hadoop/gutenberg-output/part-r-00000 > /tmp/hrun/cluster_run.txt
head -10 /tmp/hrun/cluster_run.txt
"(Lo)cra"6
"14906
"1498,"6
"35"6
"40,"6
"A12
"AS-IS".6
"A_6
"Absoluti6
"Alack!6
Install Step 15
Congratulations, you now have
A working HDFS cluster
With three data nodes
One name node
Tested via a Map Reduce job
Detailed install instructions are available from our site shop
Contact Us
Feel free to contact us at
www.semtech-solutions.co.nz
We offer IT project consultancy
We are happy to hear about your problems
You pay only for the hours you need
To solve your problems