Apache Hadoop Install Example
Using Ubuntu 12.04
Java 1.6
Hadoop 1.2.0
Static DNS
3 Machine Cluster
Install Step 1
Install Ubuntu Linux 12.04 on each machine
Assign a hostname and static IP address to each machine
Names used here
hc1nn ( hadoop cluster 1 name node )
hc1r1m1 ( hadoop cluster 1 rack 1 machine 1 )
hc1r1m2 ( hadoop cluster 1 rack 1 machine 2 )
Install ssh daemon on each server
Install the vsftpd ( ftp ) daemon on each server
Update /etc/hosts with all hostnames on each server
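As a sketch, /etc/hosts on each server might then contain ( these IP addresses are hypothetical; use your own static addresses )
192.168.1.10 hc1nn
192.168.1.11 hc1r1m1
192.168.1.12 hc1r1m2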
Install Step 2
Generate ssh keys on each server under the hadoop user
Copy the keys to the hadoop account on every server
Install Java 1.6 ( we used OpenJDK )
Obtain the Hadoop software from
hadoop.apache.org
Unpack Hadoop software to /usr/local
Now consider cluster architecture
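A minimal sketch of these steps on one server ( the Ubuntu package name and the tarball location are assumptions ):
sudo apt-get install openjdk-6-jdk
ssh-keygen -t rsa -P ""        ( run as the hadoop user )
ssh-copy-id hadoop@hc1nn       ( repeat for hc1r1m1 and hc1r1m2 )
sudo tar -xzf hadoop-1.2.0.tar.gz -C /usr/local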
Install Step 3
Start with three single-machine installs
of Hadoop
Then cluster the
Hadoop machines
Install Step 4
Ensure auto ( passwordless ) ssh
From the name node (hc1nn) to both data nodes
From each machine to itself
Create a symbolic link named hadoop pointing to /usr/local/hadoop-1.2.0
In the hadoop user's .bashrc on each machine, set
HADOOP_HOME
JAVA_HOME
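For example ( the OpenJDK path is an assumption; check where Java lives on your machines ):
sudo ln -s /usr/local/hadoop-1.2.0 /usr/local/hadoop
and in the hadoop user's ~/.bashrc
export HADOOP_HOME=/usr/local/hadoop
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64
To test auto ssh, ssh hc1r1m1 hostname should print the hostname without asking for a password.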
Install Step 5
Create Hadoop tmp dir on all servers
sudo mkdir -p /app/hadoop/tmp
sudo chown hadoop:hadoop /app/hadoop/tmp
sudo chmod 750 /app/hadoop/tmp
Set Up conf/core-site.xml
( on all servers )
<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
  <description>A base for other temporary directories.</description>
</property>
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system. A URI whose
  scheme and authority determine the FileSystem implementation. The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class. The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
</property>
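( These <property> elements sit inside the file's top-level <configuration> element; the same layout applies to mapred-site.xml and hdfs-site.xml in the next two steps )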
Install Step 6
Set Up conf/mapred-site.xml
( on all servers )
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at. If "local", then jobs are run in-process as a single map
  and reduce task.</description>
</property>
Install Step 7
Set Up conf/hdfs-site.xml
( on all servers )
<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.</description>
</property>
Install Step 8
Format the Hadoop file system ( on all servers )
hadoop namenode -format
Don't do this on a running HDFS; you will lose all data!
Now start Hadoop ( on all servers )
$HADOOP_HOME/bin/start-all.sh
Check that Hadoop is running with
sudo netstat -plten | grep java
You should see ports like 54310 and 54311 in use.
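Another quick check is jps ( part of the JDK ), which lists the Java daemons; on a single-machine Hadoop 1.x install you should see NameNode, SecondaryNameNode, DataNode, JobTracker and TaskTracker.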
All good? Stop Hadoop on all servers
$HADOOP_HOME/bin/stop-all.sh
Install Step 9
Now set up the cluster; do this on all servers
Set $HADOOP_HOME/conf/masters file to contain
hc1nn
Set $HADOOP_HOME/conf/slaves file to contain
hc1r1m1
hc1r1m2
hc1nn
We will be using the name node as a data node as well
Install Step 10
Change conf/core-site.xml on all machines
fs.default.name = hdfs://hc1nn:54310
Change conf/mapred-site.xml
mapred.job.tracker = hc1nn:54311
Change conf/hdfs-site.xml
dfs.replication = 3
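For example, the core-site.xml entry becomes
<property>
  <name>fs.default.name</name>
  <value>hdfs://hc1nn:54310</value>
</property>
with the same edit pattern for mapred.job.tracker and dfs.replication in the other two files.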
Install Step 11
Now reformat the HDFS on hc1nn
hadoop namenode -format
On the name node, start HDFS
$HADOOP_HOME/bin/start-dfs.sh
On the name node, start Map Reduce
$HADOOP_HOME/bin/start-mapred.sh
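As a sanity check, running jps as the hadoop user on the name node ( which is also a slave here ) should list NameNode, SecondaryNameNode, JobTracker, DataNode and TaskTracker, while each data node should list only DataNode and TaskTracker.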
Install Step 12
Run a test Map Reduce job
I have data in /tmp/gutenberg
Load Data into HDFS
hadoop dfs -copyFromLocal /tmp/gutenberg /usr/hadoop/gutenberg
List Data in HDFS
hadoop dfs -ls /usr/hadoop/gutenberg
Found 18 items
-rw-r--r-- 3 hadoop supergroup 674389 2013-07-30 19:31 /usr/hadoop/gutenberg/pg20417.txt
-rw-r--r-- 3 hadoop supergroup 674389 2013-07-30 19:31 /usr/hadoop/gutenberg/pg20417.txt1
...............
-rw-r--r-- 3 hadoop supergroup 834980 2013-07-30 19:31 /usr/hadoop/gutenberg/pg5000.txt4
-rw-r--r-- 3 hadoop supergroup 834980 2013-07-30 19:31 /usr/hadoop/gutenberg/pg5000.txt5
Install Step 13
Run the Map Reduce job
cd $HADOOP_HOME
hadoop jar hadoop*examples*.jar wordcount /usr/hadoop/gutenberg /usr/hadoop/gutenberg-output
Check the output
13/07/30 19:34:13 INFO input.FileInputFormat: Total input paths to process : 18
13/07/30 19:34:13 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/07/30 19:34:14 INFO mapred.JobClient: Running job: job_201307301931_0001
13/07/30 19:34:15 INFO mapred.JobClient: map 0% reduce 0%
13/07/30 19:34:26 INFO mapred.JobClient: map 11% reduce 0%
13/07/30 19:34:34 INFO mapred.JobClient: map 16% reduce 0%
13/07/30 19:34:35 INFO mapred.JobClient: map 22% reduce 0%
13/07/30 19:34:42 INFO mapred.JobClient: map 33% reduce 0%
13/07/30 19:34:43 INFO mapred.JobClient: map 33% reduce 7%
13/07/30 19:34:48 INFO mapred.JobClient: map 44% reduce 7%
13/07/30 19:34:52 INFO mapred.JobClient: map 44% reduce 14%
13/07/30 19:34:54 INFO mapred.JobClient: map 55% reduce 14%
13/07/30 19:35:01 INFO mapred.JobClient: map 66% reduce 14%
13/07/30 19:35:02 INFO mapred.JobClient: map 66% reduce 18%
13/07/30 19:35:06 INFO mapred.JobClient: map 72% reduce 18%
13/07/30 19:35:07 INFO mapred.JobClient: map 77% reduce 18%
13/07/30 19:35:08 INFO mapred.JobClient: map 77% reduce 25%
13/07/30 19:35:12 INFO mapred.JobClient: map 88% reduce 25%
13/07/30 19:35:17 INFO mapred.JobClient: map 88% reduce 29%
13/07/30 19:35:18 INFO mapred.JobClient: map 100% reduce 29%
13/07/30 19:35:23 INFO mapred.JobClient: map 100% reduce 33%
13/07/30 19:35:27 INFO mapred.JobClient: map 100% reduce 100%
13/07/30 19:35:28 INFO mapred.JobClient: Job complete: job_201307301931_0001
13/07/30 19:35:28 INFO mapred.JobClient: Counters: 29
13/07/30 19:35:28 INFO mapred.JobClient: Job Counters
13/07/30 19:35:28 INFO mapred.JobClient: Launched reduce tasks=1
13/07/30 19:35:28 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=119572
13/07/30 19:35:28 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
13/07/30 19:35:28 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
13/07/30 19:35:28 INFO mapred.JobClient: Launched map tasks=18
13/07/30 19:35:28 INFO mapred.JobClient: Data-local map tasks=18
13/07/30 19:35:28 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=61226
13/07/30 19:35:28 INFO mapred.JobClient: File Output Format Counters
13/07/30 19:35:28 INFO mapred.JobClient: Bytes Written=725257
13/07/30 19:35:28 INFO mapred.JobClient: FileSystemCounters
13/07/30 19:35:28 INFO mapred.JobClient: FILE_BYTES_READ=6977160
13/07/30 19:35:28 INFO mapred.JobClient: HDFS_BYTES_READ=17600721
13/07/30 19:35:28 INFO mapred.JobClient: FILE_BYTES_WRITTEN=14994585
13/07/30 19:35:28 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=725257
13/07/30 19:35:28 INFO mapred.JobClient: File Input Format Counters
13/07/30 19:35:28 INFO mapred.JobClient: Bytes Read=17598630
13/07/30 19:35:28 INFO mapred.JobClient: Map-Reduce Framework
Install Step 14
Check the job output
hadoop dfs -ls /usr/hadoop/gutenberg-output
Found 3 items
-rw-r--r-- 3 hadoop supergroup 0 2013-07-30 19:35 /usr/hadoop/gutenberg-output/_SUCCESS
drwxr-xr-x - hadoop supergroup 0 2013-07-30 19:34 /usr/hadoop/gutenberg-output/_logs
-rw-r--r-- 3 hadoop supergroup 725257 2013-07-30 19:35 /usr/hadoop/gutenberg-output/part-r-00000
Now get results out of HDFS
hadoop dfs -cat /usr/hadoop/gutenberg-output/part-r-00000 > /tmp/hrun/cluster_run.txt
head -10 /tmp/hrun/cluster_run.txt
"(Lo)cra"6
"14906
"1498,"6
"35"6
"40,"6
"A12
"AS-IS".6
"A_6
"Absoluti6
"Alack!6
Install Step 15
Congratulations, you now have
A working HDFS cluster
With three data nodes
One name node
Tested via a Map Reduce job
Detailed install instructions are available from our site shop
Contact Us
Feel free to contact us at
www.semtech-solutions.co.nz
We offer IT project consultancy
We are happy to hear about your problems
You pay only for the hours you need
To solve your problems