Hadoop Shell Commands

Posted on November 26, 2013 by aravindu012

This post will help users learn important and useful Hadoop shell commands, which are very close to Unix shell commands. Using these commands, a user can perform different operations on HDFS. People who are familiar with Unix shell commands can easily get hold of them; people who are new to Unix or Hadoop need not worry, just follow this article to learn the commands used on a day-to-day basis and practice them.

FS Shell

The FileSystem (FS) shell is invoked by bin/hadoop fs <args>. All the FS shell commands take path URIs as arguments. The URI format is scheme://authority/path. For HDFS the scheme is hdfs, and for the local filesystem the scheme is file. The scheme and authority are optional. If not specified, the default scheme specified in the configuration is used. An HDFS file or directory such as /parent/child can be specified as hdfs://namenodehost/parent/child or simply as /parent/child.
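For example, assuming hdfs is the configured default scheme and the NameNode runs on the namenodehost machine mentioned above, the following two commands refer to the same directory:

hadoop fs -ls hdfs://namenodehost/parent/child
hadoop fs -ls /parent/child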

Administrator Commands:

fsck
Runs the HDFS filesystem checking utility.
Example: hadoop fsck /

balancer
Runs a cluster balancing utility. An administrator can simply press Ctrl-C to stop the re-balancing process.
Example: hadoop balancer

version
Prints the Hadoop version configured on the machine.
Example: hadoop version

Hadoop fs commands:

ls
For a file, returns stat on the file with the following format:
filename filesize modification_date modification_time permissions userid groupid
For a directory, it returns a list of its direct children, as in Unix. A directory is listed as:
dirname modification_time permissions userid groupid
Usage: hadoop fs -ls <args>
Example: hadoop fs -ls

lsr
Recursive version of ls. Similar to Unix ls -R.
Usage: hadoop fs -lsr <args>
Example: hadoop fs -lsr

mkdir
Takes path URIs as an argument and creates directories. The behavior is much like Unix mkdir -p, creating parent directories along the path.
Usage: hadoop fs -mkdir <paths>
Example: hadoop fs -mkdir /user/hadoop/Aravindu

mv
Moves files from source to destination. This command allows multiple sources as well, in which case the destination needs to be a directory. Moving files across filesystems is not permitted.
Usage: hadoop fs -mv URI [URI ...] <dest>
Example: hadoop fs -mv /user/hduser/Aravindu/Consolidated_Sheet.csv /user/hduser/sandela/

put
Copies a single src, or multiple srcs, from the local filesystem to the destination filesystem. Also reads input from stdin and writes to the destination filesystem.
Usage: hadoop fs -put <localsrc> ... <dst>
Example: hadoop fs -put localfile /user/hadoop/hadoopfile
hadoop fs -put localfile1 localfile2 /user/hadoop/hadoopdir

rm
Deletes files specified as args. It does not delete directories recursively; refer to rmr for recursive deletes.
Usage: hadoop fs -rm URI [URI ...]
Example: hadoop fs -rm /user/hduser/sandela/Consolidated_Sheet.csv

rmr
Recursive version of delete.
Usage: hadoop fs -rmr URI [URI ...]
Example: hadoop fs -rmr /user/hduser/sandela

cat
Concatenates and displays files; it works similarly to the Unix cat command.
Usage: hadoop fs -cat URI [URI ...]
Example: hadoop fs -cat /user/hduser/Aravindu/Consolidated_sheet.csv

chgrp
Changes the group association of files. With -R, makes the change recursively through the directory structure. The user must be the owner of the file, or else a super-user.
Usage: hadoop fs -chgrp [-R] GROUP URI [URI ...]
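For instance, to hand the sample directory used above over to a group named hadoop (the group created later in this guide), recursively:

hadoop fs -chgrp -R hadoop /user/hduser/Aravindu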

chmod
Changes the permissions of files. With -R, makes the change recursively through the directory structure. The user must be the owner of the file, or else a super-user.
Usage: hadoop fs -chmod [-R] <MODE[,MODE]... | OCTALMODE> URI [URI ...]
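For instance, to make the same sample directory readable and executable by everyone while keeping write access for the owner (mode 755 is just an illustrative choice):

hadoop fs -chmod -R 755 /user/hduser/Aravindu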

chown
Changes the owner of files. With -R, makes the change recursively through the directory structure. The user must be a super-user.
Usage: hadoop fs -chown [-R] [OWNER][:[GROUP]] URI [URI ...]
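For instance, to give ownership of the sample directory to the hduser user and hadoop group used throughout this guide:

hadoop fs -chown -R hduser:hadoop /user/hduser/Aravindu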

copyFromLocal
Copies a file from the local machine into the given Hadoop directory. Similar to the put command.
Usage: hadoop fs -copyFromLocal <localsrc> URI
Example: hadoop fs -copyFromLocal /home/hduser/contact_details/output/job_titles/11-26-2013part-r-00000 /user/hduser/finaloutput_112613/

copyToLocal
Copies a file from a Hadoop directory into the local directory. Similar to the get command.
Usage: hadoop fs -copyToLocal [-ignorecrc] [-crc] URI <localdst>
Example: hadoop fs -copyToLocal /user/hduser/output/Expecting_result_set_112613/part-r-00000 /home/hduser/contact_wordcount/11-26-2013

cp
Copies files from source to destination. This command allows multiple sources as well, in which case the destination must be a directory.
Usage: hadoop fs -cp URI [URI ...] <dest>
Example: hadoop fs -cp /user/hduser/input/Consolodated_Sheet.csv /user/hduser/Aravindu
hadoop fs -cp /user/hadoop/file1 /user/hadoop/file2 /user/hadoop/dir

du
Displays the aggregate length of files contained in the directory, or the length of a file in case it is just a file.
Usage: hadoop fs -du URI [URI ...]
Example: hadoop fs -du /user/hduser/Aravindu/Consolidated_Sheet.csv

dus
Displays a summary of file lengths.
Usage: hadoop fs -dus <args>
Example: hadoop fs -dus /user/hduser/Aravindu/Consolidated_Sheet.csv

count
Counts the number of directories, files and bytes under the paths that match the specified file pattern.
Example: hadoop fs -count hdfs:/

expunge
Empties the Trash.
Usage: hadoop fs -expunge
Example: hadoop fs -expunge

setrep
Changes the replication factor of a file. The -R option recursively changes the replication factor of files within a directory.
Usage: hadoop fs -setrep [-R] <path>
Example: hadoop fs -setrep -w 3 -R /user/hadoop/Aravindu

stat
Returns the stat information on the path.
Usage: hadoop fs -stat URI [URI ...]
Example: hadoop fs -stat /user/hduser/Aravindu

tail
Displays the last kilobyte of the file to stdout. The -f option can be used as in Unix.
Usage: hadoop fs -tail [-f] URI
Example: hadoop fs -tail /user/hduser/Aravindu/Info

text
Takes a source file and outputs the file in text format. The allowed formats are zip and TextRecordInputStream.
Usage: hadoop fs -text <src>
Example: hadoop fs -text /user/hduser/Aravindu/Info

touchz
Creates a file of zero length.
Usage: hadoop fs -touchz URI [URI ...]
Example: hadoop fs -touchz /user/hduser/Aravind/emptyfile

Installing single node Hadoop 2.2.0 on Ubuntu

Posted on November 2, 2013 by aravindu012

Please find the complete step-by-step process for installing the Hadoop 2.2.0 stable version on Ubuntu, as requested by many of this blog's visitors, friends and subscribers.

The Apache Hadoop 2.2.0 release has significant changes compared to its previous stable release, Apache Hadoop 1.2.1 (setting up Hadoop 1.2.1 can be found here). In short, this release has a number of changes compared to version 1.2.1:

YARN: a general-purpose resource management system for Hadoop that allows MapReduce and other data processing frameworks like Hive and Pig, as well as services
High Availability for HDFS
HDFS Federation, Snapshots
NFSv3 access to data in HDFS
Introduced the Application Manager to manage the application life cycle
Support for running Hadoop on Microsoft Windows
The HDFS Symlinks feature is disabled and will be taken out in future versions
The JobTracker has been replaced with the Resource Manager and Node Manager

Before starting to set up Apache Hadoop 2.2.0, please understand the concepts of Big Data and Hadoop from my previous blog posts: Big Data Characteristics, Problems and Solution; What is Apache Hadoop?; Setting up Single node Hadoop Cluster; Setting up Multi node Hadoop Cluster; Understanding HDFS architecture (in comic format).

Setting up the environment:

In this tutorial you will learn the step-by-step process for setting up a Hadoop single node cluster, so that you can play around with the framework and learn more about it. In this tutorial we are using the following software versions, which you can download by clicking the hyperlinks: Ubuntu Linux 12.04.3 LTS; Hadoop 2.2.0, released in October 2013.

If you are using PuTTY to access your Linux box remotely, please install openssh by running this command; this also helps in configuring SSH access easily in the later part of the installation:

sudo apt-get install openssh-server

Prerequisites:
1. Installing Java v1.7
2. Adding a dedicated Hadoop system user
3. Configuring SSH access
4. Disabling IPv6

Before installing any applications or software, please make sure your list of packages from all repositories and PPAs is up to date, or update them by using this command:

sudo apt-get update

1. Installing Java v1.7:

Running Hadoop requires Java v1.7+.

a. Download the latest Oracle Java Linux version from the Oracle website by using this command:

wget https://edelivery.oracle.com/otn-pub/java/jdk/7u45-b18/jdk-7u45-linux-x64.tar.gz

If it fails to download, please try the following command, which helps to avoid passing a username and password:

wget --no-cookies --no-check-certificate --header "Cookie: gpw_e24=http%3A%2F%2Fwww.oracle.com" "https://edelivery.oracle.com/otn-pub/java/jdk/7u45-b18/jdk-7u45-linux-x64.tar.gz"

b. Unpack the compressed Java binaries in the directory:

sudo tar xvzf jdk-7u45-linux-x64.tar.gz

c. Create a Java directory under /usr/local/ and change the directory to /usr/local/Java by using these commands:

sudo mkdir -p /usr/local/Java
cd /usr/local/Java

d. Copy the Oracle Java binaries into the /usr/local/Java directory:

sudo cp -r jdk1.7.0_45 /usr/local/Java

e. Edit the system PATH file /etc/profile and add the following system variables to your system path:

sudo nano /etc/profile or sudo gedit /etc/profile

f. Scroll down to the end of the file using your arrow keys and add the following lines to the end of your /etc/profile file:

JAVA_HOME=/usr/local/Java/jdk1.7.0_45
PATH=$PATH:$HOME/bin:$JAVA_HOME/bin
export JAVA_HOME
export PATH

g. Inform your Ubuntu Linux system where your Oracle Java JDK/JRE is located. This will tell the system that the new Oracle Java version is available for use:

sudo update-alternatives --install "/usr/bin/javac" "javac" "/usr/local/Java/jdk1.7.0_45/bin/javac" 1
sudo update-alternatives --set javac /usr/local/Java/jdk1.7.0_45/bin/javac

This command notifies the system that the Oracle Java JDK is available for use.

h. Reload your system-wide PATH /etc/profile by typing the following command:

. /etc/profile

Test to see if Oracle Java was installed correctly on your system:

java -version

2. Adding a dedicated Hadoop system user:

We will use a dedicated Hadoop user account for running Hadoop. While that is not required, it is recommended, because it helps to separate the Hadoop installation from other software applications and user accounts running on the same machine.

a. Adding a group:

sudo addgroup hadoop

b. Creating a user and adding the user to the group:

sudo adduser --ingroup hadoop hduser

It will ask you to provide a new UNIX password and user information.

3. Configuring SSH access:

SSH key-based authentication is required so that the master node can log in to the slave nodes (and the secondary node) to start/stop them, and also to the local machine if you want to use Hadoop with it. For our single-node setup of Hadoop, we therefore need to configure SSH access to localhost for the hduser user we created in the previous section. Before this step, make sure that SSH is up and running on your machine and that it is configured to allow SSH public key authentication.

Generating an SSH key for the hduser user:

a. Log in as hduser.

b. Run this key generation command:

ssh-keygen -t rsa -P ""

c. It will ask you to provide the file name in which to save the key; just press Enter so that it will generate the key at /home/hduser/.ssh.

d. Enable SSH access to your local machine with this newly created key:

cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

e. The final step is to test the SSH setup by connecting to your local machine with the hduser user:

ssh hduser@localhost

This will add localhost permanently to the list of known hosts.

4. Disabling IPv6:

We need to disable IPv6 because Ubuntu uses the 0.0.0.0 IP for different Hadoop configurations. You will need to run the following command using a root account:

sudo gedit /etc/sysctl.conf

Add the following lines to the end of the file and reboot the machine to apply the configuration correctly:

# disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
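After the reboot you can confirm that the setting took effect; this check is not part of the original post, but it is a common sanity test:

cat /proc/sys/net/ipv6/conf/all/disable_ipv6

A value of 1 means IPv6 is disabled.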

Hadoop Installation:

Go to Apache Downloads and download Hadoop version 2.2.0 (prefer to download a stable version).

i. Run the following command to download Hadoop version 2.2.0:

wget http://apache.mirrors.pair.com/hadoop/common/stable2/hadoop-2.2.0.tar.gz

ii. Unpack the compressed Hadoop file by using this command:

tar xvzf hadoop-2.2.0.tar.gz

iii. Rename the hadoop-2.2.0 directory to hadoop by using the given command:

mv hadoop-2.2.0 hadoop

iv. Move the hadoop package to a location of your choice; I picked /usr/local for my convenience:

sudo mv hadoop /usr/local/

v. Make sure to change the owner of all the files to the hduser user and hadoop group by using this command:

sudo chown -R hduser:hadoop /usr/local/hadoop

Configuring Hadoop:

The following are the files we will use for the configuration of the single node Hadoop cluster:
a. yarn-site.xml
b. core-site.xml
c. mapred-site.xml
d. hdfs-site.xml
e. Update $HOME/.bashrc

We can find these files in the Hadoop configuration directory:

cd /usr/local/hadoop/etc/hadoop

a. yarn-site.xml:

Add the following properties inside the configuration element:

<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>

b. core-site.xml:

i. Change the user to hduser. Change the directory to /usr/local/hadoop/etc/hadoop and edit the core-site.xml file:

vi core-site.xml

ii. Add the following entry to the file, then save and quit the file:

<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:9000</value>
</property>

c. mapred-site.xml:

If this file does not exist, copy mapred-site.xml.template as mapred-site.xml.

i. Edit the mapred-site.xml file:

vi mapred-site.xml

ii. Add the following entry to the file, then save and quit the file:

<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>

d. hdfs-site.xml:

i. Edit the hdfs-site.xml file:

vi hdfs-site.xml

ii. Create two directories to be used by the namenode and datanode:

mkdir -p $HADOOP_HOME/yarn_data/hdfs/namenode
mkdir -p $HADOOP_HOME/yarn_data/hdfs/datanode

iii. Add the following entries to the file, then save and quit the file:

<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:/usr/local/hadoop/yarn_data/hdfs/namenode</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:/usr/local/hadoop/yarn_data/hdfs/datanode</value>
</property>

e. Update $HOME/.bashrc:

i. Go back to the home directory and edit the .bashrc file:

vi .bashrc

ii. Add the following lines to the end of the file:

# Set Hadoop-related environment variables
export HADOOP_PREFIX=/usr/local/hadoop
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=${HADOOP_HOME}
export HADOOP_COMMON_HOME=${HADOOP_HOME}
export HADOOP_HDFS_HOME=${HADOOP_HOME}
export YARN_HOME=${HADOOP_HOME}
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop

# Native path
export HADOOP_COMMON_LIB_NATIVE_DIR=${HADOOP_PREFIX}/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_PREFIX/lib"

# Java path
export JAVA_HOME='/usr/local/Java/jdk1.7.0_45'

# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin:$JAVA_HOME/bin:$HADOOP_HOME/sbin
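The post moves straight on to formatting HDFS, but the new variables only take effect in a fresh shell; to apply them in the current session, reload the file and check that the hadoop binary is on the PATH:

source ~/.bashrc
hadoop version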

Formatting and Starting/Stopping the HDFS filesystem via the NameNode:

i. The first step to starting up your Hadoop installation is formatting the Hadoop filesystem, which is implemented on top of the local filesystem of your cluster. You need to do this the first time you set up a Hadoop cluster. Do not format a running Hadoop filesystem, as you will lose all the data currently in the cluster (in HDFS). To format the filesystem (which simply initializes the directory specified by the dfs.name.dir variable), run the following command:

hadoop namenode -format

ii. Start the Hadoop daemons by running the following commands:

Name node: $ hadoop-daemon.sh start namenode
Data node: $ hadoop-daemon.sh start datanode
Resource Manager: $ yarn-daemon.sh start resourcemanager

Node Manager: $ yarn-daemon.sh start nodemanager
Job History Server: $ mr-jobhistory-daemon.sh start historyserver
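To confirm that all five daemons came up, you can run the jps command (the multi-node section later in this post uses the same check); the output should list NameNode, DataNode, ResourceManager, NodeManager and JobHistoryServer, each prefixed with its process id:

jps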

iii. Stop Hadoop by running the following commands:

stop-dfs.sh
stop-yarn.sh

Hadoop Web Interfaces:

Hadoop comes with several web interfaces which are by default available at these locations:
HDFS NameNode health: http://localhost:50070
HDFS Secondary NameNode status: http://localhost:50090

By this we are done setting up a single node Hadoop cluster v2.2.0; hope this step-by-step guide helps you to set up the same environment at your end. Please leave a comment/suggestion in the comment section, I will try to answer asap, and don't forget to subscribe to the newsletter and leave a Facebook like.

Setting up Hive

Posted on October 28, 2013 by aravindu012

As I said earlier, Apache Hive is an open-source data warehouse infrastructure built on top of Hadoop for providing data summary, query and analysis of large datasets stored in Hadoop files. It was developed by Facebook and it provides:
Tools to enable easy data extract/transform/load (ETL)
A mechanism to impose structure on a variety of data formats
Access to files stored either directly in Apache HDFS or in other data storage systems such as Apache HBase
Query execution via MapReduce

In this post we will get to know how to set up Hive on top of a Hadoop cluster.

Objective

The objective of this tutorial is to set up Hive and run HiveQL scripts.

Prerequisites

The following are the prerequisites for setting up Hive. You should have the latest stable build of Hadoop up and running; to install Hadoop, please check my previous blog article on Hadoop Setup.

Setting up Hive:

Procedure

1. Download a stable version of the Hive file from the Apache download mirrors. For this tutorial we are using Hive-0.12.0; this release works with Hadoop 0.20.X, 1.X, 0.23.X and 2.X.

wget http://apache.osuosl.org/hive/hive-0.12.0/hive-0.12.0.tar.gz

2. Unpack the compressed Hive in the home directory:

tar xvzf hive-0.12.0.tar.gz

3. Create a hive directory under the /usr/local directory as the root user and change the ownership to hduser as shown; this is for our convenience, to differentiate each framework, software and application with different users:

cd /usr/local
mkdir hive
sudo chown -R hduser:hadoop /usr/local/hive

4. Log in as hduser and move the uncompressed hive-0.12.0 to the /usr/local/hive folder:

mv hive-0.12.0/ /usr/local/hive

5. Set HIVE_HOME in $HOME/.bashrc so it will be set every time you log in. Add the following entries to the .bashrc file:

export HIVE_HOME='/usr/local/hive/hive-0.12.0'
export PATH=$HADOOP_HOME/bin:$HIVE_HOME/bin:$PATH

6. Reload the .bashrc file using this command:

. .bashrc

Setting up Hive on top of Hadoop is taken care of; let's test it:

7. Start Hive by executing the following command:

hive

8. Create a table in Hive with the following command, and after creating it check that the table exists:

create table test (field1 string, field2 string);
show tables;

9. Show extended details on the table:

describe extended test;

By this output we know that Hive was set up correctly on top of the Hadoop cluster; it's time to learn HiveQL.

What is Apache Hive?

Posted on October 28, 2013 by aravindu012

Apache Hive is an open-source data warehouse infrastructure built on top of Hadoop for providing data summary, query and analysis of large datasets stored in Hadoop files. It was developed by Facebook and it provides:
Tools to enable easy data extract/transform/load (ETL)
A mechanism to impose structure on a variety of data formats
Access to files stored either directly in Apache HDFS or in other data storage systems such as Apache HBase
Query execution via MapReduce

It supports queries expressed in a language called HiveQL, which automatically translates SQL-like queries into MapReduce jobs executed on Hadoop. In addition, HiveQL supports custom MapReduce scripts to be plugged into queries. Hive also enables data serialization/deserialization and increases flexibility in schema design by including a system catalog called the Hive Metastore. According to the Apache Hive wiki, Hive is not designed for OLTP workloads and does not offer real-time queries or row-level updates. It is best used for batch jobs over large sets of append-only data (like web logs). Hive supports text files (also called flat files), SequenceFiles (flat files consisting of binary key/value pairs) and RCFiles (Record Columnar Files, which store the columns of a table in a columnar-database way). There is a SQL Developer/Toad kind of tool named HiveDeveloper, developed by Stratapps Inc, which gives users the power to visualize their data stored in Hadoop as table views and do many more operations. In my next blog I will explain how to set up Hive on top of a Hadoop cluster; before that, please check how to set up Hadoop in my previous blog post so that you will be ready to configure Hive on top of it.

Setting up Pig

Posted on October 28, 2013 by aravindu012

Apache Pig is a high-level procedural language platform developed to simplify querying large data sets in Apache Hadoop and MapReduce. Pig is popular for performing query operations in Hadoop using the Pig Latin language, a layer that enables SQL-like queries to be performed on distributed datasets within Hadoop applications, thanks to its simple interface and support for complex operations such as joins and filters. It has the following key properties:
Ease of programming: Pig programs are easy to write and accomplish huge tasks just as other MapReduce programs do.
Optimization: the system optimizes the execution of Pig jobs automatically, allowing the user to focus on semantics rather than efficiency.
Extensibility: users can write their own user-defined functions (UDFs) to do special-purpose processing as required, using Java, Python and JavaScript.

Objective

The objective of this tutorial is to set up Pig and run Pig scripts.

Prerequisites

The following are the prerequisites for setting up Pig and running Pig scripts. You should have the latest stable build of Hadoop up and running; to install Hadoop, please check my previous blog article on Hadoop Setup.

Setting up Pig

Procedure

1. Download a stable version of the Pig file from the Apache download mirrors. For this tutorial we are using pig-0.11.1; this release works with Hadoop 0.20.X, 1.X, 0.23.X and 2.X.

wget http://apache.mirrors.hoobly.com/pig/pig-0.11.1/pig-0.11.1.tar.gz

2. Copy the Pig binaries into the /usr/local/pig directory:

cp -r pig-0.11.1.tar.gz /usr/local/pig

3. Change the directory to /usr/local/pig by using this command:

cd /usr/local/pig

4. Unpack the compressed Pig in the directory /usr/local/pig:

sudo tar xvzf pig-0.11.1.tar.gz

5. Set PIG_HOME in $HOME/.bashrc so it will be set every time you log in. Add the following line to it:

export PIG_HOME=

e.g.

export PIG_HOME='/usr/local/pig/pig-0.11.1'
export PATH=$HADOOP_HOME/bin:$PIG_HOME/bin:$JAVA_HOME/bin:$PATH

6. Set the environment variable JAVA_HOME to point to the Java installation directory, which Pig uses internally:

export JAVA_HOME=

Execution Modes

Pig has two modes of execution: local mode and MapReduce mode.

Local Mode

Local mode is usually used to verify and debug Pig queries and/or scripts on smaller datasets which a single machine can handle. It runs on a single JVM and accesses the local filesystem. To run in local mode, pass the following command:

$ pig -x local
grunt>

MapReduce Mode

This is the default mode; Pig translates the queries into MapReduce jobs, which requires access to a Hadoop cluster.

$ pig
2013-10-28 11:39:44,767 [main] INFO org.apache.pig.Main - Apache Pig version 0.11.1 (r1459641) compiled Mar 22 2013, 02:13:53
2013-10-28 11:39:44,767 [main] INFO org.apache.pig.Main - Logging error messages to: /home/hduser/pig_1382985584762.log
2013-10-28 11:39:44,797 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /home/hduser/.pigbootup not found
2013-10-28 11:39:45,094 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://Hadoopmaster:54310
2013-10-28 11:39:45,592 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: Hadoopmaster:54311
grunt>

You can see from the log reports the filesystem and jobtracker Pig connected to. Grunt is an interactive shell for your Pig queries. You can run Pig programs in three ways: via a script, via Grunt, or by embedding the script into Java code. Running in the interactive shell is shown in the Problem section. To run a batch of Pig scripts, it is recommended to place them in a single file with the .pig extension and execute them in batch mode; I will explain this in depth in coming posts.

What is Apache Pig?

Posted on October 28, 2013 by aravindu012

Apache Pig is a high-level procedural language platform developed to simplify querying large data sets in Apache Hadoop and MapReduce. Pig is popular for performing query operations in Hadoop using the Pig Latin language, a layer that enables SQL-like queries to be performed on distributed datasets within Hadoop applications, thanks to its simple interface and support for complex operations such as joins and filters. It has the following key properties:
Ease of programming: Pig programs are easy to write and accomplish huge tasks just as other MapReduce programs do.
Optimization: the system optimizes the execution of Pig jobs automatically, allowing the user to focus on semantics rather than efficiency.
Extensibility: users can write their own user-defined functions (UDFs) to do special-purpose processing as required, using Java, Python and JavaScript.

How it works:

Pig runs on Hadoop and makes use of MapReduce and the Hadoop Distributed File System (HDFS). The language for the platform is called Pig Latin, which abstracts the Java MapReduce idiom into a form similar to SQL. Pig Latin is a flow language which allows you to write a data flow that describes how your data will be transformed. Since Pig Latin scripts can be graphs, it is possible to build complex data flows involving multiple inputs, transforms and outputs. Users can extend Pig Latin by writing their own user-defined functions, using Java, Python, Ruby or other scripting languages.

We can run Pig in two modes:

Local Mode: local mode is usually used to verify and debug Pig queries and/or scripts on smaller datasets which a single machine can handle. It runs on a single JVM and accesses the local filesystem.

MapReduce Mode: this is the default mode; Pig translates the queries into MapReduce jobs, which requires access to a Hadoop cluster.

We will discuss more about Pig, setting up Pig with Hadoop, and running Pig Latin scripts in local and MapReduce mode in my next posts.

Hadoop Installation using Cloudera CDH4 virtual machine

Posted on October 27, 2013 by aravindu012

This document helps to configure a Hadoop cluster with the help of the Cloudera VM in pseudo mode, using VMware Player on a user machine for practice.

Step 1: Download VMware Player from the link shown and install it as shown in the images.

URL to download VMware Player (non-commercial use): https://my.vmware.com/web/vmware/free#desktop_end_user_computing/vmware_player/6_0|PLAYER-600-A|product_downloads

Step 2: Download the Cloudera setup file from the given URL and extract the zipped file onto your hard drive.

URL to download the Cloudera VM: http://www.cloudera.com/content/support/en/downloads/download-components/download-products.html?productID=F6mO278Rvo&version=1

Step 3: Start VMware Player and click Open a Virtual Machine.

Browse the extracted folder and select the .vmx file as shown.

Clicking the Play virtual machine button will start the VM in a couple of minutes, but sometimes you may get this issue: "This host does not support Intel VT-x".

There are two reasons why you are getting this error:
1. Your CPU doesn't support it.
2. Your CPU does support it, but you have it disabled in the BIOS.

Enabling Virtualization extensions in the BIOS:

1. Reboot the computer and open the system's BIOS menu. This can usually be done by pressing the Delete key, the F1 key, or the Alt and F4 keys, depending on the system.
2. Enable the Virtualization extensions in the BIOS. Many of the steps below may vary depending on your motherboard, processor type, chipset and OEM. Refer to your system's accompanying documentation for the correct information on configuring your system.
a. Open the Processor submenu. The processor settings menu may be hidden in the Chipset, Advanced CPU Configuration or Northbridge menus.
b. Enable Intel Virtualization Technology (also known as Intel VT-x). AMD-V extensions cannot be disabled in the BIOS and should already be enabled. The Virtualization extensions may be labeled Virtualization Extensions, Vanderpool or various other names depending on the OEM and system BIOS.
c. Enable Intel VT-d or AMD IOMMU, if the options are available. Intel VT-d and AMD IOMMU are used for PCI device assignment.
d. Select Save & Exit.
3. Reboot the machine. By following these steps this problem should be solved.

Login credentials:

Machine login credentials are:
a. Username: cloudera
b. Password: cloudera

Cloudera Manager credentials are:
a. Username: admin
b. Password: admin

Click on the black box shown below in the image to start a terminal.

Step 4: Checking your Hadoop cluster.
Type sudo jps to see if all nodes are running (if you see an error, wait for some time and then try again; your threads are not started yet).
Type sudo su hdfs and execute your command, e.g. hadoop fs -ls /

Step 5: Download the list of Hadoop commands for reference from the given URL.
URL to download the list of commands: http://hadoop.apache.org/docs/r1.0.4/commands_manual.pdf

Hadoop multi-node cluster setup

Posted on October 27, 2013 by aravindu012

The following document describes the required steps for setting up a distributed multi-node Apache Hadoop cluster on two Ubuntu machines. The best way to install and set up a multi node cluster is to start by installing two individual single node Hadoop clusters, following my previous tutorial on setting up a Hadoop single node cluster on Ubuntu, and then merge them together with minimal configuration changes, in which one Ubuntu box becomes the designated master and the other box becomes a slave; we can add any number of slaves as per our future requirements. Please follow my previous blog post for setting up a Hadoop single node cluster on Ubuntu.

1. Prerequisites

i. Networking:

Networking plays an important role here. Before merging both single node servers into a multi node cluster, we need to make sure that both nodes can ping each other (they need to be connected on the same network/hub, or both machines must be able to speak to each other). Once we are done with this process, we will move to the next step of selecting the master node and slave node; here we are selecting 172.16.17.68 as the master machine (Hadoopmaster) and 172.16.17.61 as a slave (hadoopnode). Then we need to add them to the /etc/hosts file on each machine as follows:

sudo vi /etc/hosts

172.16.17.68 Hadoopmaster
172.16.17.61 hadoopnode

Note: The addition of more slaves should be updated here in each machine, using unique names for the slaves (e.g.: 172.16.17.xx hadoopnode01, 172.16.17.xy hadoopnode02 and so on).

ii. Enabling SSH:

hduser on the master (Hadoopmaster) machine needs to be able to connect to its own master (Hadoopmaster) account, and also needs to connect as hduser to the slave (hadoopnode) machine, via password-less SSH login.

hduser@Hadoopmaster:~$ ssh-copy-id -i ~/.ssh/id_rsa.pub hduser@hadoopnode

If you see the expected output when you run the given commands on both master and slave, then we configured it correctly:

ssh Hadoopmaster
ssh hadoopnode

2. Configurations:

The following are the required files we will use for the configuration of the multi node Hadoop cluster:
a. masters
b. slaves
c. core-site.xml
d. mapred-site.xml
e. hdfs-site.xml

Let's configure each config file accordingly.

a. masters:

On the master (Hadoopmaster) machine we need to configure the masters file accordingly, as shown in the image, and add the master (Hadoopmaster) node name:

vi masters
Hadoopmaster

b. slaves:

Lists the hosts, one per line, where the Hadoop slave daemons (DataNodes and TaskTrackers) will be running, as shown:

Hadoopmaster
hadoopnode

If you have additional slave nodes, just add them to the conf/slaves file, one hostname per line.

Configuring all *-site.xml files:

We need to use the same configurations on all the nodes of the Hadoop cluster, i.e. we need to edit all *-site.xml files on each and every server accordingly.

c. core-site.xml:

We are changing the host name from localhost to Hadoopmaster, which specifies the NameNode (the HDFS master) host and port:

vi core-site.xml
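The entry itself is not shown in the text; a minimal sketch looks like the following, where the port 54310 is an assumption taken from the Pig section's log, which connects to hdfs://Hadoopmaster:54310:

<property>
  <name>fs.default.name</name>
  <value>hdfs://Hadoopmaster:54310</value>
</property>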

d. hdfs-site.xml:

We are changing the replication factor to 2. The default value of dfs.replication is 3; however, we have only two nodes available, so we set dfs.replication to 2:

vi hdfs-site.xml
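Again the entry is not shown in the text; a sketch of it, based on the replication factor just described, would be:

<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>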

e. mapred-site.xml:

We are changing the host name from localhost to Hadoopmaster, which specifies the JobTracker (MapReduce master) host and port:

vi mapred-site.xml
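A sketch of the entry, with the port 54311 assumed from the Pig section's log, which connects to the job tracker at Hadoopmaster:54311:

<property>
  <name>mapred.job.tracker</name>
  <value>Hadoopmaster:54311</value>
</property>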

3. Formatting and Starting/Stopping the HDFS filesystem via the NameNode:

The first step to starting up your multi-node Hadoop cluster is formatting the Hadoop filesystem, which is implemented on top of the local filesystem of your cluster. To format the filesystem (which simply initializes the directory specified by the dfs.name.dir variable), run the given command:

hadoop namenode -format

4. Starting the multi-node cluster:

Starting the cluster is performed in two steps. We begin by starting the HDFS daemons first: the NameNode daemon is started on Hadoopmaster and DataNode daemons are started on all nodes (slaves). Then we start the MapReduce daemons: the JobTracker is started on Hadoopmaster and TaskTracker daemons are started on all nodes (slaves).

a. To start the HDFS daemons:

start-dfs.sh

This will bring the NameNode up, along with the DataNodes listed in conf/slaves.

By running the jps command, we will see the list of Java processes running on the master and slaves:

b. To start the MapReduce daemons:

start-mapred.sh

This will bring up the MapReduce cluster with the JobTracker running on the machine you ran the previous command on, and TaskTrackers on the machines listed in the conf/slaves file.

By running the jps command, we will see the list of Java processes, including the JobTracker and TaskTracker, running on the master and slaves:

c. To stop the MapReduce daemons:

stop-mapred.sh

d. To stop the HDFS daemons:

stop-dfs.sh

5. Running a MapReduce job:

Use a much larger volume of data as input, as we are running in a cluster:

hadoop jar hadoop*examples*.jar wordcount /user/hduser/demo /user/hduser/demo-output

We can observe the namenode, mapreduce and tasktracker processes on the web interface by following the given URLs:
http://Hadoopmaster:50070/ - web UI of the NameNode daemon
http://Hadoopmaster:50030/ - web UI of the JobTracker daemon
http://Hadoopmaster:50060/ - web UI of the TaskTracker daemon
*Hadoopmaster can be replaced with the machine IP.

By this we are done setting up a multi-node Hadoop cluster; hope this step-by-step guide helps you to set up the same environment at your place. Please leave a comment in the comment section with your doubts, questions and suggestions; I will try to answer asap.

What Is Apache Hadoop?

Posted on October 24, 2013 by aravindu012

The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing. Apache Hadoop is a framework, written in Java, that allows the storage of large data sets distributed across clusters of computers using simple programming models, running on anything from a single computer to large clusters of commodity hardware. It derived from papers published by Google and incorporated the features of the Google File System (GFS) and the MapReduce paradigm, named the Hadoop Distributed File System and Hadoop MapReduce.

How does it fit in?

Hadoop Key Characteristics:

Scalable: new nodes can be added as needed, without needing to change data formats, how data is loaded, how jobs are written, or the applications on top.
Economical: Hadoop brings massively parallel computing to commodity servers. The result is a sizeable decrease in the cost per terabyte of storage, which in turn makes it affordable to model all your data.
Flexible: Hadoop is schema-less, and can absorb any type of data, structured or not, from any number of sources. Data from multiple sources can be joined and aggregated in arbitrary ways, enabling deeper analyses than any one system can provide.
Reliable: when you lose a node, the system redirects work to another location of the data and continues processing without missing a beat.

(Credit: Cloudera Blog)

Hadoop Ecosystem:

Hadoop Distributed File System (HDFS):

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It is highly fault-tolerant, is designed to be deployed on low-cost hardware, is designed to store very large amounts of data (terabytes or even petabytes), and provides high-throughput access to this information. Files are stored in blocks across multiple machines to ensure their availability on machine failure and their suitability for parallel processing. It has many similarities with existing distributed file systems; however, the differences from other distributed file systems are significant:
HDFS is designed to store a very large amount of information (terabytes or petabytes). This requires spreading the data across a large number of machines.
HDFS stores data reliably. If individual machines in the cluster fail, data is still available through data redundancy.
HDFS provides fast, scalable access to the information loaded on the clusters. It is possible to serve a larger number of clients by simply adding more machines to the cluster.
HDFS integrates well with Hadoop MapReduce, allowing data to be read and computed upon locally whenever needed.
HDFS was originally built as infrastructure for the Apache Nutch web search engine project.

Assumptions and Goals:

HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.

Commodity Hardware Failure:

Hadoop doesn't require expensive, highly reliable hardware. It's designed to run on clusters of commodity hardware; an HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system's data. The fact that there are a huge number of components and that each component has a non-trivial probability of failure means that some component of HDFS is always non-functional. Therefore, detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.

Continuous Data Access:

Applications that run on HDFS need continuous access to their data sets. HDFS is designed more for batch processing rather than interactive use by users. The emphasis is on high throughput of data access rather than low latency of data access.

Very Large Data Files:

Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes in size. So, HDFS is tuned to support large files.

It is also worth examining the applications for which HDFS does not work so well. While this may change in the future, these are areas where HDFS is not a good fit today:

Low-latency data access: applications that require low-latency access to data, in the tens of milliseconds range, will not work well with HDFS. Remember that HDFS is optimized for delivering a high throughput of data, and this may be at the expense of latency.

Lots of small files: since the namenode holds filesystem metadata in memory, the limit to the number of files in a filesystem is governed by the amount of memory on the namenode. As a rule of thumb, each file, directory and block takes about 150 bytes. So, for example, if you had one million files, each taking one block, you would need at least 300 MB of memory. While storing millions of files is feasible, billions are beyond the capability of current hardware.

Multiple writers, arbitrary file modifications: files in HDFS may be written to by a single writer. Writes are always made at the end of the file. There is no support for multiple writers, or for modifications at arbitrary offsets in the file.

References:
http://blog.cloudera.com/wp-content/uploads/2010/03/HDFS_Reliability.pdf
http://en.wikipedia.org/wiki/Apache_Hadoop
http://hadoop.apache.org/
Hadoop: The Definitive Guide

Big Data Characteristics, Problems and Solution

Posted on October 23, 2013 by aravindu012

What is Big Data? Big Data has become a buzzword these days: data in every possible form, whether through social media, structured, unstructured, text, images, audio, video, log files, emails, simulations, 3D models, military surveillance, e-commerce and so on, amounts to around some zettabytes of data! This huge data is what we call BIG DATA! Big Data is nothing but a synonym for data so huge and complex that it becomes very tiresome or slow to capture, store, process, retrieve and analyze it with the help of relational database management tools or traditional data processing techniques. Let's take a moment to look into the above picture; it should explain what happens every 60 seconds on the internet. By this we can understand how much data is being generated in a second, a minute, a day or a year, and how exponentially it is growing; as per the analysis by TechNewsDaily, we might generate more than 8 zettabytes of data by 2015. Over the next 10 years: the number of servers worldwide will grow by 10x, the amount of information managed by enterprise data centers will grow by 50x, and the number of files an enterprise data center handles will grow by 75x. Systems and enterprises generate huge amounts of data, from terabytes to even petabytes of information.

Big Data characteristics:

Volume: Big Data depends upon how large it is. It could amount to hundreds of terabytes or even petabytes of information.
Velocity: the increasing rate at which data flows into an organization.
Variety: a common theme in big data systems is that the source data is diverse and doesn't fall into neat relational structures.

As we are speaking about data size, the above image will help us to understand or correlate what we are speaking about.

Big Data problems:

From our earlier discussion we now understand what Big Data is; let's discuss the problems we might face with Big Data. Traditional systems built within the company for handling relational databases may not be able to support or scale as data is generated with high volume, velocity and variety.

1. Volume: for instance, terabytes of Facebook posts or 400 billion annual Twitter tweets could mean Big Data! This data needs to be stored in order to analyze it and come up with data science reports for different solutions and problem-solving approaches.
2. Velocity: Big Data requires fast processing. The time factor plays a very crucial role in several organizations. For instance, when processing millions of records on the share market, the system needs to be able to write this data at the same speed at which it is coming into the system.
3. Variety: Big Data may not belong to a specific format. It could be in any form, such as structured, unstructured, text, images, audio, video, log files, emails, simulations, 3D models, etc. To date we have mostly worked with structured data; it might be difficult to handle unstructured or semi-structured data with the quality and quantity we are generating on a daily basis.

How Big Data can handle this situation:

Distributed File System (DFS): with the help of a distributed file system (DFS) we can divide a large set of data files into smaller blocks and load them onto multiple machines, ready for parallel processing.
Parallel Processing: data resides on N servers and holds the power of N servers, and can be processed in parallel for analysis, which helps the user reduce the wait time to generate the final report or analyzed data.
Fault Tolerance: one of the primary reasons to use a Big Data framework (e.g. Hadoop) to run your jobs is its high degree of fault tolerance. Even when running jobs on a large cluster where individual nodes or network components may experience high rates of failure, Big Data frameworks can guide jobs toward successful completion because the data is replicated onto multiple nodes/slaves.
Use of Commodity Hardware: most Big Data tools and frameworks need only commodity hardware, which reduces the cost of the total infrastructure and makes it very easy to add more clusters as the data size increases.