
Hadoop Administration Interview Questions


Hadoop: How to interview for a Hadoop admin job? These are a few problems whose solutions a good Hadoop admin should know.

List 3 hadoop fs shell commands that perform a copy operation: fs -copyToLocal, fs -copyFromLocal, fs -put.

How to decommission nodes from the HDFS cluster? - Add the nodes to the exclude file referenced by dfs.hosts.exclude and execute hadoop dfsadmin -refreshNodes (the decommissioning procedure is described in more detail below).

How to add new nodes to the HDFS cluster? - Add the new node's hostname to the slaves file and start the DataNode and TaskTracker daemons on the new node, as sketched below.
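A minimal sketch, assuming the cluster is managed from $HADOOP_HOME and the new host is newnode.example.com (both names are assumptions):

# on the master: add the new host to the slaves file so cluster start/stop scripts include it
echo "newnode.example.com" >> $HADOOP_HOME/conf/slaves

# on the new node: start the DataNode and TaskTracker daemons
$HADOOP_HOME/bin/hadoop-daemon.sh start datanode
$HADOOP_HOME/bin/hadoop-daemon.sh start tasktracker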

How to perform a copy across multiple HDFS clusters? - Use distcp to copy files between clusters.

How to verify whether HDFS is corrupt? - Run hadoop fsck to check for missing or corrupt blocks.
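For example, a minimal sketch that checks the whole namespace starting at the filesystem root (the extra flags just add per-file and per-block detail):

hadoop fsck / -files -blocks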

What are the default configuration files that are used in Hadoop? As of the 0.20 release, Hadoop supported the following read-only default configurations:
- src/core/core-default.xml
- src/hdfs/hdfs-default.xml
- src/mapred/mapred-default.xml

How will you make changes to the default configuration files? Hadoop does not recommend changing the default configuration files; instead it recommends making all site-specific changes in the following files:
- conf/core-site.xml
- conf/hdfs-site.xml
- conf/mapred-site.xml

Unless explicitly turned off, Hadoop by default specifies two resources, loaded in order from the classpath:
- core-default.xml: read-only defaults for Hadoop.
- core-site.xml: site-specific configuration for a given Hadoop installation.

Hence, if the same configuration property is defined in both core-default.xml and core-site.xml, the value in core-site.xml is used, since the site-specific file overrides the read-only defaults (the same is true for the other two file pairs).
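For example, a minimal sketch of a site-specific override written into conf/core-site.xml; the property value and hostname are illustrative assumptions, and the heredoc writes a fresh file, so in practice you would merge this with any existing settings. Anything set here wins over core-default.xml:

cat > $HADOOP_HOME/conf/core-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenode.example.com:8020</value>
  </property>
</configuration>
EOF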

Consider a scenario where you have set the property mapred.output.compress to true to ensure that all output files are compressed for efficient space usage on the cluster. If a cluster user does not want to compress data for a specific job, what will you recommend they do? Ask them to create their own configuration file, set mapred.output.compress to false in it, and load this file as a resource in their job. The property can also be overridden on the command line, as sketched below.
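A minimal sketch of the command-line route, assuming the job's driver uses ToolRunner/GenericOptionsParser so that -D options are picked up (the jar name, class name, and paths are illustrative assumptions):

hadoop jar myjob.jar com.example.MyJob -Dmapred.output.compress=false /user/foo/input /user/foo/output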

Which of the following is the only required variable that needs to be set in conf/hadoop-env.sh for Hadoop to work?
- HADOOP_LOG_DIR
- JAVA_HOME
- HADOOP_CLASSPATH
The only required variable is JAVA_HOME, which needs to point to the root of the Java installation.
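For example, a minimal sketch of the line in conf/hadoop-env.sh (the JDK path is an assumption for your system):

export JAVA_HOME=/usr/lib/jvm/java-6-sun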

List all the daemons required to run the Hadoop cluster:
- NameNode
- DataNode
- JobTracker
- TaskTracker

What's the default port that the JobTracker web UI listens on? - 50030.

What's the default port that the DFS NameNode web UI listens on? - 50070.

Hadoop: How to test a Hadoop cluster setup? You just finished a Hadoop cluster setup; how will you verify that it was successful? Perform the following steps to make sure the cluster setup was successful.

1. Copy/put data onto HDFS. Use the following commands to put some test data on the HDFS cluster:
> hadoop fs -put <localsrc> <dst>
or
> hadoop fs -copyFromLocal <localsrc> <dst>
Now run hadoop fs -ls to verify the copied files and make sure that the copy/put operation was successful. To make sure there were no errors in DataNode connections, tail the NameNode and DataNode logs for any errors. Possible connection errors or permission problems will be exposed by the copy operation.
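A minimal sketch, assuming a local test file at /tmp/test.txt and an existing HDFS home directory /user/hadoop (both are assumptions):

hadoop fs -put /tmp/test.txt /user/hadoop/test.txt
hadoop fs -ls /user/hadoop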

2. Run the word count job. To make sure that Hadoop MapReduce is working properly, run the word count job, e.g.:
> hadoop jar $HADOOP_HOME/hadoop-examples.jar wordcount <input dir> <output dir>
Check that the word count job completes successfully and that the output directory is created. Also make sure that there are no error messages while the job is running.
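A minimal sketch, assuming the examples jar name used in 0.20-era releases and reusing the test file put on HDFS above (jar name and paths are assumptions; the output directory must not already exist):

hadoop jar $HADOOP_HOME/hadoop-examples.jar wordcount /user/hadoop/test.txt /user/hadoop/wordcount-out
hadoop fs -ls /user/hadoop/wordcount-out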

3. Run a Teragen job. You can run a Teragen job to write a large amount of data to the cluster, making sure that all DataNodes, TaskTrackers, and network connections are working correctly, e.g.:
> hadoop jar $HADOOP_HOME/hadoop-examples.jar teragen 1000000000 <output dir>
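A minimal sketch with a smaller row count; each teragen row is 100 bytes, so 10,000,000 rows is roughly 1 GB (the row count and output path are assumptions):

hadoop jar $HADOOP_HOME/hadoop-examples.jar teragen 10000000 /user/hadoop/teragen-out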

4. Run TestDFSIO or DFSCIOTest. Finally, to measure the throughput of the cluster, you can run the TestDFSIO or DFSCIOTest benchmarks.

Hadoop: How to decommission nodes? You want to take some data nodes out of your cluster; what is the graceful way to remove nodes without corrupting the file system? On a large cluster, removing one or two data nodes will not lead to any data loss, because the NameNode will replicate their blocks as soon as it detects that the nodes are dead. With a large number of nodes getting removed or dying, however, the probability of losing data is higher.

Hadoop offers a decommission feature to retire a set of existing data nodes. The nodes to be retired should be included in the exclude file, and the exclude file name should be specified via the configuration parameter dfs.hosts.exclude. This file should have been specified during NameNode startup. It can be a zero-length file. You must use the full hostname, ip, or ip:port format in this file. Then the shell command

bin/hadoop dfsadmin -refreshNodes

should be called, which forces the name-node to re-read the exclude file and start the decommission process.
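A minimal sketch of both steps, assuming dfs.hosts.exclude already points at /etc/hadoop/conf/dfs.exclude (the file path and hostname are assumptions):

# append the node to retire, using its full hostname
echo "datanode03.example.com" >> /etc/hadoop/conf/dfs.exclude
bin/hadoop dfsadmin -refreshNodes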

Decommission does not happen momentarily since it requires replication of potentially a large number of blocks and we do not want the cluster to be overwhelmed with just this one job. The decommission progress can be monitored on the name-node Web UI. Until all blocks are replicated the node will be in "Decommission In Progress" state. When decommission is done the state will change to "Decommissioned". The nodes can be removed whenever decommission is finished.

The decommission process can be terminated at any time by editing the configuration or the exclude files and repeating the -refreshNodes command.

Hadoop: How to configure the Hadoop NameNode to store data on multiple volumes/disks? The NameNode supports storing its metadata (the namespace image and the edits log) in multiple directories. The directories are specified via the dfs.name.dir configuration parameter in hdfs-site.xml. The NameNode directories are used for namespace data replication, so that the image and the log can be restored from the remaining volumes if one of them fails.

Example: Add this to hdfs-site.xml

<property>
  <name>dfs.name.dir</name>
  <value>/data/data01/hadoop/hdfs/name,/data/data02/hadoop/hdfs/name</value>
  <final>true</final>
</property>

Hadoop: How to configure Hadoop data nodes to store data on multiple volumes/disks? DataNodes can store blocks in multiple directories, typically allocated on different local disk drives. In order to set up multiple directories, specify a comma-separated list of pathnames as the value of the configuration parameter dfs.data.dir in hdfs-site.xml. DataNodes will attempt to place an equal amount of data in each of the directories.

Example: Add the following to hdfs-site.xml

<property>
  <name>dfs.data.dir</name>
  <value>/data/data01/hadoop/hdfs/data,/data/data02/hadoop/hdfs/data,/data/data03/hadoop/hdfs/data,/data/data04/hadoop/hdfs/data,/data/data05/hadoop/hdfs/data,/data/data06/hadoop/hdfs/data</value>
  <final>true</final>
</property>

Hadoop: How to create/write-to HDFS files directly from map/reduce tasks? You can use ${mapred.output.dir} to get this done.

${mapred.output.dir} is the eventual output directory for the job (JobConf.setOutputPath / JobConf.getOutputPath).

${taskid} is the actual id of the individual task-attempt (e.g. task_200709221812_0001_m_000000_0), a TIP is a bunch of ${taskid}s (e.g. task_200709221812_0001_m_000000).

With speculative-execution on, one could face issues with 2 instances of the same TIP (running simultaneously) trying to open/write-to the same file (path) on hdfs. Hence the app-writer will have to pick unique names (e.g. using the complete taskid i.e. task_200709221812_0001_m_000000_0) per task-attempt, not just per TIP. (Clearly, this needs to be done even if the user doesn't create/write-to files directly via reduce tasks.)

To get around this the framework helps the application-writer out by maintaining a special ${mapred.output.dir}/_${taskid} sub-dir for each task-attempt on hdfs where the output of the reduce task-attempt goes. On successful completion of the task-attempt the files in the ${mapred.output.dir}/_${taskid} (of the successful taskid only) are moved to ${mapred.output.dir}. Of course, the framework discards the sub-directory of unsuccessful task-attempts. This is completely transparent to the application.

The application-writer can take advantage of this by creating any side-files required in ${mapred.output.dir} during execution of his reduce-task, and the framework will move them out similarly - thus you don't have to pick unique paths per task-attempt.

Fine print: the value of ${mapred.output.dir} during execution of a particular task-attempt is actually ${mapred.output.dir}/_${taskid}, not the value set by JobConf.setOutputPath. So, just create any HDFS files you want in ${mapred.output.dir} from your reduce task to take advantage of this feature.

The entire discussion holds true for maps of jobs with reducer=NONE (i.e. 0 reduces) since output of the map, in that case, goes directly to hdfs.

How to run multiple Hadoop data nodes on one machine. Although Hadoop is designed and developed for distributed computing, it can be run on a single node in pseudo-distributed mode, and even with multiple data nodes on a single machine. Developers often run multiple data nodes on a single machine to develop and test distributed features, data node behavior, and NameNode interaction with data nodes, among other reasons.

If you want to get a feel for Hadoop's distributed DataNode and NameNode working together and you have only one machine, you can run multiple data nodes on that single machine. You can see how the NameNode stores its metadata (fsimage, edits, fstime) and how the data nodes store data blocks on the local file system.

Steps

To start multiple data nodes on a single node:
1. Download the Hadoop binary or build it from the Hadoop source.
2. Prepare the Hadoop configuration to run on a single node (change the Hadoop default tmp dir location from /tmp to some other reliable location).
3. Add the following script (run-additionalDN.sh) to the $HADOOP_HOME/bin directory and chmod it to 744.
4. Format HDFS: bin/hadoop namenode -format (for Hadoop 0.20 and below), bin/hdfs namenode -format (for 0.21 and later).
5. Start HDFS with bin/start-dfs.sh (this will start the NameNode and one DataNode), which can be viewed on http://localhost:50070.
6. Start additional data nodes using bin/run-additionalDN.sh.

run-additionalDN.sh:

#!/bin/sh
# This is used for starting multiple datanodes on the same machine.
# Run it from hadoop-dir/ just like 'bin/hadoop'.
#
# Usage: run-additionalDN.sh [start|stop] dnnumber
# e.g.   run-additionalDN.sh start 2

DN_DIR_PREFIX="/path/to/store/data_and_log_of_additionalDN/"

if [ -z "$DN_DIR_PREFIX" ]; then
  echo "$0: DN_DIR_PREFIX is not set. Set it to something like /hadoopTmp/dn"
  exit 1
fi

run_datanode () {
  DN=$2
  export HADOOP_LOG_DIR=$DN_DIR_PREFIX$DN/logs
  export HADOOP_PID_DIR=$HADOOP_LOG_DIR
  # give each datanode its own tmp dir and ports (5001N, 5008N, 5002N)
  DN_CONF_OPTS="\
    -Dhadoop.tmp.dir=$DN_DIR_PREFIX$DN \
    -Ddfs.datanode.address=0.0.0.0:5001$DN \
    -Ddfs.datanode.http.address=0.0.0.0:5008$DN \
    -Ddfs.datanode.ipc.address=0.0.0.0:5002$DN"
  bin/hadoop-daemon.sh --script bin/hdfs $1 datanode $DN_CONF_OPTS
}

cmd=$1
shift

for i in $*
do
  run_datanode $cmd $i
done

Use jps or the NameNode web UI to verify that the additional data nodes have started. I started a total of 3 data nodes (2 additional data nodes) on my single-node machine, running on ports 50010, 50011, and 50012.

How to transfer data between different HDFS clusters. Problem: you have multiple Hadoop clusters running and you want to transfer several terabytes of data from one cluster to another.

Solution: DistCp (distributed copy).

It's common for Hadoop clusters to be loaded with terabytes of data (not all clusters are petabytes in size), and it would take forever to transfer terabytes of data from one cluster to another serially. Distributed, parallel copying of the data is a good solution, and that is what DistCp does: it runs a MapReduce job to transfer your data from one cluster to another.

To transfer data using DistCp you need to specify the HDFS path names of the source and destination, as shown below.

bash$ hadoop distcp hdfs://nn1:8020/foo/bar \

hdfs://nn2:8020/bar/foo

You can also specify multiple source directories on the command line:

bash$ hadoop distcp hdfs://nn1:8020/foo/a \
                    hdfs://nn1:8020/foo/b \
                    hdfs://nn2:8020/bar/foo

Or, equivalently, from a file using the -f option:

bash$ hadoop distcp -f hdfs://nn1:8020/srclist \
                       hdfs://nn2:8020/bar/foo

Where srclist contains:

hdfs://nn1:8020/foo/a
hdfs://nn1:8020/foo/b

How to build Hadoop with my custom patch? Problem: how do I build my own version of Hadoop with a custom patch applied?

Solution: apply the patch and build Hadoop.

You will need: the Hadoop source code, your custom patch, Java 6, Apache Ant, Java 5 (for generating documentation), and Apache Forrest (for generating documentation).

Steps:

Check out the Hadoop source code:

> svn co https://svn.apache.org/repos/asf/hadoop/common/tags/release-X.Y.Z-rcR

Apply your patch to check its functionality using the following command:

> patch -p0 -E < ~/Path/To/Patch.patch

Then test and compile the source code with the patch applied:

> ant -Djava5.home=/System/Library/Frameworks/JavaVM.framework/Versions/1.5/Home/ -Dforrest.home=/Path/to/forrest/apache-forrest-0.8 -Dfindbugs.home=/Path/to/findbugs/latest compile-core tar

How to build the documentation:

> ant -Dforrest.home=$FORREST_HOME -Djava5.home=$JAVA5 docs

JPS - Java Process Status tool. You are running your Java program and wondering what processes are running in the JVM. Ever wondered how to see Java processes? Use jps to view Java Virtual Machine status.

The jps tool lists the instrumented HotSpot Java Virtual Machines (JVMs) on the target system. The tool is limited to reporting information on JVMs for which it has access permissions. If jps is run without specifying a hostid, it will look for instrumented JVMs on the local host. If started with a hostid, it will look for JVMs on the indicated host, using the specified protocol and port. A jstatd process is assumed to be running on the target host.

The jps command reports the local VM identifier, or lvmid, for each instrumented JVM found on the target system. The lvmid is typically, but not necessarily, the operating system's process identifier for the JVM process. With no options, jps lists each Java application's lvmid followed by the short form of the application's class name or jar file name. The short form of the class name or JAR file name omits the class's package information or the JAR file's path information.

The jps command uses the java launcher to find the class name and arguments passed to the main method. If the target JVM is started with a custom launcher, the class name (or JAR file name) and the arguments to the main method will not be available. In this case, the jps command will output the string Unknown for the class name or JAR file name and for the arguments to the main method.

The list of JVMs produced by the jps command may be limited by the permissions granted to the principal running the command. The command will only list the JVMs for which the principal has access rights, as determined by operating-system-specific access control mechanisms.

Options - The jps command supports a number of options that modify the output of the command. These options are subject to change or removal in the future.
-q Suppress the output of the class name, JAR file name, and arguments passed to the main method, producing only a list of local VM identifiers.
-m Output the arguments passed to the main method. The output may be null for embedded JVMs.
-l Output the full package name for the application's main class or the full path name to the application's JAR file.
-v Output the arguments passed to the JVM.
-V Output the arguments passed to the JVM through the flags file (the .hotspotrc file or the file specified by the -XX:Flags= argument).
-Joption Pass option to the java launcher called by jps. For example, -J-Xms48m sets the startup memory to 48 megabytes. It is a common convention for -J to pass options to the underlying VM executing applications written in Java.
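For example, a quick usage sketch that combines two of the options above to list full main-class names together with their main-method arguments, which is handy for spotting Hadoop daemons such as NameNode and DataNode:

jps -lm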

Java clone method. To clone something is to make a duplicate of it. The clone() method in Java makes an exact duplicate of an object.

Why would someone need cloning? Java passes object references by value, which allows a called method to modify the state of an object that is passed into it. Cloning the input object before calling the method would pass a copy of the object, keeping the original safe.

Cloning is not enabled by default in classes that you write. The clone() method is protected, which means that your code cannot simply call it; only the defining class can clone its objects.

Foo f = new Foo();
Foo f2 = (Foo) f.clone();

If you try clone() without any special preparation, as in the code written above, you will encounter errors.

How to clone? You must do two things to make your class cloneable: override Object's clone() method and implement the empty Cloneable interface. Example:

public class FooClone implements Cloneable {
    public FooClone clone() throws CloneNotSupportedException {
        return (FooClone) super.clone();
    }
    // more code
}

Admin command saveNamespace. It would be useful to have an admin command that saves the current namespace. This command can be used before a regular (planned) cluster shutdown. The command saves the namespace into the storage directory(s) and resets the NameNode journal (the edits file). It also reduces NameNode startup time, because the edits do not need to be digested. saveNamespace saves the namespace image directly to disk(s); it does not need to replay the journal. Since saving the image is much faster than digesting the edits, the command can substantially reduce the overall cluster restart time.

Recommended procedure for restarting the cluster:
1. Enter safe mode.
2. Save the namespace.
3. Shut down the cluster.
4. Start the cluster.

The patch introduces a new DFSAdmin command which is called using

hadoop dfsadmin -saveNamespace

As with all other DFSAdmin commands, it requires superuser permissions. In addition, the NameNode must be in safe mode, because we don't want to allow the namespace to change during the save. In order to enter safe mode, call

hadoop dfsadmin -safemode enter

The patch also corrects 2 warnings in TestCheckpoint, and 2 Javadoc warnings in FSNamesystem.
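Putting the procedure together, a minimal sketch run as the HDFS superuser, assuming the saveNamespace command described above is available in your build and a classic $HADOOP_HOME/bin layout for the start/stop scripts:

hadoop dfsadmin -safemode enter
hadoop dfsadmin -saveNamespace
bin/stop-all.sh
bin/start-all.sh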