Hadoop MapReduce v1 and v2 Configuration Files


Hadoop Configuration

There are a handful of files for controlling the configuration of a Hadoop installation; the most important ones are listed in Table 9-1. This section covers MapReduce 1, which employs the jobtracker and tasktracker daemons. Running MapReduce 2 is substantially different, and is covered in “YARN Configuration” on page 318.

Table 9-1. Hadoop configuration files

Filename | Format | Description
hadoop-env.sh | Bash script | Environment variables that are used in the scripts to run Hadoop.
core-site.xml | Hadoop configuration XML | Configuration settings for Hadoop Core, such as I/O settings that are common to HDFS and MapReduce.
hdfs-site.xml | Hadoop configuration XML | Configuration settings for HDFS daemons: the namenode, the secondary namenode, and the datanodes.
mapred-site.xml | Hadoop configuration XML | Configuration settings for MapReduce daemons: the jobtracker and the tasktrackers.
masters | Plain text | A list of machines (one per line) that each run a secondary namenode.
slaves | Plain text | A list of machines (one per line) that each run a datanode and a tasktracker.
hadoop-metrics.properties | Java properties | Properties for controlling how metrics are published in Hadoop (see “Metrics” on page 350).
log4j.properties | Java properties | Properties for system logfiles, the namenode audit log, and the task log for the tasktracker child process (“Hadoop Logs” on page 173).

These files are all found in the conf directory of the Hadoop distribution. The configuration directory can be relocated to another part of the filesystem (outside the Hadoop installation), as long as daemons are started with the --config option (or, equivalently, with the HADOOP_CONF_DIR environment variable set) specifying the location of this directory on the local filesystem.
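As a concrete illustration of the files in Table 9-1, here is a minimal sketch of the three XML site files for a small MapReduce 1 cluster. The hostnames namenode and jobtracker and the directory paths are placeholders for this example, not recommendations:

<?xml version="1.0"?>
<!-- core-site.xml -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenode/</value>
  </property>
</configuration>

<?xml version="1.0"?>
<!-- hdfs-site.xml -->
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/disk1/hdfs/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/disk1/hdfs/data,/disk2/hdfs/data</value>
  </property>
</configuration>

<?xml version="1.0"?>
<!-- mapred-site.xml -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>jobtracker:8021</value>
  </property>
</configuration>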

YARN Configuration

YARN is the next-generation architecture for running MapReduce (and is described in “YARN (MapReduce 2)” on page 194). It has a different set of daemons and configuration options to classic MapReduce (also called MapReduce 1), and in this section we shall look at these differences and how to run MapReduce on YARN.

Under YARN you no longer run a jobtracker or tasktrackers. Instead, there is a single resource manager running on the same machine as the HDFS namenode (for small clusters) or on a dedicated machine, and node managers running on each worker node in the cluster.

The YARN start-all.sh script (in the bin directory) starts the YARN daemons in the cluster. This script will start a resource manager (on the machine the script is run on), and a node manager on each machine listed in the slaves file.

YARN also has a job history server daemon that provides users with details of past job runs, and a web app proxy server for providing a secure way for users to access the UI provided by YARN applications. In the case of MapReduce, the web UI served by the proxy provides information about the current job you are running, similar to the one described in “The MapReduce Web UI” on page 164. By default the web app proxy server runs in the same process as the resource manager, but it may be configured to run as a standalone daemon.
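For example, to run the proxy as a standalone daemon, you would give it its own address in yarn-site.xml. This sketch assumes the yarn.web-proxy.address property is the setting that triggers standalone operation; the hostname and port are placeholders, so verify both against your release:

<property>
  <name>yarn.web-proxy.address</name>
  <value>proxyhost:8089</value>
</property>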

YARN has its own set of configuration files, listed in Table 9-8; these are used in addition to those in Table 9-1.


Table 9-8. YARN configuration files

Filename | Format | Description
yarn-env.sh | Bash script | Environment variables that are used in the scripts to run YARN.
yarn-site.xml | Hadoop configuration XML | Configuration settings for YARN daemons: the resource manager, the job history server, the web app proxy server, and the node managers.

Important YARN Daemon Properties

When running MapReduce on YARN, the mapred-site.xml file is still used for general MapReduce properties, although the jobtracker- and tasktracker-related properties are not used. None of the properties in Table 9-4 are applicable to YARN, except for mapred.child.java.opts (and the related properties mapreduce.map.java.opts and mapreduce.reduce.java.opts, which apply only to map or reduce tasks, respectively). The JVM options specified in this way are used to launch the YARN child process that runs map or reduce tasks.

The configuration files in Example 9-4 show some of the important configuration properties for running MapReduce on YARN.

Example 9-4. An example set of site configuration files for running MapReduce on YARN

<?xml version="1.0"?>
<!-- mapred-site.xml -->
<configuration>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx400m</value>
    <!-- Not marked as final so jobs can include JVM debugging options -->
  </property>
</configuration>

<?xml version="1.0"?>
<!-- yarn-site.xml -->
<configuration>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>resourcemanager:8040</value>
  </property>
  <property>
    <name>yarn.nodemanager.local-dirs</name>
    <value>/disk1/nm-local-dir,/disk2/nm-local-dir</value>
    <final>true</final>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce.shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>8192</value>
  </property>
</configuration>

The YARN resource manager address is controlled via yarn.resourcemanager.address, which takes the form of a host-port pair. In a client configuration this property is used to connect to the resource manager (using RPC), and in addition the mapreduce.framework.name property must be set to yarn for the client to use YARN rather than the local job runner.
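Putting those two client-side settings together, a sketch of the corresponding entries (reusing the resourcemanager hostname from Example 9-4) might look like this:

<?xml version="1.0"?>
<!-- mapred-site.xml (client) -->
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

<?xml version="1.0"?>
<!-- yarn-site.xml (client) -->
<configuration>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>resourcemanager:8040</value>
  </property>
</configuration>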


Although YARN does not honor mapred.local.dir, it has an equivalent property called yarn.nodemanager.local-dirs, which allows you to specify which local disks to store intermediate data on. It is specified by a comma-separated list of local directory paths, which are used in a round-robin fashion.

YARN doesn’t have tasktrackers to serve map outputs to reduce tasks, so for this function it relies on shuffle handlers, which are long-running auxiliary services running in node managers. Since YARN is a general-purpose service, the shuffle handlers need to be explicitly enabled in yarn-site.xml by setting the yarn.nodemanager.aux-services property to mapreduce.shuffle.
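Because the shuffle handler is just one pluggable auxiliary service, the node manager also needs to know which class implements it. A sketch of the two yarn-site.xml entries follows; org.apache.hadoop.mapred.ShuffleHandler is the implementation class shipped with MapReduce on YARN at the time of writing, but verify the name against your release:

<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce.shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>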

Table 9-9 summarizes the important configuration properties for YARN.

Table 9-9. Important YARN daemon properties

Property name | Type | Default value | Description
yarn.resourcemanager.address | hostname and port | 0.0.0.0:8040 | The hostname and port that the resource manager’s RPC server runs on.
yarn.nodemanager.local-dirs | comma-separated directory names | /tmp/nm-local-dir | A list of directories where node managers allow containers to store intermediate data. The data is cleared out when the application ends.
yarn.nodemanager.aux-services | comma-separated service names | (none) | A list of auxiliary services run by the node manager. A service is implemented by the class defined by the property yarn.nodemanager.aux-services.service-name.class. By default no auxiliary services are specified.
yarn.nodemanager.resource.memory-mb | int | 8192 | The amount of physical memory (in MB) which may be allocated to containers being run by the node manager.
yarn.nodemanager.vmem-pmem-ratio | float | 2.1 | The ratio of virtual to physical memory for containers. Virtual memory usage may exceed the allocation by this amount.
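To make the interaction of the last two properties concrete: with the default ratio of 2.1, a container allocated 1,024 MB of physical memory may use up to 1,024 × 2.1 ≈ 2,150 MB of virtual memory before the node manager kills it. The snippet below is a sketch of the two settings for a node that reserves 16 GB of RAM for containers; the values are illustrative, not recommendations:

<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>16384</value>
</property>
<property>
  <name>yarn.nodemanager.vmem-pmem-ratio</name>
  <value>2.1</value>
</property>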