
Revolution R Enterprise™ 7

Hadoop Configuration Guide


We want our documentation to be useful, and we want it to address your needs. If you have comments on this or any Revolution document, write to [email protected].

The correct bibliographic citation for this manual is as follows: Revolution Analytics, Inc. 2015. Revolution R Enterprise 7 Hadoop Configuration Guide. Revolution Analytics, Inc., Mountain View, CA.

Revolution R Enterprise 7 Hadoop Configuration Guide

Copyright © 2015 Revolution Analytics, Inc. All rights reserved.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior written permission of Revolution Analytics.

U.S. Government Restricted Rights Notice: Use, duplication, or disclosure of this software and related documentation by the Government is subject to restrictions as set forth in subdivision (c)(1)(ii) of The Rights in Technical Data and Computer Software clause at 52.227-7013.

Revolution R, Revolution R Enterprise, RPE, RevoScaleR, DeployR, RevoTreeView, and Revolution Analytics are trademarks of Revolution Analytics.

Revolution R includes the Intel® Math Kernel Library (https://software.intel.com/en-us/intel-mkl).

RevoScaleR includes Stat/Transfer software under license from Circle Systems, Inc. Stat/Transfer is a trademark of Circle Systems, Inc.

Other product names mentioned herein are used for identification purposes only and may be trademarks of their respective owners.

Revolution Analytics

2570 West El Camino Real

Suite 222

Mountain View, CA 94040

U.S.A.

Revised on August 20, 2015


Table of Contents

1 Introduction
1.1 System Requirements
1.2 Basic Hadoop Terminology
1.3 Verifying the Hadoop Installation
1.4 Adjusting Hadoop Memory Limits (Hadoop 2.x Systems Only)
2 Hadoop Security with Kerberos Authentication
3 Installing Revolution R Enterprise on a Cluster
3.1 Standard Command Line Install
3.2 Distributed Installation with RevoMPM
3.3 Installing the Revolution R Enterprise JAR File
3.4 Environment Variables for Hadoop
3.5 Creating Directories for Revolution R Enterprise
3.6 Installing on a Cloudera Manager System Using a Cloudera Manager Parcel
4 Verifying Installation
5 Troubleshooting the Installation
5.1 No Valid Credentials
5.2 Unable to Load Class RevoScaleR
5.3 Classpath Errors
5.4 Unable to Load Shared Library
6 Getting Started with Hadoop
7 Using HDFS Caching
8 Creating an R Package Parcel for Cloudera Manager


1 Introduction

Revolution R Enterprise is a scalable data analytics solution designed to work seamlessly whether your computing environment is a single-user workstation, a local network of connected servers, or a cluster in the cloud. This manual is intended for those who need to configure a Hadoop cluster for use with Revolution R Enterprise.

This manual assumes that you have download instructions for Revolution R Enterprise and its related files; if you do not have those instructions, contact Revolution Analytics Technical Support for assistance.

1.1 System Requirements

Revolution R Enterprise works with the following Hadoop distributions:

Cloudera CDH 5.0, 5.1, 5.2, 5.3

HortonWorks HDP 1.3.0, HDP 2.0.0, HDP 2.1.0, HDP 2.2.0

MapR 3.0.2, MapR 3.0.3, MapR 3.1.0, MapR 3.1.1, MapR 4.0.1, MapR 4.0.2 (provided this version of MapR has been updated to mapr-patch-4.0.2.29870.GA-30600; contact MapR to obtain the patch)

Your cluster installation must include the C APIs contained in the libhdfs package; these are required by Revolution R Enterprise. See your Hadoop documentation for information on installing this package. The Hadoop distribution must be installed on Red Hat Enterprise Linux 5 or 6, or a fully compatible operating system. Revolution R Enterprise should be installed on all nodes of the cluster.

Revolution R Enterprise requires Hadoop MapReduce and the Hadoop Distributed File System (HDFS) on HDP 1.3.0 and MapR 3.x installations, or HDFS, Hadoop YARN, and Hadoop MapReduce2 on CDH5, HDP 2.x, and MapR 4.0.x installations. The HDFS, YARN, and MapReduce clients must be installed on all nodes on which you plan to run Revolution R Enterprise, as must Revolution R Enterprise itself.

Minimum system configuration requirements for Revolution R Enterprise are as follows:

Processor: 64-bit CPU with x86-compatible architecture (variously known as AMD64, Intel64, x86-64, IA-32e, EM64T, or x64 CPUs). Itanium-architecture CPUs (also known as IA-64) are not supported. Multiple-core CPUs are recommended.

Operating System: Red Hat Enterprise Linux 6.0, 6.1, 6.2, 6.3, 6.4, 6.5, or 6.6. Only 64-bit operating systems are supported. (For HDP 1.3.0 systems only, RHEL 5.x operating systems are also supported.)

Memory: A minimum of 4GB of RAM is required for Revolution R Enterprise; 8GB or more are recommended. Hadoop itself has substantial memory requirements; see your Hadoop distribution’s documentation for specific recommendations.


Disk Space: A minimum of 500MB of disk space is required on each node for RRE installation. Hadoop itself has substantial disk space requirements; see your Hadoop distribution’s documentation for specific recommendations.

Package Dependencies: Revolution R Enterprise, like most Linux applications, depends on a number of Linux packages. A few of these, listed in Table 1, are explicitly required by Revolution R Enterprise; the remainder are secondary dependencies required by those packages in turn. The secondary dependencies, listed in Table 2, are installed automatically by the install script.

Table 1. Packages Explicitly Required by Revolution R Enterprise

ed cairo-devel

tk-devel make

gcc-objc gcc-c++

readline-devel libtiff-devel

ncurses-devel pango-devel

perl texinfo

libgfortran pango

libicu libjpeg*-devel

ghostscript-fonts gcc-gfortran

libSM-devel libicu-devel

libXmu-devel bzip2-devel

Table 2. Secondary Dependencies Installed for Revolution R Enterprise

cloog-ppl cpp

fontconfig-devel freetype

freetype-devel gcc

glib2-devel libICE-devel

libobjc libpng-devel

libstdc++-devel libX11-devel

libXau-devel libxcb-devel

libXext-devel libXft-devel

libXmu libXrender-devel

libXt-devel mpfr

pixman-devel ppl

glibc-headers tcl

gmp tcl-devel

kernel-headers tk

xorg-x11-proto-devel zlib-devel


1.2 Basic Hadoop Terminology

The following terms apply to computers and services within the Hadoop cluster, and define the roles of hosts within the cluster:

Hadoop 1.x Installations (HDP 1.3.0, MapR 3.x)

JobTracker: The Hadoop service that distributes MapReduce tasks to specific nodes in the cluster. The JobTracker queries the NameNode to find the location of the data needed for the tasks, then distributes the tasks to TaskTracker nodes near (or co-extensive with) the data. For small clusters, the JobTracker may be running on the NameNode, but this is not recommended for production use.

NameNode: A host in the cluster that is the master node of the HDFS file system, managing the directory tree of all files in the file system. In small clusters, the NameNode may host the JobTracker, but this is not recommended for production use.

TaskTracker: Any host that can accept tasks (Map, Reduce, and Shuffle operations) from a JobTracker. TaskTrackers are usually, but not always, also DataNodes, so that tasks assigned to the TaskTracker can work on data on the same node.

DataNode: A host that stores data in the Hadoop Distributed File System. DataNodes connect to the NameNode and respond to requests from the NameNode for file system operations.

Hadoop 2.x Installations (CDH5, HDP 2.x, MapR 4.0.x)

Resource Manager: The Hadoop service that distributes MapReduce and other Hadoop tasks to specific nodes in the cluster. The Resource Manager takes over the scheduling functions of the old JobTracker, determining which nodes are appropriate for the current job.

NameNode: A host in the cluster that is the master node of the HDFS file system, managing the directory tree of all files in the file system.

Application Master: New in MapReduce2/YARN, the application master takes over the task progress coordination from the old JobTracker, working with node managers on the individual task nodes. The application master negotiates with the Resource Manager for cluster resources, which are allocated as a set of containers, with each container running an application-specific task on a particular node.

NodeManager: Node managers manage the containers allocated for a given task on a given node, coordinating with the Resource Manager and the Application Masters. NodeManagers are usually, but not always, also DataNodes, and most frequently the containers on a given node are working with data on the same node.

DataNode: A host that stores data in the Hadoop Distributed File System. DataNodes connect to the NameNode and respond to requests from the NameNode for file system operations.


1.3 Verifying the Hadoop Installation

We assume you have already installed Hadoop on your cluster. If not, use the documentation provided with your Hadoop distribution to help you perform the installation; Hadoop installation is complicated and involves many steps, and while following the documentation carefully does not guarantee success, it does make troubleshooting easier. In our testing, we have found the following documents helpful:

Cloudera CDH5, package install

Cloudera CDH5, Cloudera Manager parcel install

Hortonworks HDP 1.3

Hortonworks HDP 2.1

Hortonworks HDP 1.x or 2.x, Ambari install

MapR 3.x install

MapR 4.0.2 (M5 Edition)

If you are using Cloudera Manager, it is important to know whether your installation was done via packages or parcels; the Revolution R Enterprise Cloudera Manager parcel can be used only with parcel installs. If you have installed Cloudera Manager via packages, do not attempt to use the RRE Cloudera Manager parcel; use the standard Revolution R Enterprise for Linux installer instead.

It is useful to confirm that Hadoop itself is running correctly before attempting to install Revolution R Enterprise on the cluster. Hadoop comes with example programs that you can run to verify that your Hadoop installation is running properly, in the jar file hadoop-mapreduce-examples.jar. The following command should display a list of the available examples:

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar

(On MapR, the quick installation installs the Hadoop files to /opt/mapr by default; the path to the examples jar file is /opt/mapr/hadoop/hadoop-0.20.2/hadoop-0.20.2-dev-examples.jar. Similarly, on Cloudera Manager parcel installs, the default path to the examples is /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce-examples.jar.)

The following runs the pi example, which uses Monte Carlo sampling to estimate pi; the 5 tells Hadoop to use 5 mappers, and the 300 tells it to use 300 samples per map:

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar pi 5 300

If you can successfully run one or more of the Hadoop examples, your Hadoop installation was successful and you are ready to install Revolution R Enterprise.
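Beyond the MapReduce examples, a quick HDFS round trip confirms that the file system itself is reachable and writable. A minimal check (the /tmp/rre-check path is arbitrary, and the -p and -rm -r options assume a Hadoop 2.x client):

hadoop fs -mkdir -p /tmp/rre-check
hadoop fs -put /etc/hosts /tmp/rre-check
hadoop fs -ls /tmp/rre-check
hadoop fs -rm -r /tmp/rre-check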

1.4 Adjusting Hadoop Memory Limits (Hadoop 2.x Systems Only)

On YARN-based Hadoop systems (CDH5, HDP 2.x, MapR 4.0.x), we have found that the default settings for Map and Reduce memory limits are inadequate for large RevoScaleR jobs. The memory available for R is the difference between the container’s memory limit and the memory given to the Java Virtual Machine. To allow large RevoScaleR jobs to run, modify four properties in mapred-site.xml and one in yarn-site.xml, as follows (these files are typically found in /etc/hadoop/conf):

(in mapred-site.xml)

<property>
  <name>mapreduce.map.memory.mb</name>
  <value>2048</value>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>2048</value>
</property>
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx1229m</value>
</property>
<property>
  <name>mapreduce.reduce.java.opts</name>
  <value>-Xmx1229m</value>
</property>

(in yarn-site.xml)

<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>3198</value>
</property>

If you are using a cluster manager such as Cloudera Manager or Ambari, these settings must usually be modified using the Web interface.
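As a worked example of the arithmetic involved: with the values above, each map or reduce container is limited to 2048 MB, of which 1229 MB goes to the JVM heap, leaving roughly 2048 - 1229 = 819 MB for the R process. The yarn.nodemanager.resource.memory-mb value of 3198 MB caps the total container memory YARN will allocate on each node.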

2 Hadoop Security with Kerberos Authentication

By default, most Hadoop configurations are relatively insecure. Security features such as SELinux and iptables firewalls are often turned off to help get the Hadoop cluster up and running quickly. However, the Cloudera and Hortonworks distributions of Hadoop support Kerberos authentication, which allows Hadoop to operate in a much more secure manner. To use Kerberos authentication with your particular version of Hadoop, see one of the following documents:

Cloudera CDH5

Cloudera CDH5 with Cloudera Manager 5

Hortonworks HDP 1.3

Hortonworks HDP 2.x

Hortonworks HDP (1.3 or 2.x) with Ambari

If you have trouble restarting your Hadoop cluster after enabling Kerberos authentication, the problem is most likely with your keytab files. Be sure you have created all the required Kerberos principals and generated appropriate keytab entries for all of your nodes, and that the keytab files have been located correctly with the appropriate permissions. (We have found that in Hortonworks clusters managed with Ambari, it is important that the spnego.service.keytab file be present on all the nodes of the cluster, not just the name node and secondary namenode.)
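If you need to inspect a keytab, the klist command can list its entries; a minimal sketch, assuming the typical Hortonworks keytab location (adjust the path for your cluster):

klist -kt /etc/security/keytabs/spnego.service.keytab

Each node’s keytab should list the principals you expect for that host.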

The MapR distribution also supports Kerberos authentication, but most MapR installations use that distribution’s wire-level security feature. See the MapR Security Guide for details.


3 Installing Revolution R Enterprise on a Cluster

It is highly recommended that you install Revolution R Enterprise as root on each node of your Hadoop cluster. This ensures that all users have access to it by default. Non-root installs are supported, but require that the path to the R executable files be added to each user’s PATH.

If you are installing on a Cloudera Manager system using a parcel install, skip to Section 3.6, Installing on a Cloudera Manager System Using a Cloudera Manager Parcel.

3.1 Standard Command Line Install

For most users, installing on the cluster means simply running the standard Revolution R Enterprise installers on each node of the cluster:

1. Log in as root or a user with sudo privileges. If the latter, precede commands requiring root privileges with sudo. (If you do not have root access or sudo privileges, you can install as a non-root user. See Section 3.2 for details.)

2. Make sure the system repositories are up to date prior to installing Revolution R Open:

sudo yum clean all

3. Download the Revolution R Open tarball.

4. Change to the directory to which you downloaded the tarball (for example, /tmp):

cd /tmp

5. Unzip the contents of the RRO installer tarball (the asterisk will be 5 or 6, depending on your RHEL version):

tar xvzf RRO-8.0.3-el*.x86_64.tar.gz

6. Change to the RRO-8.0.3 directory:

cd RRO-8.0.3

7. Run the RRO install script:

./install.sh

8. Download and unpack the Revolution R Connector tarball, then run the installer script, as follows (the tarball name may include an operating system ID denoted below by <OS>; the complete name of the tarball will be in your download letter):

tar xvzf Revolution-R-Connector-7.4.0-<OS>.tar.gz

pushd rrconn

./install_rr_conn.sh -y -p /usr/lib64/RRO-8.0.3/R-3.1.3

popd

Page 11: Revolution R Enterprise™ 7 Hadoop Configuration Guide

Installing Revolution R Enterprise on a Cluster 7

9. Download and unpack the Revolution R Enterprise tarball, then run the installer script, as follows (the tarball name may include an operating system ID denoted below by <OS>; the complete name of the tarball will be in your download letter):

tar xvzf Revolution-R-Enterprise-7.4.0-<OS>.tar.gz

pushd rrent

./install_rr_ent.sh -a -y -p /usr/lib64/RRO-8.0.3/R-3.1.3

popd

This installs Revolution R Enterprise with the standard options (including loading the rpart and lattice packages by default when RevoScaleR is loaded).
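As a quick sanity check of the install, you can start Revolution R Enterprise non-interactively and confirm that RevoScaleR loads; a sketch, assuming Revo64 passes standard R command-line flags through to R:

Revo64 -q -e "library(RevoScaleR); sessionInfo()"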

3.2 Distributed Installation with RevoMPM

If your Hadoop cluster is configured to allow passwordless-ssh access among the various nodes, you can use the Revolution Multi-Node Package Manager (RevoMPM) to deploy Revolution R Enterprise across your cluster. RevoMPM is a lightweight wrapper around the Python fabric package.

On any one node of the cluster, create a directory for the installer, such as /var/tmp/revo-install, and download the following files to that directory (you can find the links in your welcome e-mail):

RevoMPM-0.3-5-*.x86_64.rpm

RRO-8.0.3-*.x86_64.tar.gz

Revolution-R-Connector-7.4.1-*.tar.gz

Revolution-R-Enterprise-7.4.1-*.tar.gz

(The “*” in the above file names represents an operating system indicator, such as RHEL5 or RHEL6.)

For best results, install RevoMPM as root. You install RevoMPM directly from the rpm as follows:

rpm -i RevoMPM-*.x86_64.rpm

When run as root, this installs RevoMPM to the /opt/RevoMPM directory. Add that directory’s bin subdirectory (/opt/RevoMPM/bin) to your system PATH variable. To ensure ready access to your nodes via RevoMPM, edit the file /opt/RevoMPM/hosts.cfg to list the nodes in your cluster. The host configuration file follows the standard Python config file format. Only one section, groups, is required in this config file, e.g.

[groups]
nodes = ip-10-0-0-132
    ip-10-0-0-133
    ip-10-0-0-134
    ip-10-0-0-135
    ip-10-0-0-136

Page 12: Revolution R Enterprise™ 7 Hadoop Configuration Guide

8 Installing Revolution R Enterprise on a Cluster

Note the four spaces of indentation for continuation lines; if this is missing, the underlying Python interpreter will report a parsing error. (Any consistent number of spaces or tabs can be used; four spaces is the Python standard.)
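Before distributing the installers, it is worth confirming that RevoMPM can reach every node in the nodes group; a minimal check using the same cmd: syntax as the installation commands below:

revompm cmd:"hostname"

Each node listed in hosts.cfg should report its host name.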

Issue the following commands to distribute and install Revolution R Enterprise (each revompm command must be entered on a single logical line):

revompm cmd:"mkdir -p /var/tmp/revo-install"

revompm dcp:/var/tmp/revo-install/RRO-8.0.3-*.x86_64.tar.gz

revompm cmd:"cd /var/tmp/revo-install;tar -xzf RRO-8.0.3-*.x86_64.tar.gz"

revompm cmd:"yum clean all"

revompm cmd:"cd /var/tmp/revo-install/RRO-8.0.3;./install.sh"

revompm dcp:/var/tmp/revo-install/Revolution-R-Connector-7.4.1-<OS>.tar.gz

revompm cmd:"cd /var/tmp/revo-install;tar zxf Revolution-R-Connector-*"

revompm cmd:"cd /var/tmp/revo-install/rrconn;./install_rr_conn.sh -y -p /usr/lib64/RRO-

8.0.3/R-3.1.3"

revompm dcp:/var/tmp/revo-install/Revolution-R-Enterprise-7.4.1-<OS>.tar.gz

revompm cmd:"cd /var/tmp/revo-install;tar zxf Revolution-R-Enterprise-*"

revompm cmd:"cd /var/tmp/revo-install/rrent;./install_rr_ent.sh -y -a -p /usr/lib64/RRO-

8.0.3/R-3.1.3"

For complete instructions on installing and running RevoMPM (including instructions for installing as a non-root user), see the RevoMPM User’s Guide.

3.3 Installing the Revolution R Enterprise JAR File

Using Revolution R Enterprise in Hadoop requires the presence of the Revolution R Enterprise Java Archive (JAR) file scaleR-hadoop-0.1-SNAPSHOT.jar. This file is installed in the scripts directory of your Revolution R Enterprise installation (typically at /usr/lib64/Revo-7.4/scripts), and is typically linked to the standard Hadoop jar file location (typically $HADOOP_HOME/lib or $HADOOP_PREFIX/lib).

If you are installing RRE as a non-root user, you may need to obtain root access to link this file appropriately.

3.4 Environment Variables for Hadoop

The file RevoHadoopEnvVars.site in the scripts directory of your Revolution R Enterprise installation (typically at /usr/lib64/Revo-7.4/scripts) should be sourced by all users; add the following line to each user’s .bash_profile file:

. /usr/lib64/Revo-7.4/scripts/RevoHadoopEnvVars.site

(The period (“.”) at the beginning is part of the command, and must be included.)

This file sets the following environment variables for use by Revolution R Enterprise:

HADOOP_HOME: This should be set to the directory containing the Hadoop files.

HADOOP_CMD: This should be set to the command used to invoke Hadoop.

Page 13: Revolution R Enterprise™ 7 Hadoop Configuration Guide

Installing Revolution R Enterprise on a Cluster 9

HADOOP_CLASSPATH: This should be set to include the full path to the RRE jar files (typically /usr/lib64/Revo-7.4/scripts).

CLASSPATH: This should be a fully expanded CLASSPATH with access to all required Hadoop JAR files.

JAVA_LIBRARY_PATH: If necessary, this should be set to include the paths to the directories containing the Hadoop shared libraries.

HADOOP_STREAMING: This should be set to the path of the Hadoop streaming jar file.

These environment variables are written to the file automatically on installation, but can be edited by hand if necessary.
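After sourcing the file, you can spot-check a few of the variables from the shell; for example:

. /usr/lib64/Revo-7.4/scripts/RevoHadoopEnvVars.site
echo $HADOOP_CMD
echo $HADOOP_STREAMING

If either variable is empty or points to the wrong location, edit RevoHadoopEnvVars.site before proceeding.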

3.5 Creating Directories for Revolution R Enterprise

Each user should ensure that the appropriate user directories exist, and if necessary, create them with the following commands:

hadoop fs -mkdir /user/RevoShare/$USER

hadoop fs -chmod uog+rwx /user/RevoShare/$USER

mkdir -p /var/RevoShare/$USER

chmod uog+rwx /var/RevoShare/$USER

The HDFS directory can also be created in a user’s R session (provided the top-level /user/RevoShare has the appropriate permissions) using the following RevoScaleR commands (substitute your actual user name for “username”):

rxHadoopMakeDir("/user/RevoShare/username")

rxHadoopCommand("fs -chmod uog+rwx /user/RevoShare/username")

3.6 Installing on a Cloudera Manager System Using a Cloudera Manager Parcel

If you are running a Cloudera Hadoop cluster managed by Cloudera Manager, and if Cloudera itself was installed via a Cloudera Manager parcel, you can use the Revolution R Enterprise Cloudera Manager parcels to install Revolution R Enterprise on all the nodes of your cluster. Three parcels are required:

Revolution R Open parcel—installs open-source R on the nodes of your Cloudera cluster

Revolution R Connector parcel—installs open-source Revolution components on the nodes of your Cloudera cluster

Revolution R Enterprise parcel—installs proprietary Revolution components on the nodes of your Cloudera cluster

Revolution R Enterprise requires several packages that may not be in a default Red Hat Enterprise Linux installation; run the following yum command as root to install them:

yum install gcc-gfortran cairo-devel python-devel \

tk-devel libicu-devel

Page 14: Revolution R Enterprise™ 7 Hadoop Configuration Guide

10 Installing Revolution R Enterprise on a Cluster

Run this command on all the nodes of your cluster that will be running Revolution R Enterprise. If you have installed RevoMPM, you can distribute the command using RevoMPM’s cmd command:

revompm cmd:"yum install gcc-gfortran cairo-devel python-devel tk-devel libicu-devel"

Once you have installed the Revolution R Enterprise prerequisites, install the Cloudera Manager parcels as follows:

1. Download the Revolution R Enterprise Cloudera Manager Parcels using the links provided in your welcome e-mail. (Note that each parcel consists of two files, the parcel itself and its associated .sha file. They may be packaged as a single .tar.gz file for convenience in downloading, but that must be unpacked and the two files copied to the parcel-repo for Cloudera Manager to recognize them as a parcel.)

2. Copy the parcel files to your local parcel-repo, typically /opt/cloudera/parcel-repo. You should have the following files in your parcel repo:

RRO-8.0.3-1-el6.parcel
RRO-8.0.3-1-el6.parcel.sha
RevolutionR-7.4.1-1-el6.parcel
RevolutionR-7.4.1-1-el6.parcel.sha
RRE-7.4.1-1-el6.parcel
RRE-7.4.1-1-el6.parcel.sha

Be sure all the files are owned by root and have 755 permissions (that is, read, write, and execute permission for root, and read and execute permissions for group and others).

3. In your browser, open Cloudera Manager.

4. Click Hosts in the upper navigation bar to bring up the All Hosts page.

5. Click Parcels to bring up the Parcels page.

6. Click Check for New Parcels. RRO 8.0.3, RevolutionR 7.4.1-1, and RRE 7.4.1-1 should each appear with a Distribute button. (After clicking Check for New Parcels, you may need to click All Clusters under the Location section on the left to see the new parcels.)

7. Click the RRO 8.0.3 Distribute button. Revolution R Open will be distributed to all the nodes of your cluster. When the distribution is complete, the Distribute button is replaced with an Activate button.

8. Click Activate. Activation prepares Revolution R Open to be used by the cluster.

9. Click the Revolution R 7.4.1 Distribute button. The Revolution R Connector will be distributed to all the nodes of your cluster. When the distribution is complete, the Distribute button is replaced with an Activate button.

10. Click Activate. Activation prepares the Revolution R Connector to be used by the cluster.

11. Click the RRE 7.4.1-1 Distribute button. Revolution R Enterprise will be distributed to all the nodes of your cluster. When the distribution is complete, the Distribute button is replaced with an Activate button.

12. Click Activate. Activation prepares Revolution R Enterprise to be used by the cluster.


When you have installed the three parcels, download, install, and run the Revolution Custom Service Descriptor as follows:

1. Download the Custom Service Descriptor from the links in your welcome e-mail to the Cloudera CSD directory, typically /opt/cloudera/csd.

2. Stop and restart the cloudera-scm-server service using the following shell commands:

service cloudera-scm-server stop

service cloudera-scm-server start

3. Confirm the CSD is installed by checking the Custom Service Descriptor list in Cloudera Manager at <hostname>/cmf/csd/list, where <hostname> is the host name of your Cloudera Manager server.

4. On the Cloudera Manager home page, click the dropdown beside the cluster name and click Add a Service.

5. From the Add Service Wizard, select Revolution R and click Continue.

6. Select all hosts, and click Continue.

7. Accept defaults through the remainder of the wizard.

Each user should ensure that the appropriate user directories exist, and if necessary, create them with the following shell commands:

hadoop fs -mkdir /user/RevoShare/$USER

hadoop fs -chmod uog+rwx /user/RevoShare/$USER

mkdir -p /var/RevoShare/$USER

chmod uog+rwx /var/RevoShare/$USER

The HDFS directory can also be created in a user’s R session (provided the top-level /user/RevoShare has the appropriate permissions) using the following RevoScaleR commands (substitute your actual user name for “username”):

rxHadoopMakeDir("/user/RevoShare/username")

rxHadoopCommand("fs -chmod uog+rwx /user/RevoShare/username")

As part of this process, make sure to check that the base directories /user and /user/RevoShare have uog+rwx permissions as well.
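A sketch of checking (and, if necessary, opening up) those base directories, run as a user with HDFS superuser rights (typically hdfs; the -ls -d form assumes a Hadoop 2.x client):

hadoop fs -ls -d /user /user/RevoShare
hadoop fs -chmod uog+rwx /user /user/RevoShare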

4 Verifying Installation

After completing installation, do the following to verify that Revolution R Enterprise will actually run commands in Hadoop:

1. If the cluster is security-enabled, obtain a ticket using kinit (for Kerberos authentication) or maprlogin password (for MapR wire-level security).

2. Start Revolution R Enterprise on a cluster node by typing Revo64 at a shell prompt.


3. At the R prompt “> “, enter the following commands. (These commands are drawn from the RevoScaleR Hadoop Getting Started Guide, which explains what each of them does; for now, we are just checking that everything works.)

bigDataDirRoot <- "/share"
myHadoopCluster <- RxHadoopMR(consoleOutput=TRUE)
rxSetComputeContext(myHadoopCluster)
source <- system.file("SampleData/AirlineDemoSmall.csv", package="RevoScaleR")
inputDir <- file.path(bigDataDirRoot, "AirlineDemoSmall")
rxHadoopMakeDir(inputDir)
rxHadoopCopyFromLocal(source, inputDir)
hdfsFS <- RxHdfsFileSystem()
colInfo <- list(DayOfWeek = list(type = "factor",
    levels = c("Monday", "Tuesday", "Wednesday", "Thursday",
               "Friday", "Saturday", "Sunday")))
airDS <- RxTextData(file = inputDir, missingValueString = "M",
    colInfo = colInfo, fileSystem = hdfsFS)
adsSummary <- rxSummary(~ArrDelay + CRSDepTime + DayOfWeek, data = airDS)
adsSummary

If you installed Revolution R Enterprise in a non-default location, you must specify the location using both the hadoopRPath and revoPath arguments to RxHadoopMR:

myHadoopCluster <- RxHadoopMR(hadoopRPath="/path/to/Revo64",
    revoPath="/path/to/Revo64")

If you see the following, congratulations:

Call:
rxSummary(formula = ~ArrDelay + CRSDepTime + DayOfWeek, data = airDS)

Summary Statistics Results for: ~ArrDelay + CRSDepTime + DayOfWeek
Data: airDS (RxTextData Data Source)
File name: /share/AirlineDemoSmall
Number of valid observations: 6e+05

 Name       Mean     StdDev    Min        Max        ValidObs MissingObs
 ArrDelay   11.31794 40.688536 -86.000000 1490.00000 582628   17372
 CRSDepTime 13.48227  4.697566   0.016667   23.98333 600000       0

Category Counts for DayOfWeek
Number of categories: 7
Number of valid observations: 6e+05
Number of missing observations: 0

 DayOfWeek Counts
 Monday    97975
 Tuesday   77725
 Wednesday 78875
 Thursday  81304
 Friday    82987
 Saturday  86159
 Sunday    94975

Next try to run a simple rxExec job:

rxExec(list.files)

That should return a list of files in the native file system. If either the call to rxSummary or the call to rxExec results in an error, see Section 5, Troubleshooting the Installation, for a few of the more common errors and how to fix them.

5 Troubleshooting the Installation

No two Hadoop installations are exactly alike, but most are quite similar. This section brings together a number of common errors seen when attempting to run Revolution R Enterprise commands on Hadoop clusters, together with their most likely causes, based on our experience.

5.1 No Valid Credentials

If you see a message such as “No valid credentials provided”, this means you do not have a valid Kerberos ticket. Quit Revolution R Enterprise, obtain a Kerberos ticket using kinit, and then restart Revolution R Enterprise.

5.2 Unable to Load Class RevoScaleR

If you see a message about being unable to find or load main class RevoScaleR, this means that the jar file scaleR-hadoop-0.1-SNAPSHOT.jar could not be found. This jar file must be in a location where it can be found by the getHadoopEnvVars.py script, or its location must be explicitly added to the CLASSPATH.
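If the jar is in the default location, appending it to the CLASSPATH by hand is one way to confirm that a missing classpath entry is the problem; a sketch, assuming the default install directory:

export CLASSPATH=$CLASSPATH:/usr/lib64/Revo-7.4/scripts/scaleR-hadoop-0.1-SNAPSHOT.jar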

5.3 Classpath Errors

If you see other errors related to Java classes, these are likely related to the settings of the following environment variables:

PATH

CLASSPATH

JAVA_LIBRARY_PATH

Of these, the most commonly misconfigured is the CLASSPATH.
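Printing the three variables before starting Revolution R Enterprise is a quick first diagnostic:

echo $PATH
echo $CLASSPATH
echo $JAVA_LIBRARY_PATH

Compare the output against the values written to RevoHadoopEnvVars.site (see Section 3.4).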

5.4 Unable to Load Shared Library

If you see a message about being unable to load libhdfs.so, you may need to create a symbolic link from your installed version of libhdfs.so to the system library, such as the following:


ln -s /path/to/libhdfs.so /usr/lib64/libhdfs.so

Or, update your LD_LIBRARY_PATH environment variable to include the directory containing the libhdfs shared object:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/path/to/libhdfs/directory

(This step is normally performed automatically during the RRE install. If you continue to see errors about libhdfs.so, you may need to both create the symbolic link as above and set LD_LIBRARY_PATH.)

Similarly, if you see a message about being unable to load libjvm.so, you may need to create a symbolic link from your installed version of libjvm.so to the system library, such as the following:

ln -s /path/to/libjvm.so /usr/lib64/libjvm.so

Or, update your LD_LIBRARY_PATH environment variable to include the directory containing the libjvm shared object:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/path/to/libjvm/directory

6 Getting Started with Hadoop

To get started with Revolution R Enterprise on Hadoop, we recommend the RevoScaleR 7 Hadoop Getting Started Guide PDF. This provides a tutorial introduction to using RevoScaleR with Hadoop.

7 Using HDFS Caching

HDFS caching, more formally centralized cache management in HDFS, can greatly improve the performance of your Hadoop jobs by keeping frequently used data in memory. You enable HDFS caching on a path-by-path basis, first by creating a pool of cached paths, and then by adding paths to the pool.

The HDFS command cacheadmin is used to perform these tasks. This command should be run by the hdfs user (the mapr user on MapR installations). The cacheadmin command has many subcommands, which the Apache Software Foundation documents completely; to get started, the addPool and addDirective subcommands suffice.

For example, to specify HDFS caching for our /share/AirlineDemoSmall directory, we can first create a pool as follows:

hdfs cacheadmin -addPool rrePool


You can then add the path to /share/AirlineDemoSmall to the pool with an addDirective command as follows:

hdfs cacheadmin -addDirective -path /share/AirlineDemoSmall -pool rrePool
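You can confirm that the pool and directive were created with the corresponding list subcommands (the IDs shown in the output will vary):

hdfs cacheadmin -listPools
hdfs cacheadmin -listDirectives -pool rrePool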

8 Creating an R Package Parcel for Cloudera Manager

If you are using Cloudera Manager to manage your Cloudera Hadoop cluster, you can use the Revolution R Enterprise Parcel Generator to create a Cloudera Manager parcel containing additional R packages, and use the resulting parcel to distribute those packages across all the nodes of your cluster.

The Revolution R Enterprise Parcel Generator is a Python script that takes a library of R packages and creates a Cloudera Manager parcel that excludes any base or recommended packages, or packages included with the standard Revolution R Enterprise distribution. Make sure to consider any dependencies your packages might have and be sure to include those in your library. If you installed Revolution R Enterprise with Cloudera Manager parcels, you will find the Parcel Generator in the Revo.home()/scripts directory. (You may need to ensure that the script has execute permission using the chmod command, or you can call it as “python generate_r_parcel.py”.)

When you call the script, you must provide a name and a version number for the resulting parcel, together with the path to the library you would like to package. When choosing a name for your parcel, be sure to pick a name that is unique in your parcel repository (typically /opt/cloudera/parcel-repo). For example, to package the library /home/RevoUser/R/library, you might call the script as follows:

generate_r_parcel.py -p "RevoUserPkgs" -v "0.1" -l /home/RevoUser/R/library

By default, the path to the library you package should be the same as the path to the library on the Hadoop cluster. You can specify a different destination using the -d flag:

generate_r_parcel.py -p "RevoUserPkgs" -v "0.1" \
    -l /home/RevoUser/R/library -d /var/RevoShare/RevoUser/library
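Note that the library must be populated before you run the Parcel Generator; packages can be installed into it from within R. A minimal sketch (the package name here is only a placeholder; dependencies = TRUE also installs the package’s dependencies into the library):

install.packages("data.table", lib = "/home/RevoUser/R/library",
                 repos = "https://cran.r-project.org", dependencies = TRUE)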

To distribute and activate your parcel, perform the following steps:

1. Copy or move your .parcel and .sha files to the parcel repository on your Cloudera cluster (typically, /opt/cloudera/parcel-repo).

2. Ensure that the .parcel and .sha files are owned by root and have 755 permissions (that is, read, write, and execute permission for root, and read and execute permissions for group and others).

3. In your browser, open Cloudera Manager.

4. Click Hosts in the upper navigation bar to bring up the All Hosts page.

5. Click Parcels to bring up the Parcels page.

6. Click Check for New Parcels. Your new parcel should appear with a Distribute button. (After clicking Check for New Parcels, you may need to click All Clusters under the Location section on the left to see the new parcel.)


7. Click the Distribute button for your parcel. The parcel will be distributed to all the nodes of your cluster. When the distribution is complete, the Distribute button is replaced with an Activate button.

8. Click Activate. Activation prepares your parcel to be used by the cluster.

After your parcel is distributed and activated, your R packages should be present in the libraries on each node and can be loaded in your next R session.