sas® and hadoop technology: deployment scenarios

28
SAS ® and Hadoop Technology: Deployment Scenarios SAS ® Documentation September 13, 2017

Upload: others

Post on 16-Oct-2021

8 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: SAS® and Hadoop Technology: Deployment Scenarios

SAS® and Hadoop Technology: Deployment Scenarios

SAS® DocumentationSeptember 13, 2017

Page 2: SAS® and Hadoop Technology: Deployment Scenarios

The correct bibliographic citation for this manual is as follows: SAS Institute Inc. 2015. SAS® and Hadoop Technology: Deployment Scenarios. Cary, NC: SAS Institute Inc.

SAS® and Hadoop Technology: Deployment Scenarios

Copyright © 2015, SAS Institute Inc., Cary, NC, USA

All Rights Reserved. Produced in the United States of America.

For a hard copy book: No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, or otherwise, without the prior written permission of the publisher, SAS Institute Inc.

For a web download or e-book: Your use of this publication shall be governed by the terms established by the vendor at the time you acquire this publication.

The scanning, uploading, and distribution of this book via the Internet or any other means without the permission of the publisher is illegal and punishable by law. Please purchase only authorized electronic editions and do not participate in or encourage electronic piracy of copyrighted materials. Your support of others' rights is appreciated.

U.S. Government License Rights; Restricted Rights: The Software and its documentation is commercial computer software developed at private expense and is provided with RESTRICTED RIGHTS to the United States Government. Use, duplication, or disclosure of the Software by the United States Government is subject to the license terms of this Agreement pursuant to, as applicable, FAR 12.212, DFAR 227.7202-1(a), DFAR 227.7202-3(a), and DFAR 227.7202-4, and, to the extent required under U.S. federal law, the minimum restricted rights as set out in FAR 52.227-19 (DEC 2007). If FAR 52.227-19 is applicable, this provision serves as notice under clause (c) thereof and no other notice is required to be affixed to the Software or documentation. The Government’s rights in Software and documentation shall be only those set forth in this Agreement.

SAS Institute Inc., SAS Campus Drive, Cary, NC 27513-2414

September 2017

SAS® and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.

Other brand and product names are trademarks of their respective companies.

9.4-P4:hadoopscicg

Page 3: SAS® and Hadoop Technology: Deployment Scenarios

Contents

Audience . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v

Chapter 1 • Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1Deploying SAS with Hadoop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1How to Use This Guide . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2Tips Before You Deploy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2Supported Hadoop Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3Resources for SAS Viya . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

Chapter 2 • SAS Data Loader for Hadoop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5About SAS Data Loader for Hadoop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5SAS Software on the Hadoop Cluster . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5Deployment Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6Overview of Deployment Steps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

Chapter 3 • SAS High-Performance Analytics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9Analytics Cluster Co-Located with Your Hadoop Cluster . . . . . . . . . . . . . . . . . . . . . . . 14Configure Remote Access to Hadoop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

Recommended Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

Page 4: SAS® and Hadoop Technology: Deployment Scenarios

iv Contents

Page 5: SAS® and Hadoop Technology: Deployment Scenarios

Audience

Audience

This document is for anyone who is interested in learning how to deploy SAS software with Hadoop. Examples include a Hadoop administrator or IT administrator who will install SAS software to work with Hadoop, or a SAS representative who helps customers deploy software.

v

Page 6: SAS® and Hadoop Technology: Deployment Scenarios

vi Audience

Page 7: SAS® and Hadoop Technology: Deployment Scenarios

Chapter 1

Introduction

Deploying SAS with Hadoop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

How to Use This Guide . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

Tips Before You Deploy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2Understand Your Hadoop Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2Understand the SAS Intelligence Platform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

Supported Hadoop Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

Resources for SAS Viya . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

Deploying SAS with HadoopSAS has long supported many different data sources, and Hadoop is no exception. Currently, more than twenty SAS products, solutions, and technology packages interact with Hadoop. Each SAS technology provides different functionality—from accessing and managing Hadoop data to executing in-memory analytics with data in Hadoop.

Because SAS provides multiple options for accessing and processing data in Hadoop, consider the following approaches to deployment:

• You can configure SAS software that enables processing in your Hadoop cluster. To do this, you deploy SAS Embedded Process for Hadoop on the same nodes of your Hadoop cluster. Together with SAS/ACCESS Interface to Hadoop, SAS Embedded Process enables analysts to run programs in Hadoop. This approach eliminates data movement, as SAS performs the analytics where the data is stored and uses the distributed, parallel processing architecture of Hadoop for improved performance.

This guide provides an overview of steps for deploying SAS Data Loader for Hadoop, which takes advantage of SAS Embedded Process and other SAS software running on the Hadoop cluster. For more information, see Chapter 2, “SAS Data Loader for Hadoop,” on page 5.

• You can deploy a SAS High-Performance Analytics environment (analytics cluster) to work with Hadoop. This approach provides the greatest potential for the fastest analytics on very large data sets from Hadoop. The SAS software that makes up the analytics cluster is SAS High-Performance Analytics, SAS LASR Analytic Server, and SAS Plug-ins for Hadoop. Products that can take advantage of this environment include SAS Visual Analytics and SAS Visual Statistics.

1

Page 8: SAS® and Hadoop Technology: Deployment Scenarios

This guide provides an overview of steps for deploying in-memory analytics. For more information, see Chapter 3, “SAS High-Performance Analytics,” on page 9.

• You can configure SAS for basic access to Hadoop so that users can access data in Hadoop, just as they would with any other data source. This approach supports the most products provided by SAS, and any tools that you already have in place can access and manage data from Hadoop.

Because configuring basic access to Hadoop includes post-installation configuration steps for both Base SAS and SAS/ACCESS Interface to Hadoop, this document provides no scenario and no additional planning information. Instead, you can find instructions in SAS Hadoop Configuration Guide for Base SAS and SAS/ACCESS.

T I P For more information about how SAS and Hadoop work together, see SAS and Hadoop Technology: Overview.

How to Use This GuideUse this guide to understand which software is deployed on the Hadoop cluster or as part of the SAS Intelligence Platform. Although an overview of steps is provided for some scenarios, make sure that you refer to the product documentation for details about how to deploy the software.

Tips Before You DeployHere are a few considerations before you get started.

Understand Your Hadoop EnvironmentTo deploy SAS software with Hadoop successfully, consider the following tips:

• Gain working knowledge of the Hadoop distribution that you are using (for example, Cloudera CDH or Hortonworks Data Platform). Make sure you have working knowledge of the Hadoop Distributed File System (HDFS), MapReduce 1, MapReduce 2, YARN, Hive, and HiveServer2 services. Review your YARN configuration. For more information, see the Apache website or the vendor’s website.

• Ensure that the HCatalog, HDFS, HiveServer2, MapReduce, Oozie, Sqoop, and YARN services are running on the Hadoop cluster. SAS software uses these various services, and you need to ensure that the appropriate JAR files are gathered during the configuration.

• Know the location of the MapReduce home.

• Know the host name of the Hive server and the name of the NameNode.

• Determine where the HDFS and Hive servers are running. If the Hive server is not running on the same machine as the NameNode, note the server and port number of the Hive server for future configuration.

• Request permission to restart the MapReduce service.

• Verify that you can run a MapReduce job successfully.

• Understand and verify your Hadoop user authentication.

2 Chapter 1 • Introduction

Page 9: SAS® and Hadoop Technology: Deployment Scenarios

• Understand and verify your security setup.

It is highly recommended that you enable Kerberos or another security protocol for data security.

Verify that you can connect to your Hadoop cluster (HDFS and Hive) from your client machine outside the SAS environment with your defined security protocol.

Understand the SAS Intelligence PlatformThe SAS Intelligence Platform provides the architecture for data management, business intelligence, and analytics. When you deploy SAS software for Hadoop, it is important to understand the different computing tiers and servers that the architecture comprises.

For more information, see SAS Intelligence Platform: Overview.

Supported Hadoop DistributionsSAS supports Hadoop distributions from Cloudera, Hortonworks, IBM BigInsights, MapR, and Pivotal.

For more information about the supported distributions, see SAS 9.4 Support for Hadoop. In addition, see the product documentation or system requirements documentation for SAS products and technologies.

Note: For production environments, customers should seek out a well-supported third-party distribution of Hadoop. This ensures that they can turn to a dedicated Hadoop vendor for assistance with their production Hadoop needs.

Resources for SAS ViyaThis guide provides information about SAS 9.4 and Hadoop. To find information about SAS Viya, see the following resources:

• SAS Viya product page

• SAS Viya documentation page

Resources for SAS Viya 3

Page 10: SAS® and Hadoop Technology: Deployment Scenarios

4 Chapter 1 • Introduction

Page 11: SAS® and Hadoop Technology: Deployment Scenarios

Chapter 2

SAS Data Loader for Hadoop

About SAS Data Loader for Hadoop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

SAS Software on the Hadoop Cluster . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

Deployment Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

Overview of Deployment Steps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

About SAS Data Loader for HadoopNote: The following information is specific to SAS Data Loader 3.1 for Hadoop.

SAS Data Loader for Hadoop makes it easier to move, cleanse, and analyze data in Hadoop. SAS Data Loader for Hadoop consists of a web application, elements of SAS 9.4 Intelligence Platform, and SAS software installed on each node in the Hadoop cluster. The SAS Data Loader for Hadoop web application provides an interactive interface that guides users through the process of creating directives. The directives run on a SAS Workspace Server, which executes generated code, sends code to Hadoop, and receives responses from Hadoop. During execution, directives provide access to generated code, log information, error messages, and results as they become available. Users can save directives, update them, and execute them as needed.

SAS Data Loader for Hadoop requires the server tiers provided by SAS Intelligence Platform.

• SAS Data Loader for Hadoop is deployed as a web application on the SAS middle tier.

• SAS Data Loader for Hadoop and SAS Intelligence Platform software can be deployed on a single-tier or multi-tier environment.

SAS Software on the Hadoop ClusterTo support SAS Data Loader for Hadoop, the following SAS software is installed on the Hadoop cluster:

• SAS Embedded Process, which executes directives in coordination with Hadoop technologies such as Hive, MapReduce, Spark, and Impala

5

Page 12: SAS® and Hadoop Technology: Deployment Scenarios

• SAS Data Loader for Hadoop Spark Engine, which executes data integration and data quality tasks in Apache Spark

• SAS Quality Knowledge Base, which supports data cleansing capabilities in Hadoop

For information about the supported Hadoop distributions, see SAS 9.4 Support for Hadoop. In addition, see the full product documentation or system requirements documentation for SAS products and technologies.

Deployment ConsiderationsAdministrators should consider the following points when preparing to install SAS Data Loader for Hadoop:

• Because SAS Data Loader for Hadoop requires information about the Hadoop cluster and the installation of SAS software on the Hadoop cluster, the SAS administrator should plan the installation with the Hadoop administrator. SAS provides user interfaces to deploy SAS software on the Hadoop cluster, and decisions must be made about who will install SAS software on the Hadoop cluster.

• A Hadoop administrator must be able to provide information about the Hadoop cluster to the SAS administrator or must be involved more directly with the installation.

• To build the SAS Data Loader for Hadoop web application, you must collect the Hadoop JAR and configuration files before SAS Data Loader for Hadoop is deployed. There are two ways to collect the files from the Hadoop cluster:

• A Hadoop administrator can collect these files by using the Hadoop tracer script before the deployment of SAS Data Loader for Hadoop.

• Another approach is to collect the files by using the SAS Deployment Wizard when the SAS software is installed. To collect the files by using the SAS Deployment Wizard, the person running the wizard must have administrator credentials for the Hadoop cluster. If the SAS Deployment Wizard is used, you do not need to run the Hadoop tracer script, which is documented in this topic.

• The primary deployment scenario in SAS Data Loader for Hadoop: Installation and Configuration Guide recommends installing SAS Embedded Process and SAS Data Loader for Hadoop Spark Engine on the Hadoop cluster before deploying SAS Data Loader for Hadoop.

• If Kerberos is used to secure the Hadoop cluster, a Kerberos administrator must perform configuration to support environments that are specific to SAS Data Loader for Hadoop.

Overview of Deployment StepsThe following steps provide an overview for deploying SAS Data Loader for Hadoop. For details, refer to SAS Data Loader for Hadoop: Installation and Configuration Guide.

1. Plan your deployment, including reviewing Hadoop and Kerberos checklists.

2. Install SAS Embedded Process and SAS Data Loader for Hadoop Spark Engine on the Hadoop cluster.

6 Chapter 2 • SAS Data Loader for Hadoop

Page 13: SAS® and Hadoop Technology: Deployment Scenarios

3. Install SAS Data Loader for Hadoop and SAS Intelligence Platform software.

4. Deploy your SAS Quality Knowledge Base on the Hadoop cluster.

5. Update SAS metadata and perform other post-installation configuration.

Overview of Deployment Steps 7

Page 14: SAS® and Hadoop Technology: Deployment Scenarios

8 Chapter 2 • SAS Data Loader for Hadoop

Page 15: SAS® and Hadoop Technology: Deployment Scenarios

Chapter 3

SAS High-Performance Analytics

Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9About SAS High-Performance Analytics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9Deployment Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10Core Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11Supporting Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12Hadoop Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13SAS Products That Take Advantage of High-Performance Analytics . . . . . . . . . . . 13

Analytics Cluster Co-Located with Your Hadoop Cluster . . . . . . . . . . . . . . . . . . . . 14What Gets Deployed . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14Overview of Deployment Steps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

Configure Remote Access to Hadoop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15What Gets Deployed . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15Overview of Deployment Steps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

OverviewYou can deploy software that enables SAS software to process and analyze data in a distributed, in-memory environment. Because different environments are possible, this chapter provides information to help you understand the SAS software that enables high-performance analytics and the general steps for deploying the software.

About SAS High-Performance AnalyticsA key to understanding the advantages of SAS High-Performance Analytics is to understand how data is staged and where analytics takes place. SAS loads data from Hadoop to the analytics cluster, which is where SAS in-memory software is deployed and where SAS processes the data. The analytics cluster then performs the analysis, and only the results are returned to the SAS server or SAS client that submitted the request.

The analytics cluster consists of the SAS High-Performance Analytics environment, SAS LASR Analytic Server, and SAS Plug-ins for Hadoop.When it is deployed, the analytics cluster reduces processing time and brings SAS High-Performance Analytics to data volumes that exceed the memory capacity of a single machine.

To read more about the advantages of using in-memory analytics, see SAS and Hadoop Technology: Overview.

9

Page 16: SAS® and Hadoop Technology: Deployment Scenarios

Deployment ConsiderationsConsider the following options when installing and configuring the analytics cluster:

• You can co-locate the analytics cluster with your Hadoop cluster. An advantage to a co-located deployment is that the Hadoop cluster can be used to stage data in HDFS as SASHDAT tables. The SASHDAT file format is memory efficient, supports SAS formats, and is optimal for use with in-memory processing. If you already have a Hadoop cluster and the resource demands leave capacity for SAS software, consider deploying SAS software on the same machines.

For more information, see “Analytics Cluster Co-Located with Your Hadoop Cluster” on page 14.

• You can configure your analytics cluster to access a remote Hadoop cluster. In this scenario, minimal SAS software is installed in the Hadoop cluster. Specifically, SAS Embedded Process is deployed on the nodes of the Hadoop cluster to handle asymmetric, distributed workloads from remote Hadoop to the SAS analytics cluster. In this scenario, it is recommended that you also deploy co-located Hadoop to stage data in HDFS as SASHDAT tables.

For more information, see “Configure Remote Access to Hadoop” on page 15.

Note: Here are two ways a co-located deployment can help:

• To use SASHDAT tables, a co-located deployment is required. An important feature of a co-located deployment is the ability to stage data on the analytics cluster. The preferred format is a SASHDAT table, because it takes advantage of the redundancy and highly available features of HDFS. SAS LASR Analytic Server and high-performance procedures can read and write SASHDAT tables in parallel at impressive speeds.

• If you use SAS LASR Analytic Server, a co-located deployment is highly recommended. A co-located deployment enables you to read data from operational systems and stage it so that SAS can analyze it on the same machines. One of the most powerful benefits of SAS LASR Analytic Server is the ability to read data in parallel from a co-located data provider. A co-located deployment provides optimal performance when using the in-memory processing.

10 Chapter 3 • SAS High-Performance Analytics

Page 17: SAS® and Hadoop Technology: Deployment Scenarios

Core SoftwareThe primary SAS software that enables distributed, in-memory computing is SAS High-Performance Analytics and SAS LASR Analytic Server.

Table 3.1 Software of the Analytics Cluster

Software Details

SAS High-Performance Analytics environment (also known as SAS High-Performance Analytics node installation)

• Performs analytic tasks in a high-performance environment that is characterized by massively parallel processing (MPP). After you deploy it, a root-and-worker architecture is established for running distributed, high-performance analytics.

• TKGrid is the primary software installed on each node to provide the SAS High-Performance Analytics environment.

SAS LASR Analytic Server • A scalable, analytic platform that provides a secure, multi-user environment for concurrent access to in-memory data.

• Provides the ability to load Hadoop data into memory and perform a variety of distributed processing, exploratory analyses, analytic calculations, and more.

SAS Plug-ins for Hadoop Software that provides services to your supported Hadoop distribution. For a co-located deployment, these services enable the SAS High-Performance Analytics environment to write SASHDAT file blocks evenly across the HDFS file system.

Overview 11

Page 18: SAS® and Hadoop Technology: Deployment Scenarios

Supporting SoftwareSAS provides additional software that can enable access, enhance performance, and facilitate administration of the analytics cluster.

Table 3.2 Supporting Software

Software Details

SAS Embedded Process • Recommended software that enables reading and writing data to HDFS in parallel for SAS High-Performance Analytics.

• SAS Embedded Process is required for in-database functionality with the Hadoop cluster, including scoring acceleration, code acceleration, and the SQL pass-through facility. SAS Embedded Process provides parallel data loading to SAS LASR Analytic Server functionality for Hive and Impala tables and for SPD Engine and SPD Server tables.

• For environments that access Hadoop remotely, SAS Embedded Process is required to handle asymmetric, parallel loads between HDFS and the analytics cluster.

SAS/ACCESS Interface to Hadoop • The required access engine that enables SAS software to interface with Hadoop.

• For environments that access Hadoop remotely, SAS/ACCESS Interface to Hadoop works with SAS Embedded Process to read data from the Hadoop cluster.

SAS High-Performance Computing Management Console

• A console that provides an easy-to-use interface for performing administrative tasks in the analytics cluster.

• This software is optional.

Hadoop JAR and configuration files Files required for the SAS Embedded Process to transfer data to and from the data store and the SAS High-Performance Analytics environment.

Note: A SAS client is required to submit programs to the analytics cluster, and SAS/ACCESS Interface to Hadoop must be installed on the same machine as the SAS client. Data scientists, analytic experts, and other users interface with the SAS client software to write SAS programs and submit them to the SAS High-Performance Analytics environment. An example of a SAS client is SAS Studio.

12 Chapter 3 • SAS High-Performance Analytics

Page 19: SAS® and Hadoop Technology: Deployment Scenarios

Hadoop DistributionsFor information about the supported distributions, see SAS 9.4 Support for Hadoop. In addition, see the full product documentation or system requirements documentation for SAS products and technologies.

SAS Products That Take Advantage of High-Performance AnalyticsIf you intend to use one or more of the following products, consider deploying an analytics cluster.

Table 3.3 SAS Products That Take Advantage of In-Memory Analytics

Software Details

SAS Visual Analytics Use SAS Visual Analytics to explore large volumes of data very quickly to identify patterns and trends and to identify opportunities for further analysis.

Visual Analytics provides an easy-to-use, web-based interface for running analytics.

SAS In-Memory Statistics Use SAS In-Memory Statistics to perform analytical data preparation, variable transformations, exploratory analysis, statistical modeling and machine-learning techniques, integrated modeling comparison, and model scoring.

Data scientists can access data from a variety of sources and use an interactive programming interface to access data in memory. The IMSTAT procedure enables in-memory analytics on the data. SAS LASR Analytic Server holds the data in memory and performs complex analytics.

SAS High-Performance Risk

SAS High-Performance Data Mining

SAS High-Performance Econometrics

SAS High-Performance Optimization

SAS High-Performance Statistics

SAS High-Performance Text-Mining

These products, which provide multiple features for modeling, are collectively known as the SAS High-Performance Analytics products.

Overview 13

Page 20: SAS® and Hadoop Technology: Deployment Scenarios

Analytics Cluster Co-Located with Your Hadoop Cluster

What Gets DeployedIn this scenario, the analytics cluster is deployed on the same nodes with your Hadoop cluster.

The following table shows nodes and roles, as well as locations where analytics software is installed.

Table 3.4 Roles and Software for an Analytics Cluster Co-Located with Hadoop

On the NameNode On Each DataNode

Hadoop

• NameNode

SAS Software

• SAS Root Node: SAS High-Performance Analytics Environment

• SAS High-Performance Management Console (optional)

• SAS Embedded Process

Hadoop

• DataNode

SAS Software

• Worker Node: SAS High-Performance Node

• SAS Embedded Process

Note: Both the Hadoop NameNode and the root node for SAS High-Performance Analytics are installed on the same machine. The root node takes on the role of distributing and coordinating the workload to the worker nodes.

Overview of Deployment Steps

Table 3.5 Overview of Deployment Steps — Co-Located

Steps Where to Find Information

Ensure that machines are configured appropriately, including setting up passwordless SSH, designating ports, and more.

“Chapter 2 — Preparing Your System to Deploy the SAS High-Performance Analytics Infrastructure” in SAS High-Performance Analytics Infrastructure: Installation and Configuration Guide

Deploy SAS High-Performance Computing Management Console.

“Chapter 3 — Deploying SAS High-Performance Computing Management Console” in SAS High-Performance Analytics Infrastructure: Installation and Configuration Guide

14 Chapter 3 • SAS High-Performance Analytics

Page 21: SAS® and Hadoop Technology: Deployment Scenarios

Steps Where to Find Information

If your site wants to use Hadoop as the co-located data store, then you can modify a supported pre-existing Hadoop distribution.

“Chapter 4 — Modifying Co-Located Hadoop” in SAS High-Performance Analytics Infrastructure: Installation and Configuration Guide

Install and configure the SAS High-Performance Analytics environment.

Note: TKGrid is the primary software installed on each node to provide the SAS High-Performance Analytics environment.

“Chapter 5 — Deploying the SAS High-Performance Analytics Environment” in SAS High-Performance Analytics Infrastructure: Installation and Configuration Guide

Deploy SAS Embedded Process. SAS In-Database Products: Administrator’s Guide.

Configure Remote Access to Hadoop

What Gets DeployedIn this scenario, you deploy software on your existing Hadoop cluster. This deployment enables a remote, parallel connection between a Hadoop cluster and a set of machines that is dedicated to the analytics cluster. For this scenario, SAS/ACCESS Interface to Hadoop is installed on a SAS client machine that is remote from the Hadoop cluster and the analytics cluster.

The following table shows SAS Embedded Process software installed in the Hadoop cluster.

Table 3.6 Software Deployed for Remote Access to Hadoop

On the NameNode On Each DataNode

Hadoop

• NameNode

SAS Software

• SAS Embedded Process

Hadoop

• DataNode

SAS Software

• SAS Embedded Process

Configure Remote Access to Hadoop 15

Page 22: SAS® and Hadoop Technology: Deployment Scenarios

Overview of Deployment Steps

Table 3.7 Overview of Deployment Steps — Remote Parallel Connection to Hadoop

Steps Where to Find Information

Install SAS/ACCESS Interface to Hadoop on a SAS client that is remote from the root node of the analytics cluster.

SAS Hadoop Configuration Guide for Base SAS and SAS/ACCESS

Deploy SAS Embedded Process on each node in your Hadoop cluster.

SAS In-Database Products: Administrator’s Guide

Configure the analytics environment for a remote parallel connection. Make sure that TKGrid_REP is configured across all nodes in the in-memory analytics cluster. TKGrid_REP is a configuration of TKGrid that enables support for remote access to Hadoop.

“Chapter 6 — Configuring the Analytics Environment for a Remote Parallel Connection” in SAS High-Performance Analytics Infrastructure: Installation and Configuration Guide.

16 Chapter 3 • SAS High-Performance Analytics

Page 23: SAS® and Hadoop Technology: Deployment Scenarios

Recommended Reading

Here is the recommended reading list for this title:

• SAS and Hadoop Technology: Overview

• SAS Data Loader for Hadoop: Installation and Configuration Guide

• SAS High-Performance Analytics Infrastructure: Installation and Configuration Guide

• SAS In-Database Products: Administrator’s Guide

• SAS Hadoop Configuration Guide for Base SAS and SAS/ACCESS

For a complete list of SAS publications, go to sas.com/store/books. If you have questions about which titles you need, please contact a SAS Representative:

SAS BooksSAS Campus DriveCary, NC 27513-2414Phone: 1-800-727-0025Fax: 1-919-677-4444Email: [email protected] address: sas.com/store/books

17

Page 24: SAS® and Hadoop Technology: Deployment Scenarios

18 Recommended Reading

Page 25: SAS® and Hadoop Technology: Deployment Scenarios

Index

Cco-located deployment 10

Ddeployment

approaches 1co-located 10high-performance analytics 9remote, parallel connection to Hadoop

10SAS Data Loader for Hadoop 6

HHadoop Distributed File System (HDFS)

2HCatalog 2HDFS (Hadoop Distributed File System)

2high-performance analytics 1

co-located deployment 10explained 9related products 13remote, parallel connection to Hadoop

10, 15software components 11

Hive 2HiveServer2 2

KKerberos 3, 6

MMapReduce 2

OOozie 2

Pprocessing data

basic access 2in a Hadoop cluster 1in memory 1

Rremote, parallel connection to Hadoop

10, 15

SSAS Data Loader for Hadoop

explained 5SAS and Hadoop administrator roles 6SAS Intelligence Platform 5SAS software on the Hadoop cluster 5

SAS Data Loader for Hadoop Spark Engine 6

SAS Embedded Process 5, 12deploy 15explained 1remote, parallel connection to Hadoop

15SAS High-Performance Analytics

environment 1, 11, 14SAS High-Performance Analytics node

installation 11SAS High-Performance Analytics

products 13SAS High-Performance Computing

Management Console 12, 14SAS in-database deployment package for

HadoopSee SAS Embedded Process

SAS In-Memory Statistics 13SAS Intelligence Platform 3SAS LASR Analytic Server 1, 10, 11SAS Plug-ins for Hadoop 1, 11SAS Quality Knowledge Base 6SAS Visual Analytics 1, 13SAS Visual Statistics 1SAS Viya 3

19

Page 26: SAS® and Hadoop Technology: Deployment Scenarios

SAS/ACCESS Interface to Hadoop 12explained 1remote, parallel connection to Hadoop

15SASHDAT tables 10Sqoop 2

TTKGrid 11, 15TKGrid_REP 16

YYARN 2

20 Index

Page 27: SAS® and Hadoop Technology: Deployment Scenarios
Page 28: SAS® and Hadoop Technology: Deployment Scenarios