
PUBLIC
SAP Data Services
Document Version: 4.2 Support Package 14 (14.2.14.0) – 2022-05-03

Supplement for Hadoop

© 2022 SAP SE or an SAP affiliate company. All rights reserved.


Page 2: Supplement for Hadoop

Content

1 About this supplement [page 4]
2 Naming conventions and variables [page 5]
3 Apache Hadoop [page 9]
3.1 Hadoop in Data Services [page 12]
3.2 Hadoop sources and targets [page 14]
3.3 Prerequisites to Data Services configuration [page 15]
3.4 Verify Linux setup with common commands [page 16]
3.5 Hadoop support for the Windows platform [page 18]
3.6 Configure Hadoop for text data processing [page 19]
  Optimize text data processing for the Hadoop framework [page 20]
3.7 Setting up HDFS and Hive on Windows [page 21]
4 Apache Impala [page 23]
4.1 Download the Cloudera ODBC driver for Impala [page 23]
4.2 Creating an Apache Impala datastore [page 25]
5 Connect to HDFS [page 27]
5.1 HDFS file location objects [page 27]
  HDFS file location object options [page 28]
5.2 HDFS file format objects [page 32]
  HDFS file format options [page 33]
  Configuring custom Pig script results as source [page 38]
  Previewing HDFS file data [page 39]
6 Connect to Hive [page 40]
6.1 Hive datastores [page 41]
  Hive adapter installation and configuration [page 42]
  Hive database datastores [page 49]
6.2 Pushing the JOIN operation to Hive [page 56]
6.3 About partitions [page 57]
6.4 Metadata mapping for Hive [page 58]
6.5 Hive data type conversion [page 59]
6.6 Configuring Kerberos authentication for Hive connection [page 60]
7 Cloudera Data Platform (CDP) Private Cloud Base [page 63]
7.1 Checking your CDP Private Cloud Base connection [page 64]
7.2 Download CDP ODBC client driver [page 65]
7.3 Configure CDP ODBC client driver in Linux [page 65]
7.4 Configure CDP Hive DSN-less connections in Windows [page 67]
7.5 Configure CDP DSN on Windows [page 68]
7.6 Configure CDP DSN on Linux [page 69]
7.7 CDP datastore options for Apache Hive [page 72]
7.8 CDP datastore options for Apache Impala [page 73]
7.9 Configure CDP HDFS file location object [page 74]
8 Google Cloud Dataproc [page 75]
8.1 Configure driver and data source name (DSN) [page 76]
8.2 Hive database datastore for Google Dataproc [page 77]
8.3 Create a WebHDFS file location [page 78]
  Configuring host for WebHDFS file location [page 79]
9 SAP Big Data Services [page 80]
9.1 Setting SSH tunneling with port forwarding on Windows [page 81]
9.2 Setting SSH tunneling with port forwarding on UNIX or Linux [page 82]
9.3 ODBC driver requirements [page 83]
  About the ODBC Drivers Selector for Windows [page 84]
  Configuring ODBC Hive driver with DS Connection Manager [page 84]
9.4 Generating API tokens for the Hive Server [page 85]
9.5 Generating SWebHdfs delegation token for the HttpFS Service [page 86]
9.6 Obtaining an SSL certificate file [page 86]
9.7 Creating a DSN connection for Windows [page 87]
9.8 Creating a DSN connection for UNIX or Linux [page 88]
9.9 Creating a file location for Kerberos-secured Hadoop cluster [page 89]
9.10 Creating a datastore for Kerberos-secured Hive cluster [page 90]
9.11 Upload data to HDFS in the cloud [page 92]


1 About this supplement

This supplement describes how to use SAP Data Services to access your Hadoop data for Data Services processes.

Only experienced Data Services and Hadoop users should attempt to perform any of the processes in this supplement.

The supplement discusses the Data Services objects and processes related to accessing your Hadoop account for downloading and uploading data, and the processes for configuring these objects.

Use the following Data Services documentation as companions to this supplement:

● Designer Guide
● Reference Guide
● Supplement for Adapters
● SAP Cloud Platform for Big Data Services

Access all related documentation from our User Assistance Customer Portal.


2 Naming conventions and variables

This documentation uses specific terminology, location variables, and environment variables that describe various features, processes, and locations in SAP BusinessObjects and SAP Data Services.

Terminology

SAP Data Services documentation uses the following terminology:

● The terms Data Services system and SAP Data Services mean the same thing.
● The term BI platform refers to SAP BusinessObjects Business Intelligence platform.
● The term IPS refers to SAP BusinessObjects Information platform services.

Note
Data Services requires BI platform components. However, when you don't use other SAP applications, IPS, a scaled-back version of BI, also provides these components for Data Services.

● CMC refers to the Central Management Console provided by the BI or IPS platform.
● CMS refers to the Central Management Server provided by the BI or IPS platform.

Variables

The following table describes the location variables and environment variables that are necessary when you install and configure Data Services and required components.

Variables Description

INSTALL_DIR The installation directory for SAP applications such as Data Services.

Default location:

● For Windows: C:\Program Files (x86)\SAP BusinessObjects

● For UNIX: $HOME/sap businessobjects

Note
INSTALL_DIR isn't an environment variable. The installation location of SAP software can be different than what we list for INSTALL_DIR based on the location that your administrator sets during installation.


BIP_INSTALL_DIR The directory for the BI or IPS platform.

Default location:

● For Windows: <INSTALL_DIR>\SAP BusinessObjects Enterprise XI 4.0

Example
C:\Program Files (x86)\SAP BusinessObjects\SAP BusinessObjects Enterprise XI 4.0

● For UNIX: <INSTALL_DIR>/enterprise_xi40

Note
These paths are the same for both BI and IPS.

Note
BIP_INSTALL_DIR isn't an environment variable. The installation location of SAP software can be different than what we list for BIP_INSTALL_DIR based on the location that your administrator sets during installation.

<LINK_DIR> An environment variable for the root directory of the Data Services system.

Default location:

● All platforms: <INSTALL_DIR>\Data Services

Example
C:\Program Files (x86)\SAP BusinessObjects\Data Services


<DS_COMMON_DIR> An environment variable for the common configuration directory for the Data Services system.

Default location:

● If your system is on Windows (Vista and newer): <AllUsersProfile>\SAP BusinessObjects\Data Services

Note
The default value of the <AllUsersProfile> environment variable for Windows Vista and newer is C:\ProgramData.

Example
C:\ProgramData\SAP BusinessObjects\Data Services

● If your system is on Windows (older versions such as XP): <AllUsersProfile>\Application Data\SAP BusinessObjects\Data Services

Note
The default value of the <AllUsersProfile> environment variable for older Windows versions is C:\Documents and Settings\All Users.

Example
C:\Documents and Settings\All Users\Application Data\SAP BusinessObjects\Data Services

● UNIX systems (for compatibility): <LINK_DIR>

The installer automatically creates this system environment variable during installation.

Note
Starting with Data Services 4.2 SP6, users can designate a different default location for <DS_COMMON_DIR> during installation. If you can't find the <DS_COMMON_DIR> in the listed default location, ask your System Administrator to find out where your default location is for <DS_COMMON_DIR>.

<DS_USER_DIR> The environment variable for the user-specific configuration directory for the Data Services system.

Default location:

● If you're on Windows (Vista and newer): <UserProfile>\AppData\Local\SAP BusinessObjects\Data Services

Note
The default value of the <UserProfile> environment variable for Windows Vista and newer versions is C:\Users\{username}.

● If you're on Windows (older versions such as XP): <UserProfile>\Local Settings\Application Data\SAP BusinessObjects\Data Services

Note
The default value of the <UserProfile> environment variable for older Windows versions is C:\Documents and Settings\{username}.

Note
The system uses <DS_USER_DIR> only for Data Services client applications on Windows. UNIX platforms don't use <DS_USER_DIR>.

The installer automatically creates this system environment variable during installation.
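On a Linux or UNIX Job Server host, a quick way to confirm that these variables resolve to the directories described above is to print them in the shell that starts the Job Server. This is only an illustrative check; the values on your system depend on the installation paths your administrator chose.

# Print the Data Services environment variables (values differ per installation).
echo "LINK_DIR=$LINK_DIR"
echo "DS_COMMON_DIR=$DS_COMMON_DIR"
# The Job Server binaries should be under $LINK_DIR/bin if LINK_DIR is correct.
ls "$LINK_DIR/bin" | head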


3 Apache Hadoop

Use SAP Data Services to connect to Apache Hadoop frameworks, including the Hadoop Distributed File System (HDFS) and Hive.

Data Services supports Hadoop on both the Linux and Windows platforms. For Windows support, Data Services uses Hortonworks Data Platform (HDP) only. HDP supports data from many sources and formats. See the latest Product Availability Matrix (PAM) on the SAP Support Portal for the supported versions of HDP.

For information about deploying Data Services on a Hadoop MapR cluster machine, see SAP Note 2404486 .

For information about accessing your Hadoop in the administered SAP Big Data Services, see the Supplement for SAP Big Data Services.

The following table describes the relevant components of Hadoop:

Component Description

HDFS (Hadoop Distributed File System) A distributed file system that stores data on nodes, providing high aggregate bandwidth across the cluster.

Hive A data warehouse infrastructure that allows SQL-like on demand querying of data, in any format, stored in Hadoop.

Impala An SQL query engine for data stored in your HDFS.

Pig A high-level data flow language and execution framework for parallel computation that is built on top of Hadoop. Data Services uses Pig scripts to read from and write to HDFS, including join and push-down operations.

Map/Reduce A computational paradigm where the application is divided into many small fragments of work. Each fragment may be executed or re-executed on any node in the cluster. Data Services uses map/reduce to do text data processing.

The following table describes all of the objects related to Hadoop in Data Services that you use to work with your Hadoop data.

Hadoop objects and tools

Object Description

Hive adapter Enables Data Services to connect to a Hive server so that you can work with data from Hadoop.

For complete information about using Data Services adapters, see the Supplement for Adapters.


Hive datastore Enables Data Services to access data from your Hive data warehouse to use as a source or a target in Data Services processing.

Also use a Hive datastore to access your Hadoop clusters in Google Cloud Dataproc for processing in Data Services.

There are two types of Hive datastores:

● Hive adapter datastore: Use with the Hive adapter.

Note
To use the Hive adapter datastore, install Data Services on the machine within the Hadoop cluster.

Note
For complete information about adapters and creating an adapter datastore, see the Supplement for Adapters and the Reference Guide.

● Hive database datastore.

Note
Install Data Services on any machine.

To access a remote Hive server, configure the Hive database datastore with a supported Hive ODBC driver and a data source name (DSN).

HDFS file format Contains a description for your HDFS file system structure.

Note
To use as a source or target, install Data Services within the Hadoop cluster. Use an HDFS file location to connect to HDFS when Data Services is installed outside of the Hadoop cluster.

HDFS file location Contains the transfer protocol to your HDFS.

Associate a file format with the HDFS file location and use as a source or target in a data flow. Use a file format that is not an HDFS file format. For example, use a flat file format.

Note
Install Data Services on any machine.


WebHDFS file location Contains the transfer protocol to your HDFS using REST API.

Configure a Hive target object to bulk load processed data to your Hadoop clusters in Google Cloud Dataproc.

Hive template table Use a Hive template table as a target using one of the following two methods:

● Method 1: Use a Hive datastore template table from the Datastore tab in the Designer object library.

● Method 2: Use a template table from the Designer tool palette.

After you have used a Hive template table in an executed data flow, you can use the target Hive template as a source in a data flow.

Hadoop in Data Services [page 12]
SAP Data Services has added support for Hadoop in stages, with features added in specific Data Services versions.

Hadoop sources and targets [page 14]
Use SAP Data Services objects that you configure for Hive or the Hadoop Distributed File System (HDFS) as sources and targets in data flows.

Prerequisites to Data Services configuration [page 15]
Before configuring SAP Data Services to connect to Hadoop, verify that your system configuration is correct.

Verify Linux setup with common commands [page 16]
Use common commands to verify that the configuration of your SAP Data Services system on Linux for Hadoop is correct.

Hadoop support for the Windows platform [page 18]
SAP Data Services supports Hadoop on the Windows platform using Hortonworks.

Configure Hadoop for text data processing [page 19]
SAP Data Services supports text data processing in the Hadoop framework using a MapReduce form of the Entity Extraction transform.

Setting up HDFS and Hive on Windows [page 21]
Set system environment variables and use command prompts to configure HDFS and Hive for Windows.

Related Information

Connect to HDFS [page 27]
Connect to Hive [page 40]


3.1 Hadoop in Data Services

SAP Data Services has added support for Hadoop in stages, with features added in specific Data Services versions.

Use the following table to determine how your version of Data Services supports Hadoop.

Hadoop features in chronological order

Data Services version Hadoop support More information

4.2 SP1 Connect to Hive and HDFS. Linux only.

Requires the following Data Services objects:

● HDFS file format
● Hive adapter
● Hive adapter datastore

Supplement for Adapters.

4.2 SP2 For the Hive adapter datastore, migrate to the following Hive server versions:

● Hive Server version 2
● Hive Server subversion 0.11 and later

For current version information, consult the Product Availability Matrix (PAM).

4.2 SP3 Preview Hive table data Supplement for Hadoop

4.2 SP4 Preview HDFS data Supplement for Hadoop

4.2 SP5 Use SQL functions and the SQL trans­form in data flows with data imported with the Hive datastore.

Supplement for Hadoop

4.2 SP6 Support for:

● Hive template table
● SASL-QoP with Kerberos on Hive (Simple Authentication and Security Layer - Quality of Protection)

● Windows support for Hadoop in Data Services

● Hive support for Varchar and Char data types

Supplement for Hadoop

Reference Guide

4.2 SP7 Support for:

● Hive on Spark
● Hive Beeline CLI replaces Hive CLI
● Hive CLI is deprecated

Supplement for Hadoop


4.2 SP8 Support for:

● Edge nodes running on Hive clusters (Linux only)

● Hive template tables with Parquet, AVRO, and ORC table formats.

Supplement for Hadoop

4.2 SP9 Connect to Hadoop on SAP Big Data Services

Supplement for SAP Big Data Services

4.2 SP10 HDFS file location object supports:

● Reading and loading to HDFS system with optional Kerberos authentication

● Installing Data Services outside the Hadoop cluster for HDFS and Hive reading and loading

Hive datastore enhancements:

● Connecting to Hive server remotely with supported Hive ODBC drivers

● Installing Data Services outside of the Hadoop cluster

● Kerberos
● Bulk loading to Hive

Supplement for Adapters

Supplement for Hadoop

Reference Guide

4.2 SP11 Access the HDFS with Knox Gateway through the HDFS file location object.

Supplement for Hadoop

Reference Guide

For more information about setting up your Hadoop with Knox gateway, see your Apache Knox documentation.

4.2 SP12 Access Kerberos-secure Hadoop cluster in SAP Big Data Services.

DSN and server name (DSN-less) connections to Hive Server 2.

Supplement for Hadoop

Supplement for SAP Big Data Services.

4.2 SP12 Patch 1 (14.02.12.01) Access Hadoop clusters in Google Cloud Dataproc using a Hive database datastore.

Upload generated data from Data Services using a WebHDFS file location.

Supplement for Hadoop


4.2 SP14 Patch 6 (14.02.14.06) Cloudera Data Platform (CDP) Private Cloud Base 7.1 and later versions.

Supplement for Hadoop

For information about CDP, see your Cloudera Docs.

For current version information for CDP and Cloudera ODBC driver, consult the Product Availability Matrix (PAM).

Parent topic: Apache Hadoop [page 9]

Related Information

Hadoop sources and targets [page 14]
Prerequisites to Data Services configuration [page 15]
Verify Linux setup with common commands [page 16]
Hadoop support for the Windows platform [page 18]
Configure Hadoop for text data processing [page 19]
Setting up HDFS and Hive on Windows [page 21]

3.2 Hadoop sources and targets

Use SAP Data Services objects that you configure for Hive or the Hadoop Distributed File System (HDFS) as sources and targets in data flows.

To access data from Hive, use objects that are designed for Hive. For example, use the Hive adapter datastore for jobs that use data from your Hive storage. When you want data from your HDFS, use an HDFS file format or HDFS file location object.

Use other Data Services objects along with Hadoop objects in data flows based on your objectives.

Example
● Configure a data source name (DSN) using a supported Hive ODBC driver to create a Hive datastore that accesses a remote Hive server.
● Configure a flat file with an HDFS file location object and use it as a source or target in a data flow.
● Use an HDFS file location object and a script to access data from a remote source with the copy_from_remote_system function.
● Use an HDFS file location object and a script to upload data from your local server to a remote server using the copy_to_remote_system function.
● Use bulk loading to upload data to Hive or HDFS. Bulk loading works with a flat file, Hive template table, or a table within the Hive datastore as a target in a data flow.
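As an illustration of the copy_from_remote_system and copy_to_remote_system scenarios above, the following Data Services script fragment is a minimal sketch only: the file location name HDFS_FileLocation and the file names are placeholders, and the argument order (file location object first, file name second) is an assumption to confirm against the function reference in the Reference Guide.

# Illustrative sketch: upload a local file to the remote HDFS directory defined
# in a file location object named HDFS_FileLocation (placeholder name).
copy_to_remote_system('HDFS_FileLocation', 'sales_export.csv');

# Illustrative sketch: download a file from the remote HDFS directory to the
# local directory defined in the same file location object.
copy_from_remote_system('HDFS_FileLocation', 'sales_import.csv');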


Parent topic: Apache Hadoop [page 9]

Related Information

Hadoop in Data Services [page 12]
Prerequisites to Data Services configuration [page 15]
Verify Linux setup with common commands [page 16]
Hadoop support for the Windows platform [page 18]
Configure Hadoop for text data processing [page 19]
Setting up HDFS and Hive on Windows [page 21]
Connect to HDFS [page 27]
Connect to Hive [page 40]
Creating a DSN connection with SSL protocol in Windows [page 51]
Configuring bulk loading for Hive [page 53]

3.3 Prerequisites to Data Services configuration

Before configuring SAP Data Services to connect to Hadoop, verify that your system configuration is correct.

Ensure that your Data Services system configuration meets the following prerequisites.

For Linux and Windows platforms:

● You configure the machine where the Data Services Job Server is installed to work with Hadoop.
● The machine where the Data Services Job Server is installed has the Pig client installed.
● If you use Hive, verify that the Hive client is installed:

  1. Log in to the node.
  2. Issue Pig and Hive commands to start the respective interfaces.

● The Data Services Job Server is installed on one of the Hadoop cluster machines, which can be either an Edge (Linux only) or a Data node. To install Data Services to any machine, including the machine in the cluster, use one of the following methods:
  ○ Use a supported ODBC driver and configure a DSN for the Hive adapter datastore.
  ○ Set up jobs using the HDFS file location object. Create the HDFS file location object using either the WebHDFS or the HTTPFS connection protocols.
● If you use text data processing, ensure that you copy the necessary text data processing components to the HDFS to enable MapReduce functionality.

For Linux platforms:

● You set the environment for interaction with Hadoop.
● You start the Job Server from an environment that sources the Hadoop environment script.


Example

source <LINK_DIR>/hadoop/bin/hadoop_env_setup.sh -e
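If you plan to create the HDFS file location object with the WebHDFS or HTTPFS protocol, a quick reachability check from the Job Server host can save troubleshooting later. The following curl calls are an illustrative sketch only; the host names and user name are placeholders, and the ports (9870 or 50070 for WebHDFS, 14000 for HttpFS are common defaults) vary by cluster.

# WebHDFS: list the /tmp directory through the NameNode REST endpoint.
curl -i "http://<namenode-host>:9870/webhdfs/v1/tmp?op=LISTSTATUS&user.name=<hadoop-user>"

# HttpFS: the same operation through an HttpFS gateway.
curl -i "http://<httpfs-host>:14000/webhdfs/v1/tmp?op=LISTSTATUS&user.name=<hadoop-user>"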

Parent topic: Apache Hadoop [page 9]

Related Information

Hadoop in Data Services [page 12]
Hadoop sources and targets [page 14]
Verify Linux setup with common commands [page 16]
Hadoop support for the Windows platform [page 18]
Configure Hadoop for text data processing [page 19]
Setting up HDFS and Hive on Windows [page 21]
Setting UNIX environment variables
HDFS file location object options [page 28]
Hive database datastores [page 49]

3.4 Verify Linux setup with common commands

Use common commands to verify that the configuration of your SAP Data Services system on Linux for Hadoop is correct.

When you use the commands in this topic, your output may be different from what we show. If your output is different, it is okay as long as your commands do not result in errors.

Setting up the environment

To set up the Data Services environment for Hadoop, use the following command:

$ cd <DS Install Directory>/bin
$ source ./al_env.sh
$ cd ../hadoop/bin
$ source ./hadoop_env_setup.sh -e


Checking components

Ensure that Hadoop, Pig, and Hive are installed and correctly configured on the machine where Data Services Job Server for Hadoop resides.

Check the Hadoop, Pig, and Hive configuration by using the following command:

$ hadoop fs -ls /

For Hadoop, you should see output similar to the following:

$ hadoop fs -ls /
Found 2 items
drwxr-xr-x   - hadoop supergroup          0 2013-03-21 11:47 /tmp
drwxr-xr-x   - hadoop supergroup          0 2013-03-14 02:50 /user

For Pig, you should see output similar to the following:

$ pig
INFO org.apache.pig.Main - Logging error messages to: /hadoop/pig_1363897065467.log
INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://machine:9000
INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: machine:9001
grunt> fs -ls /
Found 2 items
drwxr-xr-x   - hadoop supergroup          0 2013-03-21 11:47 /tmp
drwxr-xr-x   - hadoop supergroup          0 2013-03-14 02:50 /user
grunt> quit

For Hive, you should see output similar to the following:

$ hive
Hive history file=/tmp/hadoop/hive_job_log_hadoop_201303211318_504071234.txt
hive> show databases;
OK
default
Time taken: 1.312 seconds
hive> quit;

Set up or restart the Job Server

If all commands pass, use the following command from within the same shell to set up or restart the Job Server.

<LINK_DIR>/bin/svrcfg

This command provides the Job Server with the proper environment from which it starts engines that call Hadoop, Pig, and Hive.

Parent topic: Apache Hadoop [page 9]


Related Information

Hadoop in Data Services [page 12]
Hadoop sources and targets [page 14]
Prerequisites to Data Services configuration [page 15]
Hadoop support for the Windows platform [page 18]
Configure Hadoop for text data processing [page 19]
Setting up HDFS and Hive on Windows [page 21]

3.5 Hadoop support for the Windows platform

SAP Data Services supports Hadoop on the Windows platform using Hortonworks.

Use the supported version of Hortonworks HDP only. See the Product Availability Matrix (PAM) on the SAP Support Portal for the most recent supported version number.

When you use Hadoop on the Windows platform, use Data Services to do the following tasks:

● Use Hive tables as a source or target in your data flows.
● Use HDFS files as a source or target in your data flows using Pig script or the HDFS library API.
● Use an HDFS file location object as a source or target in your data flows. Pig script or the HDFS library API is not required.
● Stage non-Hive data in a data flow using the Data_Transfer transform.
● Preview data for HDFS files and Hive tables.

Requirements

Make sure that you set up your system as follows:

● Install the Data Services Job Server in one of the nodes of the Hadoop cluster.

Note
Install Data Services on any machine when you use an HDFS file location object with one of the following connection protocols:
  ○ WebHDFS
  ○ HTTPFS

Alternatively, use a Hive database datastore configured with DSN and a supported Hive ODBC driver.

● Set the system environment variables, such as PATH and CLASSPATH, so that the Job Server can run as a service.

● Set the permission requirements for the HDFS file system to use HDFS or Hive.

Parent topic: Apache Hadoop [page 9]


Related Information

Hadoop in Data Services [page 12]
Hadoop sources and targets [page 14]
Prerequisites to Data Services configuration [page 15]
Verify Linux setup with common commands [page 16]
Configure Hadoop for text data processing [page 19]
Setting up HDFS and Hive on Windows [page 21]
Connect to HDFS [page 27]
Previewing HDFS file data [page 39]
HDFS file location object options [page 28]
Hive datastores [page 41]
Hive database datastores [page 49]
Pushing the JOIN operation to Hive [page 56]

3.6 Configure Hadoop for text data processing

SAP Data Services supports text data processing in the Hadoop framework using a MapReduce form of the Entity Extraction transform.

To use text data processing in Hadoop, run the following Hadoop environment script.

<LINK_DIR>/hadoop/bin/hadoop_env_setup.sh -c

The script copies the language modules and other dependent libraries to the Hadoop file system so that MapReduce can distribute them during the MapReduce job setup. You only have to do this file-copying operation once after an installation or update, or when you want to use custom dictionaries or rule files.

If you use the Entity Extraction transform with custom dictionaries or rule files, copy the custom dictionaries or rule files to the Hadoop file system for distribution. To do so, first copy the files into the languages directory of the Data Services installation, then rerun the Hadoop environment script. For example:

cp /myhome/myDictionary.nc <LINK_DIR>/TextAnalysis/languages

<LINK_DIR>/hadoop/bin/hadoop_env_setup.sh -c

After you complete the Hadoop environment set up, configure the Entity Extraction transform to push down operations to the Hadoop system by connecting it to a single HDFS Unstructured Text source.

Optimize text data processing for the Hadoop framework [page 20]
To control the mapper settings for text data processing in the Hadoop framework, use a configuration setting.

Parent topic: Apache Hadoop [page 9]


Related Information

Hadoop in Data Services [page 12]
Hadoop sources and targets [page 14]
Prerequisites to Data Services configuration [page 15]
Verify Linux setup with common commands [page 16]
Hadoop support for the Windows platform [page 18]
Setting up HDFS and Hive on Windows [page 21]

3.6.1 Optimize text data processing for the Hadoop framework

To control the mapper settings for text data processing in the Hadoop framework, use a configuration setting.

Use the following Hadoop configuration setting to control the amount of data a mapper can handle and the number of mappers used by a job: mapred.max.split.size .

Set the value for mapred.max.split.size in the Hadoop configuration file. The Hadoop configuration file is located at $HADOOP_HOME/conf/core-site.xml. The Hadoop configuration file could be located in an alternate location, depending on the type of Hadoop you use.

By default, the value for mapred.max.split.size is 0. When you keep the default for this configuration setting:

● The software does not limit the amount of data the mapper handles.● The software runs text data processing with only one mapper.

Change the default configuration value to the amount of data that each mapper can handle.

ExampleA Hadoop cluster contains 20 machines. Each machine is set to run a maximum of 10 mappers. 20 machines x 10 mappers = 200 mappers available in the cluster.

Your input data averages 200 GB. To have the text data processing job consume 100 percent of the available mappers, set mapred.max.split.size to 1073741824 (1 GB).

Calculation: 200 GB ÷ 200 mappers = 1 GB per mapper.

<property>
  <name>mapred.max.split.size</name>
  <value>1073741824</value>
</property>

To have the text data processing job consume 50 percent of the available mappers, set mapred.max.split.size to 2147483648 (2 GB).

Calculation: 200 GB ÷ 100 mappers = 2 GB per mapper.

<property>
  <name>mapred.max.split.size</name>
  <value>2147483648</value>
</property>

Parent topic: Configure Hadoop for text data processing [page 19]

3.7 Setting up HDFS and Hive on Windows

Set system environment variables and use command prompts to configure HDFS and Hive for Windows.

Install the SAP Data Services Job Server component.

Perform the following steps to set up HDFS and Hive on Windows:

1. Set the following system environment variable:

HDFS_LIB_DIR = /sap/dataservices/hadoop/tdp

2. Add the following to the PATH: <LINK_DIR>\ext\jre\bin\server.

3. Run the following command:

hadoop classpath --jar c:\temp\hdpclasspath.jar

4. Update CLASSPATH=%CLASSPATH%; c:\temp\hdpclasspath.jar.

CLASSPATH generates the Hadoop and Classpath .jar files.

5. Set the location of the Hadoop and Classpath .jar files.

6. When the Hadoop CLASSPATH command completes successfully, check the content of the .jar file for the Manifest file.

7. Check that the hdfs.dll has symbols exported.

If the symbols from the hdfs.dll are not exported, install the fix from Hortonworks for the export of symbols. If the symbols from the .dll are still not exported, use the .dll from Hortonworks 2.3.

8. Required only if you use Text Data Processing transforms in jobs, and only once per Data Services install: Run the following command from <LINK_DIR>\bin:

Hadoop_env_setup.bat

The .bat file copies the Text Analysis Language file to the HDFS cache directory.

9. Ensure that the Hadoop or Hive .jar files are installed. The Data Services Hive adapter uses the following .jar files:

○ commons-httpclient-3.0.1.jar
○ commons-logging-1.1.3.jar
○ hadoop-common-2.6.0.2.2.6.0-2800.jar
○ hive-exec-0.14.0.2.2.6.0-2800.jar
○ hive-jdbc-0.14.0.2.2.6.0-2800-standalone.jar
○ hive-jdbc-0.14.0.2.2.6.0-2800.jar
○ hive-metastore-0.14.0.2.2.6.0-2800.jar
○ hive-service-0.14.0.2.2.6.0-2800.jar
○ httpclient-4.2.5.jar
○ httpcore-4.2.5.jar
○ libfb303-0.9.0.jar
○ log4j-1.2.16.jar
○ slf4j-api-1.7.5.jar
○ slf4j-log4j12-1.7.5.jar

10. Run the following commands to set up the permissions on the HDFS file system:

hdfs dfs -chmod -R 777 /mapred
hdfs dfs -mkdir /tmp
hdfs dfs -chmod -R 777 /tmp
hdfs dfs -mkdir /tmp/hive/
hdfs dfs -chmod -R 777 /tmp/hive
hdfs dfs -mkdir -p /sap/dataservices/hadoop/tdp
hdfs dfs -mkdir -p /user/hive
hdfs dfs -mkdir -p /hive/warehouse
hdfs dfs -chown hadoop:hadoop /user/hive
hdfs dfs -chmod -R 755 /user/hive
hdfs dfs -chmod -R 777 /hive/warehouse

Task overview: Apache Hadoop [page 9]

Related Information

Hadoop in Data Services [page 12]
Hadoop sources and targets [page 14]
Prerequisites to Data Services configuration [page 15]
Verify Linux setup with common commands [page 16]
Hadoop support for the Windows platform [page 18]
Configure Hadoop for text data processing [page 19]
Data Services adapters
Hive datastores [page 41]


4 Apache Impala

Create an ODBC datastore to connect to Apache Impala in Hadoop.

Before you create an Apache Impala datastore, download the Cloudera ODBC driver and create a data source name (DSN). Use the datastore to connect to Hadoop and import Impala metadata. Use the metadata as a source or target in a data flow.

Before you work with Apache Impala, be aware of the following limitations:

● SAP Data Services supports Impala 2.5 and later.
● SAP Data Services supports only Impala scalar data types. Data Services does not support complex types such as ARRAY, STRUCT, or MAP.

For more information about ODBC datastores, see the Datastores section in the Designer Guide.

For descriptions of common datastore options, see the Designer Guide.

Download the Cloudera ODBC driver for Impala [page 23]
For Linux users. Before you create an Impala database datastore, connect to Apache Impala using the Cloudera ODBC driver.

Creating an Apache Impala datastore [page 25]
To connect to your Hadoop files and access Impala data, create an ODBC datastore in SAP Data Services Designer.

4.1 Download the Cloudera ODBC driver for Impala

For Linux users. Before you create an Impala database datastore, connect to Apache Impala using the Cloudera ODBC driver.

Perform the following high-level steps to download a Cloudera ODBC driver and create a data source name (DSN). For more in-depth information, consult the Cloudera documentation.

1. Enable Impala Services on the Hadoop server.
2. Download and install the Cloudera ODBC driver (https://www.cloudera.com/downloads/connectors/impala/odbc/2-5-26.html):

Select the driver that is compatible with your platform. For information about the correct driver versions, see the SAP Product Availability Matrix (PAM).

3. Start DSConnectionManager.sh.

Either open the file or run the following command:

$ cd $LINK_DIR/bin/
$ ./DSConnectionManager.sh


Example
The following shows prompts and values in DS Connection Manager that include Kerberos and SSL:

The ODBC ini file is <path to the odbc.ini file>
There are available DSN names in the file:
[DSN name 1]
[DSN name 2]
Specify the DSN name from the list or add a new one: <New DSN file name>
Specify the User Name: <Hadoop user name>
Type database password:(no echo)
*Type the Hadoop password. Password does not appear after you type it for security.
Retype database password:(no echo)
Specify the Host Name: <host name/IP address>
Specify the Port:'21050' <port number>
Specify the Database: default
Specify the Unix ODBC Lib Path:
*The Unix ODBC Lib Path is based on where you install the driver.
*For example, /build/unixODBC-2.3.2/lib.
Specify the Driver: /<path>/lib/64/libclouderaimpalaodbc64.so
Specify the Impala Auth Mech [0:noauth|1:kerberos|2:user|3:user-password]:'0': 1
Specify the Kerberos Host FQDN: <hosts fully qualified domain name>
Specify the Kerberos Realm: <realm name>
Specify the Impala SSL Mode [0:disabled | 1:enabled]:'0' 1
Specify the Impala SSL Server Certificate File: <path to certificate.pem>
Testing connection...
Successfully added database source.
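Optionally, you can confirm the new DSN outside of Data Services with the unixODBC isql utility before creating the datastore. This is an illustrative check only; the DSN name, user, and password are placeholders, and it assumes unixODBC is installed and the odbc.ini shown by DS Connection Manager is the one in use.

# Replace the placeholders with the DSN name you created and your Hadoop credentials.
isql -v <New DSN name> <Hadoop user name> <password>
# A "Connected!" banner indicates the driver and DSN work; then try a simple query, for example:
# SQL> show databases;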

Task overview: Apache Impala [page 23]

Related Information

Creating an Apache Impala datastore [page 25]


4.2 Creating an Apache Impala datastore

To connect to your Hadoop files and access Impala data, create an ODBC datastore in SAP Data Services Designer.

Before performing the following steps, enable Impala Services on your Hadoop server. Then download the Cloudera driver for your platform.

Note
If you didn't create a DSN (data source name) in the Windows ODBC Data Source application, you can create a DSN in the following process.

To create an ODBC datastore for Apache Impala, perform the following steps in Designer:

1. Select Tools > New > Datastore.
   The datastore editor opens.
2. Choose Database from the Datastore Type dropdown list.
3. Choose ODBC from the Database Type dropdown list.
4. Select ODBC Admin.

   The ODBC Data Source Administrator opens.
5. Open the System DSN tab and select the Cloudera driver that you downloaded from the System Data Sources list.
6. Select Configure.

   The Cloudera ODBC Driver for Impala DSN Setup dialog box opens.
7. Enter the required information based on your system.
8. Select Advanced Options.
9. Check Use SQL Unicode Types.
10. Close Advanced Options.
11. Optional: Enable Kerberos authentication by performing the following substeps:

    a. Choose Kerberos from the Mechanism list.
    b. Enter the name of the applicable realm in Realm.
       A realm is a set of managed nodes that share the same Kerberos database.
    c. Enter the fully qualified domain name (FQDN) of the Hive Server host in Host FQDN.
    d. Enter the service principal name for the Hive server in Service Name.
    e. Enable the Canonicalize Principal FQDN option.

       This option canonicalizes the host FQDN in the server principal name.
12. Optional: To enable Secure Sockets Layer (SSL) protocol, perform the following substeps:

    a. Choose No Authentication (SSL) from the Mechanism list.
    b. Select Advanced Options.
    c. Enter or browse to the Cloudera certificate file in Trusted Certificates.
       The default path to the Impala certificate.pem file automatically populates.
    d. Close Advanced Options.


13. Close the Cloudera ODBC Driver for Impala DSN Setup dialog box and the ODBC Data Source Administrator.

14. Select the Cloudera DSN that you created from the Data Source Name list.

    The DSN appears in the dropdown list only after you've created it.
15. Select Advanced and complete the advanced options as necessary.

a. Optional: Set the Code page option to utf-8 in the Locale group to process multibyte data in Impala tables.

b. Optional: In the ODBC Date Function Support group, set the Week option to No.

If you don’t set the Week option to No, the result of the Data Services built-in function week_in_year() may be incorrect.

Task overview: Apache Impala [page 23]

Related Information

Download the Cloudera ODBC driver for Impala [page 23]


5 Connect to HDFS

Connect to your Hadoop Distributed File System (HDFS) data using an HDFS file format or an HDFS file location.

An HDFS file format and an HDFS file location contain your HDFS connection information, including account name, password, security protocol, and so on. Data Services uses this information to access HDFS data during Data Services processing.

Decide which object to use based on the location of your Data Services installation:

● Use an HDFS file format when Data Services is installed within the Hadoop cluster.
● Use an HDFS file location when Data Services is installed anywhere, including within the Hadoop cluster.

If your Hadoop system is managed in SAP Big Data Service (formerly Altiscale), your connection setup uses information from your Big Data Service account. For complete instructions to connect to your Big Data Service account, see the Supplement for SAP Big Data Service.

HDFS file location objects [page 27]
To use HDFS data as a source, or to upload generated data to HDFS, use an HDFS file location as a source or target in a data flow along with a flat file or template table.

HDFS file format objects [page 32]
An HDFS file format stores connection information to an HDFS file.

Related Information

HDFS file location object options [page 28]
Upload data to HDFS in the cloud [page 92]

5.1 HDFS file location objects

To use HDFS data as a source, or to upload generated data to HDFS, use an HDFS file location as a source or target in a data flow along with a flat file or template table.

When you create the file location:

● Enter the file transfer protocol specifics for your HDFS.
● Define a local and remote server for which you have access permission.

Data Services uses the remote and local server information and the file transfer protocol in a data flow to move data between the local and remote server.

For information about running a Hadoop file location job to append data to HDFS file on a small cluster, see SAP Note 2771182 .


For more information about file location objects, see the Designer Guide.

HDFS file location object options [page 28]
Use a Hadoop distributed file system (HDFS) file location to access your Hadoop data for Data Services processing.

Parent topic: Connect to HDFS [page 27]

Related Information

HDFS file format objects [page 32]

HDFS file location object options [page 28]

5.1.1 HDFS file location object options

Use a Hadoop distributed file system (HDFS) file location to access your Hadoop data for Data Services processing.

Use the HDFS file location as a source or target in an SAP Data Services data flow.

When you create a new HDFS file location, select HDFS from the Protocol dropdown list.

The following table describes the file location options that are specific to the HDFS protocol. For descriptions of general options, see the Reference Guide.

Option Description

Connection section

Protocol Type of file transfer protocol.

Select HDFS.

Communication Protocol Type of protocol to use to access the data in your HDFS.

● WebHDFS: Select when Data Services is not installed as a part of the Hadoop cluster. Ensure that you configure WebHDFS on your server side.

● HTTPFS: Select when Data Services is not installed as a part of the Hadoop cluster.

● HDFS: Select when Data Services is installed as a part of the Hadoop cluster.

Host Name of the computer that hosts the NameNode.

Secondary NameNode Name of the computer that hosts the secondary NameNode.


Port Port number on which the NameNode listens.

User Hadoop user name.

Password Password for the WebHDFS communication protocol method. Required for Knox gateway and topology.

Compression type Specifies not to use compression, or to use gzip compression:

● None: Default setting. The file location object does not use compression.

● gzip: The file location object uses gzip compression. The software compresses the files before upload to Hadoop and decompresses the files after download from Hadoop.

Not applicable when you select HDFS type for Communication protocol.

Connection retry count Number of times the computer tries to create a connection with the remote server after a connection fails.

The default value is 10.

The value cannot be zero.

After the specified number of retries, Data Services issues an error message and stops the job.

Not applicable when you select HDFS type for Communication protocol.

Batch size for uploading data (MB) Size of the data transfer in MB to use for uploading data.

The default value is 5.

Data Services uses different upload methods based on file size:

● Single part uploads for files less than 5 MB.
● Multi part uploads for files larger than 5 MB.

Data Services limits the total upload batch size to 100 MB.

Not applicable when you select HDFS type for Communication protocol.


Batch size for downloading data (MB) Size of the data transfer in MB the software uses to download data from Hadoop.

The default value is 5.

Not applicable when you select HDFS type Communication protocol.

Number of threads Number of upload and download threads for transferring data from and to Hadoop.

The default value is 1.

Not applicable when you select HDFS type Communication protocol.

Authentication type Authentication for the HDFS connection.

● None: Kerberos security is not enabled.

For Kerberos enabled cluster:

● Delegation token: You have a delegation token for authentication of the request.

● Kerberos: Default. You have a password to enter in the Password option.

● Kerberos keytab: You have a generated keytab file. With this option, you do not enter a value for Password, but you enter a location for Keytab file.

Keytab file Generated keytab file name.

Applicable when you select Kerberos keytab for Authentication type.

Kerberos Password Kerberos password.

Applicable when you select Kerberos for Authentication type.

SSL enabled Select Yes to use a Secure Socket Layer (SSL) connection to HDFS.

Not applicable when you select WebHDFS type for Communication protocol.

File System section

Remote directory Path for your HDFS working directory.

Local directory Path for your local working directory.


Replication factor The number of replicated files that HDFS should create.

The default value is 2.

Not applicable when you select HDFS type for Communication protocol.

Proxy section: Complete the Proxy options only when you are using a proxy.

Proxy host Path and host name for the REST API proxy server.

Not applicable when you select HDFS type for Communication protocol.

Proxy port Port number for the REST API Proxy server.

Not applicable when you select HDFS type for Communication protocol.

Proxy Username User name for the REST API proxy server.

Not applicable when you select HDFS type for Communication protocol.

Proxy Password Password for the REST API proxy server.

Not applicable when you select HDFS type for Communication protocol.

HDFS Proxy user Proxy user name configured for the HDFS user.

Not applicable when you select HDFS type for Communication protocol.

Pig section

Working directory Directory path or variable. The software uses this directory when transferring data from the remote server to the local server, and when transferring data from the local server to the remote server.

Applicable only for HDFS type Communication protocol.

Clean up working directory Determines if the software deletes files in the working directory after execution.

● Yes: Default setting. Deletes the working directory files.
● No: Preserves the working directory files.

If you select No, intermediate files remain in both this working directory and the Data Services directory <$LINK_DIR>/log/hadoop.

Applicable only for HDFS type Communication protocol.


Custom Pig script Directory path or variable. Location of a custom Pig script, if applicable.

Applicable only for HDFS type Communication protocol.

Knox

Gateway and topology URL to your gateway and topology file. The topology file specifies the Hadoop cluster services that the Knox gateway accesses. Supports only WebHDFS communication protocol.

Server certificate Location of your Knox certificate. If you leave this option blank, Data Services establishes an unsecured connection.
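When you choose the Kerberos keytab authentication type described above, it can help to confirm the keytab on the Job Server host before configuring the file location. The following commands are an illustrative sketch only; the keytab path and principal are placeholders, and they assume the MIT Kerberos client tools (kinit, klist) are installed.

# List the principals stored in the keytab to confirm it is the expected one.
klist -kt /path/to/dsuser.keytab

# Obtain a ticket with the keytab for the Hadoop user principal.
kinit -kt /path/to/dsuser.keytab <hadoop-user>@<KERBEROS.REALM>

# Show the ticket cache to verify that the ticket was granted.
klist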

Related Information

File location common options
File location object
Prerequisites to Data Services configuration [page 15]
Setting up HDFS and Hive on Windows [page 21]
Connect to HDFS [page 27]

5.2 HDFS file format objects

An HDFS file format stores connection information to an HDFS file.

Use a file format to connect to source or target data when the data is stored in a file instead of a database table. To use an HDFS file format, do the following:

● Create a file format that defines the structure for a file.
● Drag and drop a file format into a data flow and specify whether it is a source or target.
● Specify connection information in the source or target file format editor.

HDFS file format options [page 33]
Create a Hadoop distributed file system (HDFS) file format in the File Format Editor in SAP Data Services.

Configuring custom Pig script results as source [page 38]
Use an HDFS file format and a custom Pig script to use the results of the Pig script as a source in a data flow.

Previewing HDFS file data [page 39]
Preview HDFS file data for delimited and fixed width file types.


Parent topic: Connect to HDFS [page 27]

Related Information

HDFS file location objects [page 27]

HDFS file format options [page 33]
Previewing HDFS file data [page 39]

5.2.1 HDFS file format options

Create a Hadoop distributed file system (HDFS) file format in the File Format Editor in SAP Data Services.

Access the following options in the source or target file editors when you use the HDFS file format in a data flow. Mode refers to creating a new file format, editing a file format, completing source options, or completing target options. The options in the following table appear in all modes.

Option Possible values Description Mode

Data File(s)

NameNode host Computer name, fully qualified domain name, IP address, or variable

Name of the NameNode computer.

If you use the following default settings, the local Hadoop system uses what is set as the default file system in the Hadoop configuration files.

● NameNode Host: default
● NameNode port: 0

All

NameNode port Positive integer or variable Port on which the NameNode listens.

If you use the following default settings, the local Hadoop system uses what is set as the default file system in the Hadoop configuration files.

● NameNode Host: default
● NameNode port: 0

All


Hadoop user Alphanumeric characters and underscores or variable

Hadoop user name.

If you use Kerberos authentication, include the Kerberos realm in the user name, for example <user name>@<Kerberos realm>.

All

Authentication Kerberos, Kerberos keytab

Indicates the type of authentication for the HDFS connection. Select either value for Hadoop and Hive data sources when they are Kerberos enabled.

Kerberos: Select when you have a password to enter in the Password option.

Kerberos keytab: Select when you have a generated keytab file. With this option, you do not need to enter a value for Password, but you enter a location for File Location.

A Kerberos keytab file contains a list of authorized users for a specific password. The software uses the keytab information instead of the entered password in the Password option. For more information about keytabs, see the MIT Kerberos documentation at http://web.mit.edu/kerberos/krb5-latest/doc/basic/keytab_def.html.

All


File Location File path Location for the applicable Kerberos keytab that you generated for this connection.

Note
This option is only available when you choose Kerberos keytab for the Authentication.

All

Password Alphanumeric characters and underscores or variable

Password associated with the selected authentication type.

This field is required for Authentication type Kerberos. This field is not applicable for Authentication type Kerberos keytab.

All

Root directory Directory path or variable Root directory path or variable name for the output file.

All

File name(s) Alphanumeric characters and underscores or variable

Select the source connection file name or browse to the file by clicking the dropdown arrow. For added flexibility, you can select a variable for this option or use the * wildcard.

All

Pig

Working directory Directory path or variable The Pig script uses this directory to store intermediate data.

Note
When you leave this option blank, Data Services creates and uses a directory in /user/sapds_temp, within the HDFS.

All


Clean up working directory Yes, No Yes: Deletes working directory files

No: Preserves working directory files

The software stores the Pig output file and other intermediate files in the working directory. Files include scripts, log files, and the <LINK_DIR>/log/hadoop directory.

Note
If you select No, intermediate files remain in both the Pig Working Directory and the Data Services directory <LINK_DIR>/log/hadoop.

All


Custom Pig script Directory path or variable Location of a custom Pig script.

Use the results of the script as a source in a data flow.

Custom Pig script can contain any valid Pig Latin command, including calls to any MapReduce jobs that you want to use with Data Services. See your Pig documentation for information about Pig Latin commands.

Custom Pig scripts must reside on and be runnable from the local file system that contains the Data Services Job Server that is configured for Hadoop. It is not the Job Server on HDFS. Any external reference or dependency in the script should be available on the Data Services Job Server machine configured for Hadoop.

To test your custom Pig script, execute the script from the command prompt and check that it finishes without errors. For example, you could use the following command:

$ pig -f myscript

To use the results of the Pig script as source data, use the HDFS file format as a source in a data flow.

All

Locale


Code page <default>

us-ascii

The applicable Pig code page.

The Default option uses UTF-8 for the code page. Select one of these options for better performance.

Note
For other types of code pages, Data Services uses HDFS API-based file reading.

All
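
As a quick check that the NameNode host and port values you plan to enter are reachable from the Job Server machine, you can list the root of the file system. This is only a sketch: the host name is a placeholder, and 8020 is a common NameNode port that may differ in your cluster.

$ hadoop fs -ls hdfs://namenode.example.com:8020/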

Related Information

Configuring custom Pig script results as source [page 38]

5.2.2 Configuring custom Pig script results as source

Use an HDFS file format and a custom Pig script to use the results of the Pig script as a source in a data flow.

Create a new HDFS file format or edit an existing one. Use the Pig section of the HDFS file format to create or locate a custom Pig script that outputs data.

Follow these steps to use the results of a custom Pig script in your HDFS file format as a source:

1. In the HDFS file format editor, select Delimited for Type in the General section.
2. Enter the location for the custom Pig script results output file in Root directory in the Data File(s) section.
3. Enter the name of the file to contain the results of the custom Pig script in File name(s).
4. In the Pig section, set Custom Pig script to the path of the custom Pig script. The location must be on the machine that contains the Data Services Job Server.
5. Complete the applicable output schema options for the custom Pig script.
6. Set the delimiters for the output file in the Delimiters section.
7. Save the file format.

Use the file format as a source in a data flow. When the software runs the custom Pig script in the HDFS file format, the software uses the script results as source data in the job.
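
The following is a minimal sketch of such a script and its test run from the Job Server's command prompt. The input path, output path, field names, and delimiter are assumptions, not values from this guide; the STORE location should match the Root directory and File name(s) that you enter in the file format.

cat <<'EOF' > myscript.pig
-- Load a delimited input file, keep rows with a positive id, and store the
-- result where the HDFS file format expects to read it.
raw = LOAD '/user/data/input.txt' USING PigStorage(',') AS (id:int, name:chararray);
filtered = FILTER raw BY id > 0;
STORE filtered INTO '/user/data/pig_out' USING PigStorage(',');
EOF
pig -f myscript.pig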

Related Information

HDFS file format objects [page 32]


5.2.3 Previewing HDFS file data

Preview HDFS file data for delimited and fixed width file types.

To preview the first 20 or so rows of an HDFS file:

1. Right-click an HDFS file name in the Format tab of the Local Object Library.
2. Click Edit.

The File Format Editor opens. You can only view the data. Sorting and filtering are not available when you view sample data in this manner.

Use one of the following methods to access HDFS file data so that you can view, sort, and filter the data:

● Right-click an HDFS source or target object in a data flow and click View Data.
● Click the magnifying glass icon located in the lower right corner of the HDFS source or target objects in the data flow.
● Right-click an HDFS file in the Format tab of the Local Object Library, click Properties, and then open the View Data tab.

Note
By default, the maximum number of rows displayed for data preview and filtering is 1000. Adjust the number lower or higher, up to a maximum of 5000. Perform the following steps to change the maximum number of rows to display:

1. Select Tools > Options > Designer > General.
2. Set the View data sampling size (rows) to the desired number of rows.

Related Information

Designer Guide: Viewing and adding filters
Designer Guide: Sorting


6 Connect to Hive

To connect to the remote Hive server, you create a Hive database datastore or a Hive adapter datastore.

Before you create a Hive adapter datastore, create the adapter in the Administrator module of the Management Console. For details about the Hive adapter, see the Supplement for Adapters. Use the Hive adapter datastore when SAP Data Services is installed within the Hadoop cluster. Use the Hive adapter datastore for server-named (DSN-less) connections. Also include SSL (or the newer Transport Layer Security TLS) for secure communication over the network.

Use a Hive database datastore when Data Services is installed on a machine either within the Hadoop cluster or not. Use the Hive database datastore for a DSN or a DSN-less connection. Also include SSL (or the newer TLS) for secure communication over the network.

Note
SAP Data Services supports Apache Hive and HiveServer2 version 0.11 and higher. Support for DSN-less connections for Hive datastores begins with Data Services 4.2.12. For the most recent compatibility information, see the Product Availability Matrix on the SAP Support Portal.

Hive datastores [page 41]
Use a Hive datastore to connect to a Hive server and work with tables in Hadoop.

Pushing the JOIN operation to Hive [page 56]
Stage non-Hive data in a data flow with the Data Transfer transform before joining it with a Hive source.

About partitions [page 57]
SAP Data Services imports Hive partition columns the same way as regular columns, but displays the partition columns at the end of the table column list.

Metadata mapping for Hive [page 58]
SAP Data Services matches Hive metadata data types to supported data types when it reads data from Hive adapter sources.

Hive data type conversion [page 59]
SAP Data Services converts some Apache Hive data types when importing data and when loading data into external tables or files.

Configuring Kerberos authentication for Hive connection [page 60]
Data Services supports Kerberos authentication for Hadoop and Hive data sources when you use Hadoop and Hive services that are Kerberos enabled.

Related Information

Hive adapter installation and configuration [page 42]
Hive adapter datastore configuration options [page 47]
Creating a DSN connection with SSL protocol in Windows [page 51]


6.1 Hive datastores

Use a Hive datastore to connect to a Hive server and work with tables in Hadoop.

With a Hive datastore, import Hive tables and use them as sources or targets in data flows.

Note
Data Services supports Apache Hive and HiveServer2 version 0.11 and higher. For the most recent compatibility information, see the Product Availability Matrix (PAM) on the SAP Support Portal.

Which type of Hive datastore to use

SAP Data Services supports two types of Hive datastores: Hive adapter datastore and Hive database datastore.

Hive database datastore

Use a Hive database datastore when SAP Data Services is installed on a machine either within the Hadoop cluster or not. Use the Hive database datastore for a DSN or a DSN-less connection. Also include SSL (or the newer TLS) for secure communication over the network.

Hive adapter datastore

Use the Hive adapter datastore when Data Services is installed within the Hadoop cluster. Use the Hive adapter datastore for server-named (DSN-less) connections. Also include SSL (or the newer TLS) for secure communication over the network.

The Hive adapter datastore requires a supported Hive ODBC driver, such as Cloudera, to connect remotely to the Hive Server.

For more information about configuring and using adapters, see the Supplement for Adapters.

Hive adapter installation and configuration [page 42]
Install and configure the Hive adapter instance before you run jobs using information from the Hive adapter.

Hive database datastores [page 49]
Use a Hive database datastore to access data in Hadoop through the HiveServer2.

Parent topic: Connect to Hive [page 40]


Related Information

Pushing the JOIN operation to Hive [page 56]
About partitions [page 57]
Metadata mapping for Hive [page 58]
Hive data type conversion [page 59]
Configuring Kerberos authentication for Hive connection [page 60]

6.1.1 Hive adapter installation and configuration

Install and configure the Hive adapter instance before you run jobs using information from the Hive adapter.

Note
Beginning with SAP Data Services 4.1 SP1, the Data Services installer automatically installs the Hive adapter.

After you install the Hive adapter, create an adapter instance and set adapter operations in the Data Services Management Console Administrator.

For complete information, see Adding an adapter instance in the Supplement for Adapters.

As you create the Hive adapter instance, perform the following steps specific to the Hive adapter:

● Download and install the JDBC driver directly from the Hive (Hadoop) official Web site. Make sure that the driver version matches the exact driver listed in the Product Availability Matrix for Hive.

● Save the driver to any folder that you've designated for the Hive JDBC driver. When you complete the Classpath option in the Management Console Administrator, append the full path of the JDBC driver, including the folder and JAR file name.

● Open the Adapter Status tab and start the Hive adapter instance.

After you create a Hive adapter instance, create a Hive adapter datastore in Data Services.
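
As a sketch of the JDBC driver step, assuming a hypothetical folder and the standalone JAR naming used by Apache Hive (check the PAM for the exact driver version you need):

# Copy the downloaded driver to a folder of your choice on the Job Server machine.
mkdir -p /opt/hive-jdbc
cp hive-jdbc-<version>-standalone.jar /opt/hive-jdbc/
# In the Management Console Administrator, append the full path, including the
# JAR file name, to the adapter's Classpath option, for example:
#   /opt/hive-jdbc/hive-jdbc-<version>-standalone.jar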

SSL connection support for Hive adapter [page 43]
SAP Data Services supports SSL connections for the Hive adapter.

Hive adapter source options [page 43]
Set specific source options when you use a Hive adapter datastore table as a source in a data flow.

Hive adapter target options [page 44]
Set specific Hive target options in the target editor when you use a Hive adapter datastore table as a target in a data flow.

Hive adapter datastore support for SQL function and transform [page 45]
The Hive adapter datastore can process data using SQL functions and the SELECT statement in a SQL transform.

Previewing Hive table data [page 46]
After you import Hive table metadata using a Hive datastore, preview data in Hive tables.

Using Hive template tables [page 46]
You can use Hive template tables as targets in your data flow.


Hive adapter datastore configuration options [page 47]
To configure a Hive adapter datastore, include connection information to your data in Hadoop.

Parent topic: Hive datastores [page 41]

Related Information

Hive database datastores [page 49]

6.1.1.1 SSL connection support for Hive adapter

SAP Data Services supports SSL connections for the Hive adapter.

The SSL connection support is through a database connection. For a database connection, configure SSL options on the Data Services server side. Provide Data Services with the necessary certificates.

Data Services automatically includes certificates in its Java keystore so that it recognizes an adapter datastore instance as a trusted Web site. However, if there’s an error regarding a certificate, manually add a certificate back into the Java keystore. For instructions, see the Supplement for Adapters.
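
If you do need to add a certificate manually, a minimal keytool sketch is shown below. The alias, certificate file, keystore path, and password are assumptions; the keystore that matters is the one used by the JRE that runs the adapter, and "changeit" is only the common default cacerts password. Follow the Supplement for Adapters for the authoritative procedure.

keytool -importcert -alias hiveserver_ssl -file /tmp/hiveserver_cert.pem \
  -keystore "$JAVA_HOME/lib/security/cacerts" -storepass changeit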

Related Information

SSL connection support

6.1.1.2 Hive adapter source options

Set specific source options when you use a Hive adapter datastore table as a source in a data flow.

Open the Adapter Source tab of the source table editor and complete the options. The following table contains Hive-specific options.


Hive adapter source option descriptions

Option Description

Clean up working directory Specifies whether Data Services cleans up the working directory after the job completes.

● True: Deletes working directory after successful job completion.
● False: Doesn't delete the working directory after successful job completion.

Execution engine type Specifies the type of engine to use for executing the job.

● Default: Uses the default Hive engine.
● Spark: Uses the Spark engine to read data from Spark.
● Map Reduce: Uses the Map Reduce engine to read data from Hive.

Parallel process threads Specifies the number of threads for parallel processing.

More than one thread can improve performance by maximizing CPU usage on the Job Server computer. For example, if you have four CPUs, enter 4 for the number of parallel process threads.

Related Information

Parallel process threads for flat files

6.1.1.3 Hive adapter target options

Set specific Hive target options in the target editor when you use a Hive adapter datastore table as a target in a data flow.

Open the Adapter Target tab of the target table editor and complete the Hive-specific options as described in the following table.

Hive adapter target option descriptions

Option Description

Append Specifies whether Data Services appends new data to the table or partition.

● True: Adds new data to the existing data in the table or partition.

● False: Deletes all existing data and adds new data to the table or partition.


Clean up working directory Specifies whether Data Services cleans up the working directory after the job completes.

● True: Deletes working directory after successful job completion.
● False: Doesn't delete the working directory after successful job completion.

Dynamic partition Specifies whether Hive evaluates the table partitions when it scans data before loading.

● True: Uses table partitions when scanning data before loading.

● False: Uses static partitions for loading data.

SAP Data Services supports only all-dynamic or only all-static partitions.

Drop and re-create table before loading Specifies whether to drop the existing table and create a new table with the same name before loading.

● True: Drops existing table and creates a new table before loading data.
● False: Doesn't drop the existing table, but uses the existing table for loading data.

The Drop and re-create table before loading option is applicable only when you use template tables in the design or test environment.

Number of loaders Specifies the number of loaders (threads) to run in parallel for loading data to the target table.

Specify a non-negative integer. The default is 1.

There are two types of loaders based on the number you enter:

● Single loader loading: Loading with one loader.
● Parallel loading: Loading when the number of loaders is greater than one.

6.1.1.4 Hive adapter datastore support for SQL function and transform

The Hive adapter datastore can process data using SQL functions and the SELECT statement in a SQL transform.

When you use a Hive table in a data flow, use the SQL transform and add SQL functions to manipulate the data in the table.


● Use the SQL Transform to select specific data from the Hive table to process.

Note
The SQL transform supports only a single SELECT statement. Also, SAP Data Services does not support SELECT for table columns with a constant expression.

● Use a sql() function to manipulate data in the following ways:
  ○ Create, drop, or INSERT Hive tables
  ○ Return a single string value from a Hive table
  ○ Select a Hive table that contains aggregate functions (max, min, count, avg, and sum)
  ○ Perform inner and outer joins

6.1.1.5 Previewing Hive table data

After you import Hive table metadata using a Hive datastore, preview data in Hive tables.

To preview Hive table data, first import table metadata using the Hive datastore. Then, right-click a Hive table name in the SAP Data Services Designer object library and click View Data.

Alternatively, click the magnifying glass icon on Hive source and target objects in a data flow or open the View Data tab of the Hive table view.

Note
The ability to preview Hive table data is available only with Apache Hive version 1.1 and later.

For more information about how to use the View Data tab, see the Designer Guide.

Related Information

Using View Data

6.1.1.6 Using Hive template tables

You can use Hive template tables as targets in your data flow.

Ensure that the Hive adapter datastore is correctly configured in both SAP Data Services Management Console and SAP Data Services. To add a Hive template table as a target, start to create a data flow in Data Services Designer and perform the following steps:

1. Add a Hive template table object to the data flow.

Use one of two methods:
○ Select a template table icon from the toolbar at right and click anywhere in the data flow in the workspace.


○ Expand the Template node under the applicable Hive adapter datastore in the object library and drag and drop a template table onto your workspace. Note that the template table has to already exist before it is in the object library.

The Create Template dialog box opens.
2. Enter a template table name in Template name.
3. Select the applicable Hive datastore name from the In datastore dropdown list.
4. Enter the Hive dataset name in Owner name.
5. Select the format of the table from the Format dropdown list.
6. Click OK to close the Create Template dialog box.
7. Connect the Hive template table to the data flow.
8. Click the template table target icon in the data flow to open the target editor.
9. Open the Target tab and set applicable options.

The software completes the input and output schema areas based on the schema in the stated Hive dataset.

10. Save your changes and execute the applicable job.

Data Services opens the applicable Hive project and dataset, and creates the table. The table name is the name that you entered for Template name in the Create Template window. Data Services populates the table with the data generated from the data flow.

6.1.1.7 Hive adapter datastore configuration options

To configure a Hive adapter datastore, include connection information to your data in Hadoop.

The following table contains descriptions for the datastore configuration options that apply to the Hive adapter datastore.

Hive adapter datastore option descriptions

Option Description

Datastore Type Select Adapter.

Adapter Instance Name Select the specific instance that you created in the Management Console.

Advanced options

User name Specifies the user name associated with the data to which you are connecting.

If you select Kerberos for the Authentication, include the Kerberos realm with the user name. For example: dsuser@BIGDATA.COM.

If you select Kerberos keytab for the Authentication, do not complete the User name option.

Password Specifies the password associated with the data to which you are connecting.


Local working directory Specifies the path to your local working directory.

HDFS working directory Specifies the path to your Hadoop Distributed File System (HDFS) directory. If you leave this blank, Data Services uses /user/sapds_hivetmp as the default.

Note
If you use Beeline CLI, enter the directory that your administrator created, and assign permission 755 to each directory in the path.

String size Specifies the size of the Hive STRING datatype. The default is 100.

SSL enabled Specifies whether to use SSL (Secure Socket Layer), or the newer Transport Layer Security (TLS), for secure communication over the network.

Select Yes to use an SSL connection to connect to the Hive server.

Note
If you use Kerberos or Kerberos keytab for authentication, set this option to No.

SSL Trust Store Specifies the path and file name of the trust store that verifies credentials and stores certificates.

Trust Store Password Specifies the password associated with the trust store.

Authentication Indicates the type of authentication you are using for the Hive connection:

● None
● Kerberos
● Kerberos keytab

Note
Complete the remaining Kerberos options based on your selection for Authentication.


Additional Properties Specifies additional connection properties.

For multiple property value pairs, use a semicolon as a delimiter between pairs. End the string of property values with a semicolon.

Example
name1=value1;name2=value2;

To enable SASL-QOP support, set the Authentication option to Kerberos. Then enter one of the following values, which should match the value on the Hive server:

● Authentication only: ;sasl.qop=auth;
● Authentication with integrity protection: ;sasl.qop=auth-int;
● Authentication with integrity and confidentiality protection: ;sasl.qop=auth-conf;
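
For the HDFS working directory option above, the directory must exist on HDFS with the permissions described in the note if your administrator has not already created it. A minimal sketch, assuming the default path:

$ hadoop fs -mkdir -p /user/sapds_hivetmp
$ hadoop fs -chmod 755 /user/sapds_hivetmp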

6.1.2 Hive database datastores

Use a Hive database datastore to access data in Hadoop through the HiveServer2.

Use a Hive database datastore for the following tasks:

● Import Hadoop tables and use them as sources and targets in data flows.
● Use a Hive template table in your data flow.
● Preview data from tables.

Configure a Hive database datastore with either a DSN or DSN-less connection, and include SSL encryption. Additionally, select one of the following Hive authentications:

● User name and password
● Kerberos
● User name
● No authentication

Supported Hive ODBC Drivers

The Hive database datastore supports the following ODBC drivers:

● Cloudera
● Hortonworks


● MapR

For more information about the specific driver versions currently supported, see the Product Availability Matrix (PAM) on the SAP Support Portal.

Limitations

● Operations such as DELETE and UPINSERT are not natively supported by the HiveServer2.
● Parameterized SQL is not supported; the HiveServer2 does not support the parameter marker.

Configuring ODBC driver in Windows [page 50]
For Windows, configure the Hive ODBC driver using the ODBC Drivers Selector utility.

Creating a DSN connection with SSL protocol in Windows [page 51]
Use the ODBC Data Source Administrator to create a DSN connection to use with a Hive database datastore, and optionally configure SSL/TLS protocol.

Configuring ODBC driver with SSL protocol for Linux [page 52]
For Linux, use the SAP Data Services Connection Manager to configure the ODBC driver and configure SSL protocol for the Hive database datastore.

Configuring bulk loading for Hive [page 53]
Use a combination of Hadoop objects to configure bulk loading for Hive targets in a data flow.

Updating Hive target tables using Update or Merge SQL [page 54]
You can update Hive target tables using Update or Merge SQL.

Hive database datastore option descriptions [page 55]
Complete options in the Hive database datastore to configure connection types, authorizations, and SSL security protocol settings.

Parent topic: Hive datastores [page 41]

Related Information

Hive adapter installation and configuration [page 42]

6.1.2.1 Configuring ODBC driver in Windows

For Windows, configure the Hive ODBC driver using the ODBC Drivers Selector utility.

Download and install a supported ODBC driver for Hive. Supported drivers include the following:

● Cloudera
● Hortonworks
● MapR


Obtain driver downloads and related information by going to the product Web page for the driver type that you select.

1. Open the ODBC Drivers Selector utility located in <DS_COMMON_DIR>\bin\ODBCDriversSelector.exe.

2. Select the ODBC driver for Hive server under the Database versions column.
3. Click the corresponding cell under the ODBC Drivers column and select the correct driver from the dropdown list.
4. Look at the value in the State column. If the status doesn't appear in the State column, click in the empty cell and the applicable status appears.

The state should be “Installed.” If the state is anything other than “Installed,” you may not have properly installed the ODBC driver. Exit the ODBC Drivers Selector utility and check your driver installation for errors. After you correct the errors, or reinstall the ODBC driver, repeat the steps to configure the driver.

6.1.2.2 Creating a DSN connection with SSL protocol in Windows

Use the ODBC Data Source Administrator to create a DSN connection to use with a Hive database datastore, and optionally configure SSL/TLS protocol.

Perform the following prerequisites before you configure a DSN connection with SSL protocol:

● Download and install a supported ODBC driver. For information, see Configuring ODBC driver in Windows [page 50].

● Generate an SSL certificate and key file by following the instructions in your Hive documentation.
● Access the ODBC Data Source Administrator in one of two ways:
  ○ Create the Hive database datastore, select Use data source name (DSN), and click ODBC Admin....
  ○ Open the ODBC Data Source Administrator using the Start menu in Windows.

1. In the ODBC Data Source Administrator, select the applicable tab and click Add.

Tabs: Select User DSN to create a DSN that is visible only to you. Select System DSN to create a DSN that is available to all users.

2. Select the applicable Hive ODBC driver from the list of drivers and click Finish.
3. Complete the applicable options in the DSN Setup dialog box.

   Example
   If you installed the Cloudera ODBC driver, the dialog box is Cloudera ODBC Driver for Apache Hive DSN Setup.

   The options to complete are based on the type of ODBC driver you use, and the service discovery mode.
4. Optional. Perform the following substeps to configure SSL protocol for your connection to Hive:
   a. Click SSL Options located at the bottom of the DSN Setup dialog box. The SSL Options dialog box opens.
   b. Select to enable SSL and enter values for the SSL options as applicable.
      For descriptions of the SSL options, see Hive database datastore option descriptions [page 55].
   c. Click Apply.


5. Click OK to close the ODBC Data Source Administrator and save your DSN.

Create the Hive database datastore and select the DSN that you just created.

6.1.2.3 Configuring ODBC driver with SSL protocol for Linux

For Linux, use the SAP Data Services Connection Manager to configure the ODBC driver and configure SSL protocol for the Hive database datastore.

Perform the following prerequisites before configuring the ODBC driver:

● Download and install a supported ODBC driver. Supported drivers include the following:
  ○ Cloudera
  ○ Hortonworks
  ○ MapR
● Generate an SSL certificate and key file following instructions in your Hive documentation.
● The Connection Manager is a command-line utility. However, install the GTK+2 library to make a UI for Connection Manager. For more information about obtaining and installing GTK+2, see https://www.gtk.org/ . The following instructions assume that you have the user interface for Connection Manager.

1. Open the Connection Manager by entering the following command:

$ cd $LINK_DIR/bin/
$ ./DSConnectionManager.sh

2. Select to configure the ODBC driver and enter values for the following information when the utility prompts you:

○ Location of the ODBC inst file
○ Driver version
○ Driver name
○ User name for the database
○ User password for the database
○ Driver location and file name
○ Host name
○ Port number
○ UNIX ODBC library path

3. Press Enter to go to the main menu.

4. Select to create the SSL protocol and enter values for the following parameters when the utility prompts you:

○ Hive authentication: noauth, kerberos, user, or user-passwd
○ Hive SSL mode: disabled or enabled
○ Hive SSL Server Certificate File
○ Hive two way SSL
○ Hive SSL Client Certificate File


○ Hive SSL Client Key File
○ Hive SSL Client Key Password

Note
For descriptions of the SSL parameters, see Hive database datastore option descriptions [page 55].
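
For the certificate prerequisite, the sketch below generates a self-signed certificate and key file with openssl purely as an illustration; the subject name is a placeholder, and production clusters should follow the certificate procedure in your Hive documentation instead.

openssl req -x509 -newkey rsa:2048 -nodes -days 365 \
  -keyout hive_ssl_key.pem -out hive_ssl_cert.pem \
  -subj "/CN=hiveserver2.example.com"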

6.1.2.4 Configuring bulk loading for Hive

Use a combination of Hadoop objects to configure bulk loading for Hive targets in a data flow.

Create the following objects:

● HDFS file location object
● HDFS file format
● Hive database datastore

To set up bulk loading to Hive, follow these steps:

1. Open the Format tab in the Local Object Library and expand the HDFS Files node.
2. Select the HDFS file format that you created for this task and drag it onto your data flow workspace.
3. Select Make Source.
4. Add the applicable transform objects to your data flow.
5. Add a template table as a target to the data flow:
   a. Select the template table icon from the tool palette at right.
   b. Click on a blank space in your data flow workspace.
   The Create Template dialog box opens.
6. Complete Template name with a new name for the target.
7. Select the Hive database datastore that you created for this task from the In datastore dropdown list.
8. Select a format from the Formats dropdown list.
9. Click OK.
10. Connect the template to the data flow.
11. In your data flow workspace, open the target table and open the Bulk Loader Options tab.
    The Bulk Load option is selected by default.
12. Select a mode from the Mode dropdown list.
    Because the target is a newly-created table, there is no data in the table. However, if you use the data flow in subsequent runs, the Mode affects the data in the target table.
    ○ Append: Adds new records generated from Data Services processing to the existing data in the target table.
    ○ Truncate: Replaces all existing records in the existing target table with the records generated from Data Services processing.
13. Select the HDFS file location object that you created for this task from the HDFS File Location drop-down list.
14. Complete the remaining target options as applicable.


Related Information

Data flows
Apache Hadoop [page 9]

6.1.2.5 Updating Hive target tables using Update or Merge SQL

You can update Hive target tables using Update or Merge SQL.

You can update Apache Hive Optimized Row Columnar (ORC) tables when using bulk loading and Hive version 1.2 and higher. This functionality is not available for template tables because bulk loading is not available.

Data Services first loads all of the changes to a staging table, using the existing Hive bulk load options and HDFS file location, and then applies the changes from the staging table to the target table by executing UPDATE or MERGE SQL.

To trigger bulk modification, the Bulk Load option must be enabled. In addition, one of the following must be true:

● Your data flow contains a Map_CDC_Operation transform.
● Your data flow contains a Map_Operation transform that outputs Delete or Update rows.
● The Auto correct load option is enabled.

If these requirements are not met, Data Services assumes that the input data contains only INSERT rows and it performs only a bulk insert. Data Services does not use a staging table or execution of additional SQL.

Hive ACID transaction support

For Hive ACID (Atomicity, Consistency, Isolation, Durability) transaction support, the table you are using should be bucketed and clustered by the column that you want treated as the primary key. In addition, the table format must be ORC.

You also need to set the following configuration parameters under Hive:

Configuration parameter Value

hive.support.concurrency true

hive.enforce.bucketing true

This is not required as of Hive 2.0.

hive.exec.dynamic.partition.mode nonstrict

hive.txn.manager org.apache.hadoop.hive.ql.lockmgr.DbTxnManager


hive.compactor.initiator.on true

This needs to be set for exactly one instance of the Thrift metastore service.

hive.compactor.worker.threads <positive number>

This needs to be set on at least one instance of the Thrift metastore service.

If you are using a Hive table in ACID (INSERT, UPDATE, DELETE), the table property transactional=true must be set on that table.

For example:

create table test_partition (col1 int, col2 string) clustered by (col1) into 5 buckets stored as orc tblproperties ('transactional'='true');
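
As a minimal sketch of applying these settings through Beeline (the JDBC URL, user name, and script file name are placeholders; the compactor parameters are cluster-wide settings that normally belong in hive-site.xml on the metastore service rather than in a session):

cat <<'EOF' > acid_setup.sql
-- Session-level parameters for ACID support.
SET hive.support.concurrency=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
-- Transactional ORC table from the example above.
CREATE TABLE test_partition (col1 INT, col2 STRING)
  CLUSTERED BY (col1) INTO 5 BUCKETS
  STORED AS ORC
  TBLPROPERTIES ('transactional'='true');
EOF
beeline -u "jdbc:hive2://hiveserver2.example.com:10000/default" -n dsuser -f acid_setup.sql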

6.1.2.6 Hive database datastore option descriptions

Complete options in the Hive database datastore to configure connection types, authorizations, and SSL security protocol settings.

The following table contains datastore option descriptions specific to a DSN-less Hive database datastore.

Hive database datastore option descriptions

Option Description

Datastore Type Select Database.

Database Type Select Hive.

Database Subtype Select Hive Server2.

Database Version Select the applicable version.

The following options appear when you create a server-named (DSN-less) connection.

Database server name Specifies the server for the client.

Hive Authentication Specifies the type of authentication to use to access the Hive Server.

● User Name and Password: Requires that you enter the authorized user for the Hive Server in User Name, and enter the related password in Password.

● Kerberos: Requires that you enter Kerberos authentication information.
● User Name: Requires that you enter the authorized user.
● No Authentication: No additional information required.

For more information about Kerberos, see Configuring Kerberos authentication for Hive connection [page 60].


Complete the following options for SSL encryption. Applicable for DSN-less connection.

Use SSL encryption Select Yes.

Encryption Parameters Opens the Encryption Parameters dialog box.

To open, double-click in the empty cell next to the option or click the … icon that appears at the end of the cell when you place your cursor in the cell.

Complete the following options in the Encryption Parameters dialog box.

Allow Common Name Host Name Mismatch Specifies whether the CA certificate name can be different than the Hive Server host name.

Select this option to allow the CA certificate name to be different than the Hive Server host name.

Allow Self-Signed Server Certificate Specifies whether to allow a certificate signer to be the certificate approver.

Select this option to allow the same signatures.

Trusted Certificates Specifies the path for the directory of Certificate Authority certificate files.

Two-Way SSL Specifies to allow a two-way SSL authentication.

Client Certificate File Specifies the location of the client certificate file.

Client Private Key File Specifies the location of the client private key file.

Client Private Key Password Specifies the password to access the client private key file.

6.2 Pushing the JOIN operation to Hive

Stage non-Hive data in a data flow with the Data Transfer transform before joining it with a Hive source.

When you include a join operation in a data flow between Hive and non-Hive data, stage the non-Hive data before the operation for better performance. Staging data is more efficient because SAP Data Services doesn't have to read all the data from the Hive data source into memory before performing the join.

Before you stage the data, enable the Enable automatic data transfer option in the Hive datastore editor.

When you construct the data flow, add the Data_Transfer transform. Open the transform editor and make the following settings:

● Transfer Type = Table
● Database type = Hive


Caution
For non-Hive relational databases: In the Data_Transfer transform, if the option Data Transfer Type is set to Automatic, disable the option Enable automatic data transfer. This rule applies to all relational databases except for Hive.

Parent topic: Connect to Hive [page 40]

Related Information

Hive datastores [page 41]
About partitions [page 57]
Metadata mapping for Hive [page 58]
Hive data type conversion [page 59]
Configuring Kerberos authentication for Hive connection [page 60]
Data_Transfer transform for push-down operations

6.3 About partitions

SAP Data Services imports Hive partition columns the same way as regular columns, but displays the partition columns at the end of the table column list.

The column attribute Partition Column identifies whether the column is partitioned.

When loading to a Hive target, select whether or not to use the Dynamic partition option on the Adapter Target tab of the target table editor.

Hive evaluates the partitioned data dynamically when it scans the data. If Dynamic partition is not selected, Data Services uses Hive static loading, in which it loads all rows to the same partition. The partitioned data comes from the first row that the loader receives.
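
Outside of Data Services, the Hive behavior that the Dynamic partition option relies on looks roughly like the following sketch; the table, staging source, and column names are hypothetical, and the block only illustrates dynamic versus static partition loading.

cat <<'EOF' > partition_demo.sql
-- With dynamic partitioning, Hive derives the partition value (sale_date)
-- from each row; with static loading, every row goes to one named partition.
SET hive.exec.dynamic.partition.mode=nonstrict;
CREATE TABLE IF NOT EXISTS sales_part (id INT, amount DOUBLE) PARTITIONED BY (sale_date STRING);
INSERT INTO TABLE sales_part PARTITION (sale_date)
  SELECT id, amount, sale_date FROM sales_staging;
EOF
beeline -u "jdbc:hive2://hiveserver2.example.com:10000/default" -f partition_demo.sql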

Parent topic: Connect to Hive [page 40]

Related Information

Hive datastores [page 41]
Pushing the JOIN operation to Hive [page 56]
Metadata mapping for Hive [page 58]
Hive data type conversion [page 59]
Configuring Kerberos authentication for Hive connection [page 60]


Hive adapter target options [page 44]

6.4 Metadata mapping for Hive

SAP Data Services matches Hive metadata data types to supported data types when it reads data from Hive adapter sources.

The following table shows the conversion between Hive data types and Data Services data types when Data Services imports metadata from a Hive source or target.

Data type mapping between Hive and Data Services

Hive data type Data Services data type

tinyint int

smallint int

int int

bigint decimal(20,0)

float real

double double

string varchar

boolean varchar(5)

complex not supported

Parent topic: Connect to Hive [page 40]

Related Information

Hive datastores [page 41]
Pushing the JOIN operation to Hive [page 56]
About partitions [page 57]
Hive data type conversion [page 59]
Configuring Kerberos authentication for Hive connection [page 60]


6.5 Hive data type conversion

SAP Data Services converts some Apache Hive data types when importing data and when loading data into external tables or files.

The following table shows the conversion between Apache Hive data types and Data Services data types. Data Services converts Apache Hive data types to Data Services data types when you import metadata from an Apache Hive source or target into the repository. Data Services also converts data types back to Apache Hive data types when it loads data into an external table or file.

Hive data type Data Services data type Additional information

TINYINT INT

SMALLINT INT

INT/INTEGER INT

BIGINT DECIMAL(19,0) By default, the precision is 19.

FLOAT DOUBLE

DOUBLE DOUBLE

DECIMAL DECIMAL

VARCHAR VARCHAR

CHAR VARCHAR

STRING VARCHAR(255)

BOOLEAN INT

TIMESTAMP DATETIME

Date Date

INTERVAL Not Supported Available with Hive 1.2.0 and later

complex Not Supported Complex types are array, map, and so on.

If Data Services encounters a column that has an unsupported data type, it does not import the column. However, you can configure Data Services to import unsupported data types. In the applicable datastore, check the Import unsupported data types as VARCHAR of size checkbox located in the left corner of the datastore editor dialog box.

Parent topic: Connect to Hive [page 40]


Related Information

Hive datastores [page 41]
Pushing the JOIN operation to Hive [page 56]
About partitions [page 57]
Metadata mapping for Hive [page 58]
Configuring Kerberos authentication for Hive connection [page 60]

6.6 Configuring Kerberos authentication for Hive connection

Data Services supports Kerberos authentication for Hadoop and Hive data sources when you use Hadoop and Hive services that are Kerberos enabled.

Configure Kerberos authentication in your Hive adapter or Hive database datastore with a DSN or server name (DSN-less) connection.

Note
● You cannot use SSL and Kerberos or Kerberos keytab authentication together. Set the SSL enabled option to No when using Kerberos authentication.
● To enable SASL-QOP support for Kerberos, enter a sasl.qop value into the Additional Properties field in the datastore editor.

Ensure that your Hive service is Kerberos-enabled and that you have the required Kerberos information to complete the configuration in the datastore.

To configure the datastore for Kerberos authentication, create a Hive datastore or edit an existing Hive datastore, and perform the following steps.

1. Select the authentication type.
   a. In the Hive adapter datastore editor, select Kerberos for Authentication.
   b. In the Hive database datastore editor, select Kerberos for Hive Authentication.

2. Complete the Kerberos options as described in the following table.

Kerberos option descriptions for Hive adapter datastore

Option Description

Kerberos Realm Specifies the name of your Kerberos realm. A realm contains the services, host machines, application servers, and so on, that users can access. For example, BIGDATA.COM.


Kerberos KDC Specifies the server name of the Key Distribution Center (KDC). The KDC database stores secret keys for user machines and services.

Configure the Kerberos KDC with renewable tickets (ticket validity as required by Hadoop Hive installation).

Note
Data Services supports MIT KDC and Microsoft AD for Kerberos authentication.

Kerberos Hive Principal The Hive principal name for the KDC. The name can be the same as the user name that you use when installing Data Services. Find the Hive service principal information in the hive-site.xml file. For example, hive/<hostname>@realm.

Kerberos Keytab location Location for the applicable Kerberos keytab that you generated for this connection.

A Kerberos keytab file contains a list of authorized users for a specific password. SAP Data Services uses the keytab information instead of the entered password in the Username and Password option. For more information about keytabs, see the MIT Kerberos documentation on the Massachusetts Institute of Technology (MIT) Website.

Kerberos option descriptions for Hive database datastore

Option Description

Kerberos Service Name Specifies the name of the Kerberos service for Kerberos authentication.

Kerberos Host FQDN Specifies the Fully Qualified Domain Name (FQDN) for the Kerberos host.

Kerberos Realm Specifies the name of your Kerberos realm. A realm contains the services, host machines, application servers, and so on, that users can access. For example, BIGDATA.COM.

3. Complete the remaining options in the datastore as applicable.
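
Before you enter keytab details in the datastore, it can help to confirm that the keytab actually authenticates against your KDC. The principal and keytab path below are placeholders; adjust them to your environment.

$ kinit -kt /etc/security/keytabs/dsuser.keytab dsuser@BIGDATA.COM
$ klist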

Task overview: Connect to Hive [page 40]

Related Information

Hive datastores [page 41]
Pushing the JOIN operation to Hive [page 56]
About partitions [page 57]


Metadata mapping for Hive [page 58]
Hive data type conversion [page 59]


7 Cloudera Data Platform (CDP) Private Cloud Base

CDP Private Cloud Base enables you to have the speed of cloud processing with consistent, on-premise security and governance.

Use SAP Data Services to access Hive and Impala data on your CDP Private Cloud Base.

Note
Support for CDP Private Cloud Base begins with Data Services version 4.2 Support Pack 14 Patch 6 and CDP version 7.1. The required version for the ODBC driver for Hive and Impala begins with version 2.6.9.

For complete information about CDP Private Cloud Base, including upgrade information, see the Cloudera Docs, CDP Private Cloud Base.

For required CDP Private Cloud Base and ODBC driver versions, see the Product Availability Matrix (PAM). Select Open in New Window under Essential Information.

Checking your CDP Private Cloud Base connection [page 64]
Before you work with your CDP Private Cloud Base, ensure that your connections enable you to access tables in Hive and Impala, and check files in HDFS.

Download CDP ODBC client driver [page 65]
All processing of your Hive and Impala data through your CDP Private Cloud Base requires an applicable Cloudera ODBC client driver.

Configure CDP ODBC client driver in Linux [page 65]
Configure the CDP ODBC client driver using the SAP Data Services Connection Manager.

Configure CDP Hive DSN-less connections in Windows [page 67]
To configure a DSN-less connection for Hive on a Windows platform, use the ODBC Drivers Selector utility in SAP Data Services.

Configure CDP DSN on Windows [page 68]
For Windows platform, use the CDP (Cloudera Data Platform) Private Cloud Base ODBC client driver to configure a DSN (data source name) for Apache Hive or Apache Impala.

Configure CDP DSN on Linux [page 69]
For Linux platform, use the CDP (Cloudera Data Platform) Private Cloud Base ODBC client driver to configure a DSN (data source name) for Apache Hive or Apache Impala.

CDP datastore options for Apache Hive [page 72]
To connect to your Apache Hive data, create an SAP Data Services datastore.

CDP datastore options for Apache Impala [page 73]
To connect to your Apache Impala data, create an SAP Data Services datastore.

Configure CDP HDFS file location object [page 74]
Use a Hadoop distributed file system (HDFS) file location to access your Hadoop data in your CDP (Cloudera Data Platform) Private Cloud Base for SAP Data Services processing.


7.1 Checking your CDP Private Cloud Base connection

Before you work with your CDP Private Cloud Base, ensure that your connections enable you to access tables in Hive and Impala, and check files in HDFS.

Before you check tables and files, ensure that you've completed the following tasks:

1. You have the credentials to access the basic CDP portal.

Note

For information about the CDP portal, see Cloudera Manager in your Cloudera Docs.

2. You have credentials to access CDP clusters.
3. Check that your Hive Server2 and Impala services are running successfully.

To check that you can access and run commands in your configured environments, perform the following tasks:

1. Check tables in Hive:
   a. Open a Linux command prompt.
   b. Enter your credentials to access CDP Clusters.
   c. Log into Hive and use Beeline to run commands on data in your Hive tables.
2. Check tables in Impala:
   a. Open a Linux command prompt.
   b. Enter your user credentials to access CDP Clusters.
   c. Run impala-shell commands on your Impala tables.
3. Check files on HDFS:
   a. Open a Linux command prompt.
   b. Run the following command:

      hadoop fs -ls <dir>

   c. Run additional commands using hadoop fs to check your HDFS files.
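
The following sketch shows one way to run these checks from the command line; the host names, ports, and database are placeholders for your cluster (10000 and 21050 are common defaults for HiveServer2 and Impala).

$ beeline -u "jdbc:hive2://hiveserver2.example.com:10000/default" -e "SHOW TABLES;"
$ impala-shell -i impalad.example.com:21050 -q "SHOW TABLES;"
$ hadoop fs -ls /user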

Task overview: Cloudera Data Platform (CDP) Private Cloud Base [page 63]

Related Information

Download CDP ODBC client driver [page 65]
Configure CDP ODBC client driver in Linux [page 65]
Configure CDP Hive DSN-less connections in Windows [page 67]
Configure CDP DSN on Windows [page 68]
Configure CDP DSN on Linux [page 69]
CDP datastore options for Apache Hive [page 72]
CDP datastore options for Apache Impala [page 73]
Configure CDP HDFS file location object [page 74]


Configure CDP HDFS file location object [page 74]

7.2 Download CDP ODBC client driver

All processing of your Hive and Impala data through your CDP Private Cloud Base requires an applicable Cloudera ODBC client driver.

Before you perform the following steps, consult the latest SAP Product Availability Matrix (PAM) for version information.

To download a Cloudera ODBC client driver, perform the following steps:

1. Open the Cloudera Enterprise Downloads Web page and enter your login credentials.
2. Select the Cloudera ODBC client driver:
   ○ Select either Windows or Linux platform
   ○ Select driver for Hive or Impala
   ○ Select applicable version

3. Select to download.

Save the ODBC client driver to an applicable location on your server.

After you download the ODBC client driver, configure the driver.

Task overview: Cloudera Data Platform (CDP) Private Cloud Base [page 63]

Related Information

Checking your CDP Private Cloud Base connection [page 64]
Configure CDP ODBC client driver in Linux [page 65]
Configure CDP Hive DSN-less connections in Windows [page 67]
Configure CDP DSN on Windows [page 68]
Configure CDP DSN on Linux [page 69]
CDP datastore options for Apache Hive [page 72]
CDP datastore options for Apache Impala [page 73]
Configure CDP HDFS file location object [page 74]

7.3 Configure CDP ODBC client driver in Linux

Configure the CDP ODBC client driver using the SAP Data Services Connection Manager.

Use the following steps for Hive or Impala drivers.


1. Open the Connection Manager by command or by opening the file DSConnectionManager.sh in your install directory.

By default, the DSConnectionManager.sh is located in <$LINK_DIR>/bin/.

2. Complete each prompt in the Connection Manager as applicable.

The prompts are the same for Hive and Impala. The values you enter are based on whether you are configuring for Hive or Impala.

Example
The following code sample shows the Connection Manager prompts and values for configuring a Cloudera ODBC client driver for Hive.

We've added bold font to draw your attention to the selection and entered values.

***************************************************
------------------Start Menu------------------
Connection Manager is used to configure Data Sources or Drivers.
 1: Configure Data Sources
 2: Configure Drivers
 q: Quit Program
Select one command:'1' 2
------------------Main Menu-Configure Drivers------------------
Drivers:
 Index Database Type Version State Name
 ------ ------ ------ ------ ------
 *current list of drivers
The command list:
Use Ctrl+B+Enter to go back to the main menu.
 a: Add a new driver
 e: Edit an existing driver
 d: Delete an existing driver
 s: Restart the Data Services Job Service and the EIM Adaptive Processing Server
 q: Quit to Start Menu
Select one command: a
Specify one database type:
 1) MySQL Driver for version [8.x, 5.7, 5.6, 5.5, 5.1, 5.0]
 2) HANA Driver for version [2, 1]
 3) Teradata Driver for version [16.20, 16.00, 15.10, 15.00, 14.10, 14.00, 13.10, 13.0, 12]
 4) Netezza Driver for version [7, 6, 5, 4]
 5) Sybase IQ Driver for version [15, 16]
 6) Informix Driver for version [12, 11]
 7) DB2 UDB Driver for version [11, 10, 9]
 8) Oracle Driver for version [19, 18, 12, 11, 10, 9]
 9) SQL Anywhere Driver for version [17, 16, 12]
 10) Snowflake Driver for version [3.x]
 11) Hive Server Driver for version [2.x.y, 2.1.2, 2.1.1]
 12) PostgreSQL Driver for version [10.x]
 13) Google BigQuery Driver
 14) Amazon Redshift Driver for version [8.x]
Specify database index #'1' 11
********************************Configuration for Hive Server********************************
The ODBC inst file is <path_to_ini_file>/odbcinst.ini.
Specify the Driver Version: <driver.version>
Specify the Driver Name: <driver_name>


Specify the Driver: </path/lib>/64/libclouderahiveodbc64.so
Specify the Unix ODBC Lib Path: /<lib_path>/unixODBC-<version>/lib
Specify the Host Name: <server_host_name>
Specify the Port:'<port_number>'
Specify the User Name: <Hive user name>
Specify the Hive Password Type[0:normal|1:file]:'0'
Type database password:(no echo) *Password doesn't appear when you type it for security
Retype database password:(no echo)
Specify the Hive Auth Mech[0:noauth|1:kerberos|2:user|3:user-passwd]: 2
Specify the Hive SSL Mode[0:disabled|1:enabled]:'0'
Testing connection...
Successfully added driver.
Press Enter to go back to the Main Menu.

Task overview: Cloudera Data Platform (CDP) Private Cloud Base [page 63]

Related Information

Checking your CDP Private Cloud Base connection [page 64]
Download CDP ODBC client driver [page 65]
Configure CDP Hive DSN-less connections in Windows [page 67]
Configure CDP DSN on Windows [page 68]
Configure CDP DSN on Linux [page 69]
CDP datastore options for Apache Hive [page 72]
CDP datastore options for Apache Impala [page 73]
Configure CDP HDFS file location object [page 74]

7.4 Configure CDP Hive DSN-less connections in Windows

To configure a DSN-less connection for Hive on a Windows platform, use the ODBC Drivers Selector utility in SAP Data Services.

Use the following steps for Hive DSN-less connections.

1. Open the ODBC Drivers Selector utility located in <DS_COMMON_DIR>\bin\ODBCDriversSelector.exe.

2. Find the row for Hive Server2 ODBC Driver under the Database versions column.
3. Select Cloudera ODBC driver for Hive under the ODBC Driver column list.
4. Look at the value in the State column. If the status doesn't appear in the State column, select the empty cell. The applicable status appears.


The state should be “Installed”. If the state is anything other than “Installed”, you may not have properly installed the ODBC driver. Exit the ODBC Drivers Selector utility and check your driver installation for errors. After you correct the errors, or reinstall the ODBC driver, repeat the steps to configure the driver.

5. Select OK.

Task overview: Cloudera Data Platform (CDP) Private Cloud Base [page 63]

Related Information

Checking your CDP Private Cloud Base connection [page 64]
Download CDP ODBC client driver [page 65]
Configure CDP ODBC client driver in Linux [page 65]
Configure CDP DSN on Windows [page 68]
Configure CDP DSN on Linux [page 69]
CDP datastore options for Apache Hive [page 72]
CDP datastore options for Apache Impala [page 73]
Configure CDP HDFS file location object [page 74]

7.5 Configure CDP DSN on Windows

For Windows platform, use the CDP (Cloudera Data Platform) Private Cloud Base ODBC client driver to configure a DSN (data source name) for Apache Hive or Apache Impala.

Before you perform the following steps, download and configure the applicable Cloudera ODBC client driver.

Note
The following process is for Hive but the process is similar for Apache Impala.

1. Open the Windows ODBC Data Sources application from the Windows Start menu.

If you create the DSN during datastore creation, access the ODBC Data Source application by selecting the ODBC Admin button in the datastore editor.

2. Open the System DSN tab.
3. Select the Cloudera ODBC client driver for Hive from the list and select Configure.
   The Cloudera ODBC Driver for Apache Hive DSN Setup dialog box opens.
4. Complete the options as applicable.

The following table contains example options and values.


Option Value

Data Source Name Enter a unique name for this DSN

Description Optional

Hive Server Type Hive Server 2

Service Discovery Mode No Service Discovery

Host Enter the host name for your Hive server

Port Enter the port number for your Hive server

Database default

Mechanism Select the type of authentication you use:
○ No authentication
○ Kerberos
○ User Name
○ User Name and Password

Authentication Enter the credentials based on the Mechanism option that you select.

5. Optional: Select Test.
6. Select OK.

Task overview: Cloudera Data Platform (CDP) Private Cloud Base [page 63]

Related Information

Checking your CDP Private Cloud Base connection [page 64]
Download CDP ODBC client driver [page 65]
Configure CDP ODBC client driver in Linux [page 65]
Configure CDP Hive DSN-less connections in Windows [page 67]
Configure CDP DSN on Linux [page 69]
CDP datastore options for Apache Hive [page 72]
CDP datastore options for Apache Impala [page 73]
Configure CDP HDFS file location object [page 74]

7.6 Configure CDP DSN on Linux

For Linux platform, use the CDP (Cloudera Data Platform) Private Cloud Base ODBC client driver to configure a DSN (data source name) for Apache Hive or Apache Impala.

Before you perform the following steps, download and configure the applicable Cloudera ODBC client driver.


Note
The following process is for Hive but the process is similar for Apache Impala.

1. Run the DSConnectionManager.sh script, for example by entering the following at a command prompt:

$ cd $LINK_DIR/bin/
$ ./DSConnectionManager.sh

The Data Services Connection Manager starts.

2. Complete the prompts as they appear.

Example
The following code sample shows the Connection Manager prompts and values for configuring a DSN with a Cloudera ODBC client driver for Apache Hive. The prompts are similar for Apache Impala.

We've added bold font to draw your attention to the selections and entered values.

***************************************************
 SAP Data Services Connection Manager
***************************************************
------------------Start Menu------------------
Connection Manager is used to configure Data Sources or Drivers.
 1: Configure Data Sources
 2: Configure Drivers
 q: Quit Program
Select one command:'1' 1
------------------Main Menu-Configure Data Sources------------------
Data Sources:
 Index   Database Type   Name
 ------  ------          ------
 No database source is configured.
The command list:
Use Ctrl+B+Enter to go back to the main menu.
 a: Add a new database source
 e: Edit an existing database source
 d: Delete an existing database source
 r: Replicate an existing database source
 s: Restart the Data Services Job Service and the EIM Adaptive Processing Server
 q: Quit to Start Menu
Select one command: a
Specify one database type:
 1) MySQL
 2) Microsoft SQL Server via DataDirect
 3) SAP HANA
 4) IBM DB2 on iSeries or zSeries via DataDirect
 5) Teradata
 6) Netezza NPS
 7) Sybase IQ
 8) Sybase ASE
 9) IBM Informix IDS
 10) Attunity
 11) SQL Anywhere
 12) HP Vertica
 13) Amazon Redshift
 14) Apache Cassandra
 15) Hive Server
 16) Apache Impala
 17) SAP Vora
 18) Snowflake


 19) PostgreSQL
 20) Google BigQuery
Specify database index #'1' 15
********************************
Configuration for Hive Server
********************************
 The ODBC ini file is <path>/odbc.ini
 There are available DSN names in the file:
 <list of existing DSN files>
 Specify the DSN name from the list or add a new one: <new DSN name>
 Specify the User Name: <Hive user name>
 Specify the Hive Password Type[0:normal|1:file]:'0' 0
 Type database password:(no echo)
 *The password doesn't appear when you type it for security reasons
 Retype database password:(no echo)
 Specify the Unix ODBC Lib Path: /<lib_path>/unixODBC-2.3.2/lib
 Specify the Driver: /<path>/64/libclouderahiveodbc64.so
 Specify the Driver Version: <driver.version>
 Specify the Host Name: <server host name>
 Specify the Port:'<port number>'
 Specify the Database:'default'
 Specify the Hive Service Discovery Mode[0:disabled|1:enabled]:'0' 0
 Specify the Hive Auth Mech[0:noauth|1:kerberos|2:user|3:user-passwd]: 2
 Specify the Hive Thrift Transport[0:binary|1:sasl|2:http]: 1
 Specify the Hive SSL Mode[0:disabled|1:enabled]:'0' 0
 Testing connection...
 Successfully added database source.
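The Connection Manager stores the database source that it adds as a DSN entry in the odbc.ini file named at the top of the prompts. The following sketch shows approximately what such an entry could look like for the Cloudera Hive ODBC driver; the DSN name, paths, host, and port are placeholders, and the exact key names can differ between driver versions, so compare the entry against the sample .ini file that ships with your driver rather than copying it as-is.

[<new DSN name>]
Driver=/<path>/64/libclouderahiveodbc64.so
HOST=<server host name>
PORT=<port number>
Schema=default
HiveServerType=2
ServiceDiscoveryMode=0
AuthMech=2
UID=<Hive user name>
ThriftTransport=1
SSL=0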

Task overview: Cloudera Data Platform (CDP) Private Cloud Base [page 63]

Related Information

Checking your CDP Private Cloud Base connection [page 64]
Download CDP ODBC client driver [page 65]
Configure CDP ODBC client driver in Linux [page 65]
Configure CDP Hive DSN-less connections in Windows [page 67]
Configure CDP DSN on Windows [page 68]
CDP datastore options for Apache Hive [page 72]
CDP datastore options for Apache Impala [page 73]
Configure CDP HDFS file location object [page 74]


7.7 CDP datastore options for Apache Hive

To connect to your Apache Hive data, create an SAP Data Services datastore.

If you plan to use a DSN (data source name) connection, configure the DSN either before you create the datastore, or during datastore creation. For information about creating a DSN, see Configure CDP DSN on Windows [page 68] or Configure CDP DSN on Linux [page 69].

To open the datastore editor, select Tools > New > Datastore.

The following table contains datastore options applicable for creating a datastore for Apache Hive. For descriptions of common and advanced datastore options, see the Designer Guide.

Option Value

Datastore Name Enter a unique name for the datastore.

Datastore Type Select Database.

Database Type Select Hive.

Database Subtype Select Hive Server 2.

Database Version Select Hive Server 2 auto.

Use data source name (DSN) Select to create a DSN connection.

If you select Use data source name (DSN), complete the following options:

Data Source Name Select the applicable DSN.

User Name Enter your Hive user name.

Password Enter your Hive password.

Complete the following options for a DSN-less connection:

Database server name Enter the server name.

Hive Authentication Select the applicable authentication type. The remaining options are based on the authentication type that you select.

Advanced Select Advanced and complete advanced options as applicable.

Parent topic: Cloudera Data Platform (CDP) Private Cloud Base [page 63]

Related Information

Checking your CDP Private Cloud Base connection [page 64]
Download CDP ODBC client driver [page 65]
Configure CDP ODBC client driver in Linux [page 65]
Configure CDP Hive DSN-less connections in Windows [page 67]
Configure CDP DSN on Windows [page 68]
Configure CDP DSN on Linux [page 69]


CDP datastore options for Apache Impala [page 73]
Configure CDP HDFS file location object [page 74]

7.8 CDP datastore options for Apache Impala

To connect to your Apache Impala data, create an SAP Data Services datastore.

Configure a DSN (data source name) either before you create the datastore, or during datastore creation. For information about creating a DSN, see Configure CDP DSN on Windows [page 68] or Configure CDP DSN on Linux [page 69].

Note
A DSN connection is required for connecting to Impala.

To open the datastore editor, select Tools > New > Datastore.

The following table contains datastore options applicable for creating a datastore for Apache Impala. For descriptions of common and advanced datastore options, see the Designer Guide.

Option Value

Datastore Name Enter a unique name for the datastore.

Datastore Type Select Database.

Database Type Select ODBC.

Data Source Name Select the DSN that you created for this connection.

User Name Enter your Impala user name.

Password Enter your Impala password.

Advanced Select Advanced and set advanced options as applicable.

Parent topic: Cloudera Data Platform (CDP) Private Cloud Base [page 63]

Related Information

Checking your CDP Private Cloud Base connection [page 64]
Download CDP ODBC client driver [page 65]
Configure CDP ODBC client driver in Linux [page 65]
Configure CDP Hive DSN-less connections in Windows [page 67]
Configure CDP DSN on Windows [page 68]
Configure CDP DSN on Linux [page 69]
CDP datastore options for Apache Hive [page 72]
Configure CDP HDFS file location object [page 74]


7.9 Configure CDP HDFS file location object

Use a Hadoop distributed file system (HDFS) file location to access your Hadoop data in your CDP (Cloudera Data Platform) Private Cloud Base for SAP Data Services processing.

Before you perform the following steps, follow the instructions in HDFS file location objects [page 27].

1. Right-click the File Locations node in the Formats tab of the object library and select New.
2. Enter a unique name for the file location object in Name.
3. Select Hadoop for Protocol.
4. Select WEBHDFS for Communication Protocol.
5. Enter the host for your CDP clusters in Host.
6. Enter the port number, user name, and password for your CDP clusters in Port, User, and Password, respectively.
7. Enter remaining applicable values in the File Location editor.

For complete information about file location objects and options, see the Designer Guide.
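To confirm connectivity outside of Data Services, you can optionally call the WebHDFS REST endpoint directly from the Job Server host. The command below is a sketch that assumes the standard WebHDFS path and simple authentication; replace the host, port, user, and directory placeholders with the values you entered in the file location object.

curl -i "http://<cdp_host>:<port>/webhdfs/v1/<remote_directory>?op=LISTSTATUS&user.name=<user>"

A successful call returns HTTP 200 with a JSON FileStatuses list; connectivity or authentication failures here typically surface as the same errors when Data Services uses the file location.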

Task overview: Cloudera Data Platform (CDP) Private Cloud Base [page 63]

Related Information

Checking your CDP Private Cloud Base connection [page 64]
Download CDP ODBC client driver [page 65]
Configure CDP ODBC client driver in Linux [page 65]
Configure CDP Hive DSN-less connections in Windows [page 67]
Configure CDP DSN on Windows [page 68]
Configure CDP DSN on Linux [page 69]
CDP datastore options for Apache Hive [page 72]
CDP datastore options for Apache Impala [page 73]


8 Google Cloud Dataproc

To connect to an Apache Hadoop web interface running on Google Cloud Dataproc clusters, use a Hive database datastore and a WebHDFS file location.

Use a Hive datastore to browse and view metadata from Hadoop and to import metadata for use in data flows. To upload processed data, use a Hadoop file location and a Hive template table. Implement bulk loading in the target editor in a data flow where you use the Hive template table as a target.

Before you use a Hive datastore for accessing Hadoop data through Google Cloud Dataproc, ensure that you properly configure your Dataproc cluster to allow access from third-party clients. For information and instructions, see your Google Cloud Platform documentation at https://cloud.google.com/products/#data-analytics .

Prepare for accessing Google Dataproc by performing the following tasks:

● Download and configure the supported Hortonworks or Cloudera ODBC driver.
● Create a data source name (DSN) for the connection type in the datastore.
● Enter the IP address for the port-forwarding machine in the DSN Host field.

After all prerequisite tasks are complete, create the Hive datastore and the WebHDFS file location.
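How you set up the port-forwarding machine depends on your environment. As one hedged example, if the host running the ODBC driver has the Google Cloud CLI installed, you could open an SSH tunnel to HiveServer2 on the Dataproc master node with a command along these lines; the cluster name, zone, and project are placeholders, and 10000 is the default HiveServer2 port referenced later in this chapter.

# Forward local port 10000 to HiveServer2 on the Dataproc master node (placeholder names)
gcloud compute ssh <cluster_name>-m --zone=<cluster_zone> --project=<project_id> -- -L 10000:localhost:10000 -N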

Configure driver and data source name (DSN) [page 76]
The Hive database datastore requires a supported ODBC driver and a data source name (DSN) connection.

Hive database datastore for Google Dataproc [page 77]
The Hive database datastore requires user name and password information to your Google Cloud Dataproc cluster.

Create a WebHDFS file location [page 78]
Upload generated data to WebHDFS through your Google Cloud Dataproc cluster by creating a file location.

Related Information

Hive database datastores [page 49]
Connect to HDFS [page 27]
Configuring bulk loading for Hive [page 53]


8.1 Configure driver and data source name (DSN)

The Hive database datastore requires a supported ODBC driver and a data source name (DSN) connection.

Before you create the DSN, download the supported ODBC driver from either Cloudera or Hortonworks. For version information, see the Product Availability Matrix (PAM) on the Customer Support portal.

Create a DSN on Windows

Verify that the ODBC driver is installed by using the ODBC Drivers Selector utility. Then, create a DSN using the Microsoft ODBC Data Source Administrator utility.

Note
For instructions to configure the DSN, see Creating a DSN connection with SSL protocol in Windows [page 51].

The following table contains the ODBC Data Source Administrator options that apply when you configure a DSN for HiveServer2.

Option Value

Hive Server Type Hive Server 2

Service Discovery Mode No Service Discovery

Host(s) Internal IP address for the Google Cloud Dataproc VM where you established port forwarding.

Port Enter the Google Cloud port number for HiveServer2. The default port for HiveServer2 in Google Cloud Dataproc is 10000.

Database The name of the Google Cloud Dataproc network.

Create a DSN on Linux

Configure the driver and create a DSN using the SAP Data Services Connection Manager.

Note
For steps, see Configuring ODBC driver with SSL protocol for Linux [page 52].

Ensure that you enter the internal IP address for the Google Cloud Dataproc VM where you established port forwarding for the Host name.

Parent topic: Google Cloud Dataproc [page 75]


Related Information

Hive database datastore for Google Dataproc [page 77]
Create a WebHDFS file location [page 78]
Using the ODBC Drivers Selector for Windows

8.2 Hive database datastore for Google Dataproc

The Hive database datastore requires user name and password information to your Google Cloud Dataproc cluster.

For the Hive datastore, the DSN that you previously created contains the connection information to your Google Dataproc cluster. Ensure that you select the applicable DSN when you create the datastore.

The following table contains the applicable settings to make when you create the Hive datastore.

Option Value

Datastore Type Database

Database Type Hive

Datastore Subtype Hive Server 2

Data Source Name The name of the DSN you created when you configured the ODBC driver.

User Name The user name associated with the project where you configured the Google Dataproc cluster.

Password The password associated with the project where you configured the Google Dataproc cluster.

Parent topic: Google Cloud Dataproc [page 75]

Related Information

Configure driver and data source name (DSN) [page 76]
Create a WebHDFS file location [page 78]
Hive database datastores [page 49]


8.3 Create a WebHDFS file location

Upload generated data to WebHDFS through your Google Cloud Dataproc cluster by creating a file location.

The following table contains options in the file location editor that are specific to Google Cloud Dataproc. Complete the other options as applicable.

Restriction
Before you use the WebHDFS file location, configure the host on the Job Server following the steps in Configuring host for WebHDFS file location [page 79].

Option Value

Protocol Hadoop

Communication Protocol WEBHDFS

Host Internal IP address for the Google Cloud Dataproc VM where you established port forwarding.

User The user name associated with the project where you configured the Google Dataproc cluster.

Password Password associated with the project where you configured the Google Dataproc cluster.

Remote Directory Path for the working directory in Google Cloud.

Local Directory Path for your local working directory.

Replication Factor Google Cloud Dataproc has a default replication factor of 2.

Configuring host for WebHDFS file location [page 79]
Before you test or use the WebHDFS file location for the first time, you must map the IP address to the Dataproc cluster host name in the HOSTS file.

Parent topic: Google Cloud Dataproc [page 75]

Related Information

Configure driver and data source name (DSN) [page 76]
Hive database datastore for Google Dataproc [page 77]
Creating a file location object
HDFS file location object options [page 28]


8.3.1 Configuring host for WebHDFS file location

Before you test or use the WebHDFS file location for the first time, you must map the IP address to the Dataproc cluster host name in the HOSTS file.

1. Open the HOSTS file with a text editor.

Location of the HOSTS file:
○ Linux: /etc/hosts
○ Windows: C:\Windows\System32\drivers\etc\hosts

2. On a new line in the hosts file, enter information that maps the IP address for the port forward machine to your Google Dataproc cluster host using the following syntax:

<IP_port_forward_host> <dataproc_master_node_name>.c.<dataproc_project_name>.internal

The master node name is the name of your Dataproc cluster followed by “-m”. Also add a line for each worker node as applicable. The worker node name is the name of your Dataproc cluster followed by “-w”.

Example
Add a line to your HOSTS file for the master node and worker nodes using the following information:
○ Internal IP of the port forward machine: 10.160.205.211
○ Dataproc cluster name: My_Cluster
○ Dataproc project name: My_Project

The strings that you add to the HOSTS file appear as follows:

10.160.205.211 My_Cluster-m.c.My_Project.internal
10.160.205.211 My_Cluster-w-0.c.My_Project.internal
10.160.205.211 My_Cluster-w-1.c.My_Project.internal

3. Save and close the hosts file.

Task overview: Create a WebHDFS file location [page 78]


9 SAP Big Data Services

Access data from your Kerberos-secured Hive cluster that resides in SAP Big Data Services using a Hive datastore and a Hadoop file location in SAP Data Services.

Big Data Services enables you to use Hadoop in the cloud without having to maintain and update Hadoop.

What can you do in Big Data Services

Big Data Services includes a Workbench, which is a gateway machine that provides access to your Hadoop data. The Workbench also provides other services to help you access and maintain your data. To access the Workbench, establish an SSH tunnel to Big Data Services.

The following list contains the basic tasks to process data in Data Services from your Hive cluster:

● Request an account for Big Data Services.
● Log in to the Big Data Services portal and obtain the required Workbench account information and public-private key pair.
● In PuTTY, enter account information and create a profile and an SSH tunnel to your Workbench account.
● In Data Services, download and install the applicable ODBC driver, and set environment variables.
● Create an HDFS database datastore using either a server name (DSN-less) connection or a DSN connection.
● Download metadata from your Hadoop or Hive cluster using the datastore. Then use the imported objects as sources or targets in data flows.
● Relate your target object with a file location object that contains the file transfer protocol information that Data Services uses to transfer data back to your Hadoop or Hive cluster.

For more information about SAP Big Data Services, see the documentation located on the help portal.

Setting SSH tunneling with port forwarding on Windows [page 81]
Obtain required information from your SAP Big Data Services Account Overview page and establish an SSH tunnel using PuTTY.

Setting SSH tunneling with port forwarding on UNIX or Linux [page 82]
Obtain required information from your SAP Big Data Services Account Overview page and establish an SSH tunnel using your SSH configuration.

ODBC driver requirements [page 83]
Download and install the specific SAP Big Data Services ODBC driver.

Generating API tokens for the Hive Server [page 85]
Generate an API token and use it as the password in your datastore or file location to access your Kerberos-secured Hive cluster.

Generating SWebHdfs delegation token for the HttpFS Service [page 86]
Generate an SWebHdfs delegation token and use it as a password in your datastore or file location to access your Hadoop cluster.

Obtaining an SSL certificate file [page 86]


Obtain a copy of the Hive SSL certificate file and use it when you configure encryption parameters for the database datastore.

Creating a DSN connection for Windows [page 87]
To create a DSN connection for the Hive ODBC driver on a Windows platform, use the ODBC Data Source Administrator.

Creating a DSN connection for UNIX or Linux [page 88]
After you download and install the applicable Hive ODBC driver, configure it using the SAP Data Services Connection Manager.

Creating a file location for Kerberos-secured Hadoop cluster [page 89]
To define the file transfer protocol for uploading data from SAP Data Services to a Kerberos-secured Hadoop cluster in SAP Big Data Services, use a file location object.

Creating a datastore for Kerberos-secured Hive cluster [page 90]
To download data from your Kerberos-secured Hive cluster, create a datastore.

Upload data to HDFS in the cloud [page 92]
Upload data processed with Data Services to your HDFS that is managed by SAP Big Data Services.

9.1 Setting SSH tunneling with port forwarding on Windows

Obtain required information from your SAP Big Data Services Account Overview page and establish an SSH tunnel using PuTTY.

The following steps are also documented with other processes in the SAP Big Data Services documentation at https://help.sap.com/viewer/50f26aa0f2044127bc5f6d5ad3d090fe/Cloud/en-US/2e603f7d76b8101480abd76da746ffe7.html.

To access your Kerberos-secured Hive cluster, use port 10010. Because port 10010 is not exposed on the open Internet, create an SSH tunnel that forwards port 10010 on your local machine to port 10010 in the Hive cluster behind the firewall.

1. Log in to Big Data Services to view the Account Overview page in the Cluster tab.
2. Scroll to Your Connection Information, open the Hadoop tab, and view the Hive Server2 information.

For example, the information for Hive Server 2 includes the following: httpfs-<cluster_name>-s3s.altiscale.com

3. Open PuTTY and load the Big Data Services session that you created when you set your environment.
4. In PuTTY, open the Connection node in the Category list at left, expand SSH, and select Tunnels.
5. Add 10010 to the Source port option under Add new forwarded port.

6. Enter information in Destination using the following format: hostname:port.

For example,

httpfs-<cluster_name>-s3s.altiscale.com:10010

7. Select Local and Auto.
8. Click Add.


9. Open the Sessions node at left and click Save.
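If you prefer the command line over the PuTTY dialogs, the plink utility that ships with PuTTY can open an equivalent tunnel. This is a sketch only; the key file, user, host, and port values come from your Account Overview page, and the destination host follows the Hive Server 2 value shown in step 2.

plink -ssh -i <your_private_key>.ppk -P <workbench_port> -N -L 10010:httpfs-<cluster_name>-s3s.altiscale.com:10010 <workbench_user>@<workbench_host>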

9.2 Setting SSH tunneling with port forwarding on UNIX or Linux

Obtain required information from your SAP Big Data Services Account Overview page and establish an SSH tunnel using your SSH configuration file.

The following steps are also documented with other processes in the SAP Big Data Services documentation at https://help.sap.com/viewer/50f26aa0f2044127bc5f6d5ad3d090fe/Cloud/en-US/2e603f7d76b8101480abd76da746ffe7.html.

For the Kerberos-secured port number, we use 10010 in our documentation. You can also use 10010 if it isn't already used. Because Port 10010 is not exposed to the open Internet, create an SSH tunnel that forwards port 10010 on your local machine to port 10010 on the Hive cluster behind the firewall.

1. Configure Workbench access following steps in the SAP Big Data Services documentation.

Find steps in “Configuring Workbench Access from Mac/Linux” in the Getting Started Guide for SAP Big Data Services at https://help.sap.com/viewer/50f26aa0f2044127bc5f6d5ad3d090fe/Cloud/en-US/2e602f8c76b8101480abd76da746ffe7.html.

2. Set your SSH configuration options.

Example ~/.ssh/config

Host <NAME_OF_CONFIGURATION>
 User <USERNAME>
 HostName <WORKBENCH_HOST>
 Port <WORKBENCH_PORT>
 IdentityFile ~/.ssh/<your_private_key>
 Compression yes
 ServerAliveInterval 15
 TCPKeepAlive yes
 DynamicForward localhost:1080
 LocalForward 10010 <cluster_name>:10010

SSH configuration option descriptions

Option Description

Host Specifies a name for the configuration. Later, use this name to access your workbench. For example, with a Host of "bigdata", access your workbench by entering the following command:

ssh bigdata

User, HostName, Port

Specifies the values from the Account Overview page of your Big Data Services portal listed for the Workbench.


Option Description

IdentityFile Specifies the file name that you used to save the SSH private key. If you generated your key with the ssh-keygen tool and did not provide your own filename, the software assigns the default file name of id_rsa.

LocalForward Specifies the Hive Server 2 information and the port number, separated with a colon. Obtain the information from your Big Data Services Account Overview page. Scroll to Your Connection Information, open the Hadoop tab, and view the Hive Server2 information.
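With the configuration saved, open the tunnel using the Host alias. For example, assuming the alias "bigdata" from the table above, either of the following commands keeps the forwarded ports available; the -N flag tells ssh not to start a remote shell.

# Interactive Workbench session; port forwarding is active while connected
ssh bigdata
# Tunnel only, no remote shell
ssh -N bigdata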

9.3 ODBC driver requirements

Download and install the specific SAP Big Data Services ODBC driver.

Administrators for Big Data Services ensure that the drivers are compatible with the Hive version. For example, administrators may release a new ODBC driver to fix a bug. Therefore, it is important to check with your administrator about the correct driver version to use.

Restriction

For important driver version information, see SAP Note 2784620.

For procedural information, read about installing ODBC drivers in the SAP Big Data Services User Guide at https://help.sap.com/viewer/50f26aa0f2044127bc5f6d5ad3d090fe/Cloud/en-US/2e6322ad76b8101480abd76da746ffe7.html.

After you download and install the ODBC driver, configure the driver using the configuration utility that supports your operating system.

ODBC driver configuration utilities

Platform Configuration utility

Windows ODBC Drivers Selector

UNIX or Linux SAP Data Services Connection Manager

About the ODBC Drivers Selector for Windows [page 84]
After you download and install the ODBC driver, use the ODBC Drivers Selector to set the applicable environment variables for the driver.

Configuring ODBC Hive driver with DS Connection Manager [page 84]
For UNIX or Linux platforms, use the SAP Data Services Connection Manager to configure the ODBC Hive driver.

Related Information


Using the ODBC Drivers Selector for Windows
Using the Connection Manager

9.3.1 About the ODBC Drivers Selector for Windows

After you download and install the ODBC driver, use the ODBC Drivers Selector to set the applicable environment variables for the driver.

By default, SAP Data Services installs the ODBCDriversSelector.exe to <DS_COMMON_DIR>\bin.

Find complete steps in the SAP Data Services Administrator Guide. Set the following options specifically for the Hive ODBC driver:

1. Scroll to the applicable Hive driver under the Database version column.
2. Click the dropdown arrow in the ODBC Driver column, and select the Hive driver.
3. Ensure that the value in the State column is "Installed". If it does not say that it is installed, make sure that you downloaded and installed the driver correctly. Or, make sure that you have Administrator privilege to the DSConf.txt file.

After you set the options in the ODBCDriversSelector.exe utility, open <DS_COMMON_DIR>\conf\migrated\DSConf.txt and verify that the data source and driver are present in the [DBClientDrivers] section. The format is <DATASOURCE_VERSION>=<ODBC_DRIVER>.
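As an illustration only, an entry in the [DBClientDrivers] section follows the <DATASOURCE_VERSION>=<ODBC_DRIVER> format. The key and value below are hypothetical and depend on the Hive database version and driver name shown in the ODBC Drivers Selector, so verify them against your own DSConf.txt rather than copying them as-is.

[DBClientDrivers]
Hive Server 2=Altiscale Hive ODBC Driver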

Parent topic: ODBC driver requirements [page 83]

Related Information

Configuring ODBC Hive driver with DS Connection Manager [page 84]
Using the ODBC Drivers Selector for Windows

9.3.2 Configuring ODBC Hive driver with DS Connection Manager

For UNIX or Linux platforms, use the SAP Data Services Connection Manager to configure the ODBC Hive driver.

For helpful information about the ODBC Hive driver configuration, see SAP Note 2776435.

1. Open the Data Services Connection Manager by entering the following command:

~ ./DSConnectionManager.sh

2. Enter information when the Connection Manager prompts you as shown in the following example.

The following example shows the Connection Manager prompts and sample values to create a DSN named "hive_bds_krb5":


Example

Configuration for Hive Server
******************************************************
 Specify the Driver Version: 2.1.2
 Specify the Driver Name: HIVE_BDS_DRIVER
 Specify the User Name: <user_name>
 Specify the Hive Password Type [0:normal|1:file]:'0' 1
 Specify the Hive Password Files: ' ' /build/<user_name>/bds/BDSToken.txt
 Specify the Driver: /opt/altiscale/hiveodbc/lib/64/libaltiscalehiveodbc64.so
 Specify the Host Name: localhost
 Specify the Port: '10000' 10010
 Specify the Unix ODBC Lib Path: /build/<user_name>/unixODBC-2.3.2/lib
 Specify the Hive Auth Mech [0:noauth|1:kerberos|2:user|3:user-passwd]: 3
 Specify the Hive SSL Mode [0:disabled|1:enabled]:'0' 1
 Specify the Hive SSL Server Certificate File: /build/<user_name>/bds/sap2.crt
 Specify the Hive twoway SSL [0:disabled|1:enabled]:'0'
 Testing connection...
 ^C
 [<user_name>@<server_name>:/build/<user_name>/ds<version>/dataservices/bin]$

For the Specify the Hive Password Files option: Enter the location of the locally-saved token that you obtained from your Big Data Services Hive Server 2.

For the Specify the Port option: Enter the port number, 10010, or the port number you used for the Kerberos-secured Hive cluster.

Task overview: ODBC driver requirements [page 83]

Related Information

About the ODBC Drivers Selector for Windows [page 84]

9.4 Generating API tokens for the Hive Server

Generate an API token and use it as the password in your datastore or file location to access your Kerberos-secured Hive cluster.

1. Log in to your SAP Big Data Services portal, click the User Menu dropdown arrow in the upper right, and select API Tokens from the list.


The API Tokens page appears at right.

2. Select Apache Hive - HiveServer2 Thrift API from the Select token type dropdown list.
3. Select the duration from the Select token duration dropdown list.
4. Click Create Token.

Big Data Services creates the token and displays the token content along with the token ID and expiration date.

5. Click Download and save the token content to a local file.

After you click Download, the token is no longer available for download for security reasons.

Copy the token and paste it into the Password option when you create either a file location or a datastore object.

9.5 Generating SWebHdfs delegation token for the HttpFS Service

Generate an SWebHdfs delegation token and use it as a password in your datastore or file location to access your Hadoop Cluster.

If, as a result of the following steps, you get a message stating that the token is expired before the actual expiration date, see SAP Note 2775794.

1. Log in to the SAP Big Data Services workbench using PuTTY and your SSH tunnel, and generate a Kerberos delegation token using the following command:

hadoop dtutil get swebhdfs://httpfs-<cluster_name>.s3s.altiscale.com:10010/ -renewer "" -format java -alias httpfs-<cluster_name>.s3s.altiscale.com token

2. To display the delegation token content, enter the following command:

hadoop dtutil print token

3. Copy and save the token to a local file.

Copy the token and paste it into the Password option when you create either a file location or a datastore object.

9.6 Obtaining an SSL certificate file

Obtain a copy of the Hive SSL certificate file and use it when you configure encryption parameters for the database datastore.

1. Log in to your Big Data Services Workbench using PuTTY and the SSH tunnel.
2. Create an empty local file and name it appropriately.


Example
Create an empty file named hive_bds_ssl_crt.pem in <LINK_DIR>\ssl\hive\hive_bds_ssl_crt.pem.

3. Copy the contents of the SSL certificate file that is located in the SAP Big Data Services cluster.

Find the certificate file in /etc/altiscale/ca/prod/ca-bundle.crt of the Big Data Services cluster.

4. Paste the copied contents of the SSL certificate file into the local file that you created.

Later, when you configure the datastore, enter the file name and location in the Encryption Parameters dialog box under the Advanced options of the datastore editor.
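Rather than pasting the certificate contents by hand, you can copy the file over the same SSH connection. The following sketch assumes a Linux Job Server and the SSH alias "bigdata" defined in the earlier SSH configuration example; adjust the target path to the empty file you created in step 2.

# Copy the CA bundle from the Big Data Services cluster over the SSH alias (assumption: alias "bigdata")
scp bigdata:/etc/altiscale/ca/prod/ca-bundle.crt <LINK_DIR>/ssl/hive/hive_bds_ssl_crt.pem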

9.7 Creating a DSN connection for Windows

To create a DSN connection for the Hive ODBC driver on a Windows platform, use the ODBC Data Source Administrator.

Download, install, and configure the applicable Hive ODBC driver.

Generate the Kerberos token in SAP Big Data Services and save it to a local file.

Copy the SSL certificate from Big Data Services and save it to a local file.

1. Open the ODBC Data Source Administrator from the Start menu in Windows or from the datastore editor when you create or edit a database datastore.

2. Select the applicable Hive ODBC driver and click Add.

Before the Hive ODBC driver appears in the list of drivers, you have to install and configure the driver.

3. In the Altiscale Hive ODBC Driver DSN Setup dialog box, select Hive Server 2 from the Hive Server Type dropdown list.
4. Enter a name in the Data Source Name text box.

This name is the DSN connection name that you select when you create the Hive database datastore.

5. Select No Service Discovery for Service Discovery Mode.
6. Enter localhost for Host(s).
7. Enter 10010 for Port.

This value is the port for the Hive Server 2. If you used a different port than 10010, enter that value instead.

8. Keep default for Database.
9. In the Authentication group, select User Name and Password for Mechanism.
10. Enter your Big Data Services user name for User Name.
11. Open the local file in which you saved the Kerberos delegation token, copy the token, and paste it into Password.
12. Select SASL for Thrift Transport.
13. Click SSL Options.

The SSL Options dialog box opens.

14. Click to select the following options:


○ Enable SSL
○ Allow Common Name Host Name Mismatch
○ Allow self-signed Server Certificate

15. Enter the location and file name for the SSL certificate that you saved locally for Trusted Certificates.

Optionally, use the Browse button to browse and select the file.

16. Click OK.

Use the DSN connection that you just created for the connection in the Hive database datastore.

9.8 Creating a DSN connection for UNIX or Linux

After you download and install the applicable Hive ODBC driver, configure it using the SAP Data Services Connection Manager.

1. Open the Data Services Connection Manager by entering the following command:

~ ./DSConnectionManager.sh

2. Enter information when the Connection Manager prompts you as shown in the following example.

The following example shows the prompts and sample values when you create a DSN named hive_bds_krb5 in the Connection Manager:

Example

Specify the DSN name from the list or add a new one: hive_bds_krb5
Specify the User Name: <user_name>
Specify the Hive Password type [0:normal|1:file]: '0' 1
Specify the Hive Password File: /build/<user_name>/bds/BDSToken.txt
Specify the Unix ODBC Lib Path: /build/<user_name>/unixODBC-2.3.2/lib
Specify the Drivers: /opt/altiscale/hiveodbc/lib/64/libaltiscalehiveodbc64.so
Specify the Driver Version: 2.1.2
Specify the Host Name: localhost
Specify the Port: '10000' 10010
Specify the Database: 'default'
Specify the Hive Service Discovery Mode [0:disabled|1:enabled]:'0'
Specify the Hive Auth Mech [0:noauth|1:kerberos|2:user|3:user-passwd]: 3
Specify the Hive Thrift Transport [0:binary|1:sasl|2:http]: 1
Specify the Hive SSL Mode [0:disabled|1:enabled]: '0' 1
Specify the Hive SSL Server Certificate File: /build/<user_name>/bds/sap2.crt
Specify the Hive twoway SSL [0:disabled|1:enabled]: '0'
Testing connection...


9.9 Creating a file location for Kerberos-secured Hadoop cluster

To define the file transfer protocol for uploading data from SAP Data Services to a Kerberos-secured Hadoop cluster in SAP Big Data Services, use a file location object.

● Set up your environment to use SSH tunneling to access your Workbench and your Hive cluster. For instructions, see Setting SSH tunneling with port forwarding on Windows [page 81] or Setting SSH tunneling with port forwarding on UNIX or Linux [page 82].

● Generate and download a Kerberos delegation token from your Big Data Services portal.

1. Open the Formats tab in the local library, right-click the File Location node, and select New.

The Create New File Location dialog box opens.

2. Enter a unique name for the file location.
3. Select Hadoop for Protocol.

Data Services automatically populates the Communication Protocol WEBHDFS options.

4. Enter localhost for Host.
5. Enter 14000 for Port.

Note
Ensure that you complete the prerequisite task to establish SSH tunneling with port forwarding for the port number that you enter for Port.

6. Enter your Big Data Services user name for User.
7. Paste the delegation token into Password.

The delegation token is valid only for a few hours. If it expires, log in to your Big Data Services portal and generate another Kerberos token.

8. Select Delegation Token for Authentication Type.
9. Select Yes for SSL Enabled.
10. Enter the applicable values in Remote Directory and Local Directory.
11. Click OK to save the new file location object.

Data Services saves the new file location under the File Location node in the Formats tab.
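To confirm that the tunnel, port, and delegation token work together before you run a job, you can optionally issue a WebHDFS REST call through the forwarded port. This sketch follows the assumptions in the steps above (localhost, port 14000, SSL enabled, SWebHdfs delegation token); the -k flag skips certificate validation, mirroring the self-signed certificate allowance used elsewhere in this setup.

curl -k -i "https://localhost:14000/webhdfs/v1/<remote_directory>?op=LISTSTATUS&delegation=<delegation_token>"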

Related Information

File location common options


9.10 Creating a datastore for Kerberos-secured Hive cluster

To download data from your Kerberos-secured Hive cluster, create a datastore.

● Set up your environment to use SSH tunneling to access your Hive cluster. For instructions, see Setting SSH tunneling with port forwarding on Windows [page 81] or Setting SSH tunneling with port forwarding on UNIX or Linux [page 82].

● Download and install the Hive ODBC driver.
● For a DSN connection, create a DSN connection with the Hive ODBC driver.
● Generate a Kerberos authentication token in your Big Data Services portal.
● Copy an SSL certificate file from Big Data Services and save to a local file.

1. Create a database datastore and complete the common options as described in the Designer Guide.
2. Complete the Hive options as described in the following table.

Hive datastore option descriptions

Option Value DSN or DSN-less

Datastore Type Select Database Both

Database Type Select Hive Both

Database Subtype Auto populates to Hive Server2 Both

Database Version Select Hive Server2 v1.2 Both

Use data source name (DSN) Select to indicate this datastore has a DSN connection.

For a server name connection, do not select this option.

DSN


ODBC Admin... Click to open the ODBC Data Source Administrator to create a DSN connection.

Not applicable if the DSN already exists.

Note
If you did not create a DSN connection before creating this datastore, click to open the ODBC Data Source Administrator and create a DSN following the steps in Creating a DSN connection for Windows [page 87] or Creating a DSN connection for UNIX or Linux [page 88].

Available only when you select Use data source name (DSN).

DSN

Data Source Name Select the DSN from the dropdown list.

Available only when you select Use data source name (DSN).

DSN

Database server name Enter localhost. DSN-less

Port Enter 10010 or the port number for which you set port forwarding.

DSN-less

Hive Authentication Select User Name and Password. DSN-less

User Name Enter the user name associated with the Hive cluster.

Both

Password Paste the Kerberos delegation token for Password.

The delegation token is valid only for a few hours. If it expires, log in to your Big Data Services portal and generate another Kerberos token. For steps, see Generating API tokens for the Hive Server [page 85].

Both

3. Under Advanced, open the Encryption Parameters dialog box by double-clicking in the empty cell for the Encryption parameters option. Complete the parameters as described in the following table:


Encryption option descriptions

Option Description

Allow Common Name Host Name Mismatch Specifies whether Data Services accepts a mismatch between an SSL certificate name issued by a certificate authority (CA) and the host name of the Hive server.

Select to enable.

Allow Self-signed Server Certificate Specifies whether Data Services allows a self-signed SSL server certificate. For a self-signed certificate, the same entity identified in the certificate signs the certificate.

Select to enable.

Trusted Certificates Specifies the location of the certificate file.

Enter or browse for the path and file name of the certificate you obtained from your Big Data Services Hive cluster.

4. Click OK to close the Encryption Parameters dialog box.
5. Complete the remaining Advanced options as applicable.
6. Click OK to save your new datastore.

The datastore appears in the Datastores tab in the object library.

Open the datastore and import data from your Hive cluster.

Related Information

Import metadata from database datastores

9.11 Upload data to HDFS in the cloud

Upload data processed with Data Services to your HDFS that is managed by SAP Big Data Services.

Big Data Services is a Hadoop distribution in the cloud. Big Data Services performs all Hadoop upgrades and patches for you and provides Hadoop support. SAP Big Data Services was formerly known as Altiscale.

Upload your big data files directly from your computer to Big Data Services. Or, upload your big data files from your computer to an established cloud account, and then to Big Data Services.

Example
Access data from S3 (Amazon Simple Storage Service) and use the data as a source in Data Services. Then upload the data to your HDFS that resides in Big Data Services in the cloud.

How you choose to upload your data is based on your use case.


For complete information about accessing your Hadoop account in Big Data Services and uploading big data, see the Supplement for SAP Big Data Services.


Important Disclaimers and Legal Information

Hyperlinks
Some links are classified by an icon and/or a mouseover text. These links provide additional information. About the icons:

● Links with the icon : You are entering a Web site that is not hosted by SAP. By using such links, you agree (unless expressly stated otherwise in your agreements with SAP) to this:

● The content of the linked-to site is not SAP documentation. You may not infer any product claims against SAP based on this information.
● SAP does not agree or disagree with the content on the linked-to site, nor does SAP warrant the availability and correctness. SAP shall not be liable for any damages caused by the use of such content unless damages have been caused by SAP's gross negligence or willful misconduct.

● Links with the icon : You are leaving the documentation for that particular SAP product or service and are entering a SAP-hosted Web site. By using such links, you agree that (unless expressly stated otherwise in your agreements with SAP) you may not infer any product claims against SAP based on this information.

Videos Hosted on External Platforms
Some videos may point to third-party video hosting platforms. SAP cannot guarantee the future availability of videos stored on these platforms. Furthermore, any advertisements or other content hosted on these platforms (for example, suggested videos or by navigating to other videos hosted on the same site), are not within the control or responsibility of SAP.

Beta and Other Experimental Features
Experimental features are not part of the officially delivered scope that SAP guarantees for future releases. This means that experimental features may be changed by SAP at any time for any reason without notice. Experimental features are not for productive use. You may not demonstrate, test, examine, evaluate or otherwise use the experimental features in a live operating environment or with data that has not been sufficiently backed up. The purpose of experimental features is to get feedback early on, allowing customers and partners to influence the future product accordingly. By providing your feedback (e.g. in the SAP Community), you accept that intellectual property rights of the contributions or derivative works shall remain the exclusive property of SAP.

Example Code
Any software coding and/or code snippets are examples. They are not for productive use. The example code is only intended to better explain and visualize the syntax and phrasing rules. SAP does not warrant the correctness and completeness of the example code. SAP shall not be liable for errors or damages caused by the use of example code unless damages have been caused by SAP's gross negligence or willful misconduct.

Bias-Free Language
SAP supports a culture of diversity and inclusion. Whenever possible, we use unbiased language in our documentation to refer to people of all cultures, ethnicities, genders, and abilities.


www.sap.com/contactsap

© 2022 SAP SE or an SAP affiliate company. All rights reserved.

No part of this publication may be reproduced or transmitted in any form or for any purpose without the express permission of SAP SE or an SAP affiliate company. The information contained herein may be changed without prior notice.

Some software products marketed by SAP SE and its distributors contain proprietary software components of other software vendors. National product specifications may vary.

These materials are provided by SAP SE or an SAP affiliate company for informational purposes only, without representation or warranty of any kind, and SAP or its affiliated companies shall not be liable for errors or omissions with respect to the materials. The only warranties for SAP or SAP affiliate company products and services are those that are set forth in the express warranty statements accompanying such products and services, if any. Nothing herein should be construed as constituting an additional warranty.

SAP and other SAP products and services mentioned herein as well as their respective logos are trademarks or registered trademarks of SAP SE (or an SAP affiliate company) in Germany and other countries. All other product and service names mentioned are the trademarks of their respective companies.

Please see https://www.sap.com/about/legal/trademark.html for additional trademark information and notices.
