Getting Started with CDP Private Cloud Base Upgrade and Migration

Cloudera Manager 7

Date published: 2020-08-10
Date modified: 2020-09-30

https://docs.cloudera.com/



Legal Notice

© Cloudera Inc. 2021. All rights reserved.

The documentation is and contains Cloudera proprietary information protected by copyright and other intellectual property rights. No license under copyright or any other intellectual property right is granted herein.

Copyright information for Cloudera software may be found within the documentation accompanying each component in a particular release.

Cloudera software includes software from various open source or other third party projects, and may be released under the Apache Software License 2.0 ("ASLv2"), the Affero General Public License version 3 (AGPLv3), or other license terms. Other software included may be released under the terms of alternative open source licenses. Please review the license and notice files accompanying the software for additional licensing information.

Please visit the Cloudera software product page for more information on Cloudera software. For more information on Cloudera support services, please visit either the Support or Sales page. Feel free to contact us directly to discuss your specific needs.

Cloudera reserves the right to change any products at any time, and without notice. Cloudera assumes no responsibility nor liability arising from the use of products, except as expressly agreed to in writing by Cloudera.

Cloudera, Cloudera Altus, HUE, Impala, Cloudera Impala, and other Cloudera marks are registered or unregistered trademarks in the United States and other countries. All other trademarks are the property of their respective owners.

Disclaimer: EXCEPT AS EXPRESSLY PROVIDED IN A WRITTEN AGREEMENT WITH CLOUDERA, CLOUDERA DOES NOT MAKE NOR GIVE ANY REPRESENTATION, WARRANTY, NOR COVENANT OF ANY KIND, WHETHER EXPRESS OR IMPLIED, IN CONNECTION WITH CLOUDERA TECHNOLOGY OR RELATED SUPPORT PROVIDED IN CONNECTION THEREWITH. CLOUDERA DOES NOT WARRANT THAT CLOUDERA PRODUCTS NOR SOFTWARE WILL OPERATE UNINTERRUPTED NOR THAT IT WILL BE FREE FROM DEFECTS NOR ERRORS, THAT IT WILL PROTECT YOUR DATA FROM LOSS, CORRUPTION NOR UNAVAILABILITY, NOR THAT IT WILL MEET ALL OF CUSTOMER'S BUSINESS REQUIREMENTS. WITHOUT LIMITING THE FOREGOING, AND TO THE MAXIMUM EXTENT PERMITTED BY APPLICABLE LAW, CLOUDERA EXPRESSLY DISCLAIMS ANY AND ALL IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO IMPLIED WARRANTIES OF MERCHANTABILITY, QUALITY, NON-INFRINGEMENT, TITLE, AND FITNESS FOR A PARTICULAR PURPOSE AND ANY REPRESENTATION, WARRANTY, OR COVENANT BASED ON COURSE OF DEALING OR USAGE IN TRADE.

Contents

CDP Upgrade and Migrations Paths
CDP PVC Base new features
Supported In-Place Upgrade Paths
    Cloudera Manager support for Cloudera Runtime and CDH
Assessing the Impact of Apache Hive
    Apache Hive features
    Apache Hive 3 architectural overview
    Identifying semantic changes and workarounds
        Casting timestamps
        Changing incompatible column types
        Creating tables
        Handling table reference syntax
        Add Backticks to Table References
        Handling the keyword APPLICATION
        Disabling partition type checking
        Dropping partitions
        Granting Permissions to Roles
        Handling output of greatest and least functions
        Renaming tables
    Hive unsupported interfaces and features
Supplemental Upgrade Topics
    Configuring a Local Package Repository
        Creating a Permanent Internal Repository
        Creating a Temporary Internal Repository
        Configuring Hosts to Use the Internal Repository
    Configuring a Local Parcel Repository
        Using an Internally Hosted Remote Parcel Repository
        Using a Local Parcel Repository
    Transitioning from the Cloudera Manager Embedded PostgreSQL Database Server to an External PostgreSQL Database
        Prerequisites
        Identify Roles that Use the Embedded Database Server
        Migrate Databases from the Embedded Database Server to the External PostgreSQL Database Server
    Using Replication Manager to migrate to CDP Private Cloud Base from CDH
        Replication Manager considerations while using CDH clusters
        Cluster upgrade supported and unsupported features


CDP Upgrade and Migrations Paths

Take a look at the overview, features, and advantages of CDP, and learn the upgrade and migration paths from the CDH or HDP platform to CDP.

Introduction to CDP

The merger of Cloudera and Hortonworks led to the new Cloudera Data Platform (CDP), which combines best-of-breed big data components from both Cloudera and Hortonworks.

Review the following information before you upgrade or migrate to Cloudera Data Platform (CDP):

• CDP overview
• CDP Private Cloud Base new features

CDP upgrade and migration paths are:

• CDP Upgrade and Migration paths for Data-at-Rest
• CDP Upgrade and Migration paths to CDP for CDF components
• CDP Upgrade and Migration paths for CDSW (Machine Learning)

CDP Upgrade and Migration Paths for Data-at-Rest

If you are an HDP or a CDH user, you can follow one of several upgrade or migration paths to CDP.

In-place upgrade
    Recommended for large clusters, where other paths are not viable. Involves downtime. See the following to learn more about the in-place upgrade paths:
    • In-place upgrade from CDH to CDP Private Cloud Base
    • In-place upgrade from HDP to CDP Private Cloud Base

Migrate to new cluster
    Recommended if you are ready for a hardware refresh or have small clusters. You can fall back to your original cluster in the event of upgrade issues. Requires additional hardware.

Direct to cloud migration
    Recommended if you can tolerate some cluster downtime or have bursty workloads. See the following to learn more about the direct to cloud migration paths:
    • NiFi workloads: If you are running NiFi workloads without Hive, Impala, HBase, or Kafka, migrate to CDP Data Hub and use the Flow Management cluster template.
    • Kafka workloads: If you are running Kafka workloads without Hive, Impala, HBase, or NiFi, migrate to CDP Data Hub and use the Streams Messaging cluster template.
    • HBase workloads: If you are running HBase workloads without Hive or Impala, migrate to CDP Data Hub and use the Operational Database cluster template.
    • Hive or Impala workloads: If you are running Hive or Impala workloads without HBase, migrate to Cloudera Data Warehouse.
    • Other workloads: Migrate to CDP Data Hub and use the custom cluster template.

CDP Upgrade and Migration Paths to CDP for CDF components

You can upgrade or migrate CDF components to CDP in the following ways:

Streaming workloads
    • In-place upgrade from CDH
    • In-place upgrade from HDP
    • Migrating Streaming workloads from HDF to CDP

Flow Management workloads
    • In-place upgrade from CFM for CDH to CFM 2.0.x for CDP
    • Migrating from HDF or standalone CFM 1.x to CFM for CDP

Workloads for deprecated components
    • Migrate Flume workloads to NiFi for CDP. This content is in development.
    • Migrate Storm workloads to Flink for CDP. This content is in development.

CDP Migration and Upgrade Paths for CDSW (Machine Learning)

You can upgrade or migrate CDSW to CDP Machine Learning in the following ways:


CDSW with CDH/HDP to CDSW with CDP Private Cloud Base
    For an in-place upgrade, follow the upgrade documentation. For a migration, follow the documented migration steps.

CDSW to CML - Public Cloud
    The workflow and tools to migrate from CDSW to CML in CDP Public Cloud are in development. To perform this migration now, contact your Cloudera account or professional services representative.

CDSW to CML - Private Cloud
    The workflow and tools to migrate from CDSW to CML in CDP Private Cloud are in development. To perform this migration now, contact your Cloudera account or professional services representative.

CDP PVC Base new features

If you are a CDH or an HDP user, review the new features available in CDP Private Cloud Base, in addition to the features carried over to CDP from the CDH and HDP releases.

• For CDH to CDP users
• For HDP to CDP users

CDH to CDP

• Ranger 2.0
  • Dynamic row filtering and column masking
  • Attribute-based access control and SparkSQL fine-grained access control
  • Sentry to Ranger migration tools
  • New Ranger RMS provides HDFS ACL sync

• Atlas 2.0
  • Business Metadata support through entity model extensions
  • Bulk import of Business Metadata attribute associations and glossary terms
  • Enhanced basic search and filtered search
  • Multi-tenancy support and easier administration with an enhanced UI
  • Data lineage and chain of custody
  • Advanced data discovery and business glossary
  • Navigator to Atlas migration
  • Improved performance and scalability
  • Ozone integration with Apache Atlas


• Hive 3
  • Hive-on-Tez for better ETL performance
  • Support for ACID (Atomicity, Consistency, Isolation, and Durability) transactions
  • Comprehensive ANSI SQL:2016 coverage
  • Major performance improvements
  • Query result cache
  • Surrogate keys
  • Materialized views
  • Scheduled queries and automatic materialized view rebuilds using SQL
  • Auto-translation for Spark-Hive reads; no HWC session needed
  • Hive Warehouse Connector Spark direct reads
  • Authorization of external file writes from Spark
  • Improved CBO and vectorization coverage

• Ozone
  • 10x the scalability of HDFS
  • Support for billions of objects and native S3 support
  • Support for dense data nodes
  • Fast restarts and easy maintenance

• HBase
  • HBase-Spark connector
  • Re-architected Medium Size Objects (MOB) for better compaction and performance

• Hue
  • Gateway-based SSO with Knox
  • Support for Ranger KMS-Key Trustee integration

• Kudu
  • Fine-grained authorization with Ranger
  • Support for Knox
  • Operational enhancements with rolling restarts and automatic rebalancing
  • Many usability improvements
  • New data types such as DATE and VARCHAR, and support for HybridClock timestamps

• YARN
  • New YARN Queue Manager
  • Placement rules enable you to submit jobs without specifying the queue name
  • Capacity Scheduler leverages delay scheduling to honor task locality constraints
  • Preemption allows applications of higher priority to preempt applications of lower priority
  • Same queue name under different hierarchies
  • Moving applications between queues
  • YARN absolute mode support

This is a generic service-level architecture of the components in the CDH stack. Components in the "Cloudera Applications", "Operations and Management", and "Encryption" boxes run outside of the cluster envelope defined in the CDH Cluster Services perimeter.


Components marked with a red "X" are either deprecated and removed, or replaced with an alternative component in CDP. These changes are noted in the CDP Cluster Architecture slide.


The service changes from CDH to CDP are:

• Flume to Cloudera Data Flow
• Navigator to Ranger/Atlas
• Sentry to Ranger
• Key Trustee KMS to Ranger KMS
• HSM KMS to Key HSM
• Hive-on-Spark/MR to Hive-on-Tez
• YARN Fair Scheduler to YARN Capacity Scheduler
• Spark 1.6 to Spark 2.4
• NavOpt to Workload XM
• Pig to Hive or Spark

HDP to CDP

• Cloudera Manager
  • Virtual private clusters
  • Automated wire encryption setup
  • Fine-grained role-based access control (RBAC) for administrators
  • Streamlined maintenance workflows

• Solr 8.4
  • Relevance-based text search over unstructured data (text, PDF, .jpg, and so on)


• Impala
  • Better fit for data mart migration use cases (interactive, BI-style queries)
  • Ability to query high volumes of data ("big data") in large clusters
  • Distributed queries in a cluster environment, for convenient scaling
  • Integration with Kudu for fast data and Ranger for authorization policies
  • Fast BI queries enable using a single system for big data processing and analytics, so customers avoid costly modeling and ETL to add analytics onto the data lake

• Hue
  • Built-in SQL editor with intelligent query autocomplete
  • Share queries, chart results, and downloads for any database
  • Easily search, glance at, and import datasets or jobs

• Kudu
  • Better ingest and query performance for fast-changing or updateable data; reporting with update support through Kudu and Impala
  • Real-time and streaming applications with Kudu and Spark
  • Time series analytics, event analytics, and real-time data warehousing, with the best querying experience and the most intelligent autocomplete

• YARN
  • Tools to transition to Capacity Scheduler
  • New YARN Queue Manager
  • Capacity Scheduler leverages delay scheduling to honor task locality constraints
  • Preemption allows applications of higher priority to preempt applications of lower priority
  • Same queue name under different hierarchies
  • Moving applications between queues
  • YARN absolute mode support

• Encryption
  • The Auto-TLS feature automates all the steps required to enable TLS encryption
  • Ranger KMS integration with Key Trustee Server to provide additional key provider storage
  • Encryption at rest with NavEncrypt

Supported In-Place Upgrade Paths

Supported upgrade paths for CDH, HDP, and CDP Private Cloud Base.

Supported Upgrades from CDH to CDP Private Cloud Base

• Upgrade from Cloudera Manager/CDH 5.13.x - 5.16.x to CDP Private Cloud Base 7.1. This upgrades CDH and Cloudera Manager to the following versions:
  • Cloudera Runtime 7.1.1 or higher
  • Cloudera Manager 7.1.1 or higher

• Upgrades from Cloudera Manager/CDH 5.0 - 5.12 to CDP Private Cloud Base require that you first upgrade to Cloudera Manager/CDH 5.13 or higher.

• Upgrades from CDH 6.0 and higher to CDP Private Cloud Base 7.1 are not supported. You can, however, create a new CDP Private Cloud Base 7.1 cluster and then copy your data to the new cluster. See Using Replication Manager to migrate to CDP Private Cloud Base from CDH.

• Kafka upgrades require that you are running CDK 4.1. If you are running an earlier version of CDK, you must first upgrade to CDK 4.1 and then perform the upgrade from CDH to CDP Private Cloud Base.


• Upgrades from Cloudera Director are not supported.

Upgrades from HDP to CDP Private Cloud Base

• You can upgrade from HDP 2.6.5 and Ambari 2.6.2.x to CDP Private Cloud Base 7.1. This upgrade requires several major steps, including upgrading to an interim version of Ambari. After the upgrade, your cluster will be managed by Cloudera Manager and the components will be upgraded to Cloudera Runtime 7.1.1 or higher.

• Upgrades from HDP 3.x to CDP Private Cloud Base are not currently supported.

Upgrades from CDP Private Cloud Base 7.0 to 7.1

• You can upgrade from Cloudera Manager 7.0.3 or higher to Cloudera Manager 7.1.1 or higher.
• You can upgrade from Cloudera Runtime 7.0.3 or higher to Cloudera Runtime 7.1.1 or higher.

Upgrades for CDSW in CDP Private Cloud Base 7.1

The version of CDSW must be 1.7.1 to upgrade CDSW for use with CDP Private Cloud Base.

Upgrades from CDH to higher versions of CDH

Current Cloudera Manager version: 6.3
    Supported Cloudera Manager upgrade paths:
    • 6.3.x maintenance releases
      Note: Upgrades to Cloudera Manager 6.3.3 or higher now require a Cloudera Enterprise license file and a username and password. See Upgrading the Cloudera Manager Server.
    Supported CDH upgrade paths:
    • 6.3.x maintenance releases
      Note: Upgrades to CDH 6.3.3 or higher now require a Cloudera Enterprise license file and a username and password. See Upgrading the Cloudera Manager Server.
    • 6.2.x
    • 6.1.x
    • 6.0.x
    • Any minor version of CDH 5.7 - 5.16

Current Cloudera Manager version: 6.2
    Supported Cloudera Manager upgrade paths:
    • 6.3.x
    • 6.2.x maintenance releases
    Supported CDH upgrade paths:
    • 6.2.x maintenance releases
    • 6.1.x
    • 6.0.x
    • Any minor version of CDH 5.7 - 5.16

Current Cloudera Manager version: 6.1
    Supported Cloudera Manager upgrade paths:
    • 6.3.x
    • 6.2.x
    • 6.1.x maintenance releases
    Supported CDH upgrade paths:
    • 6.1.x maintenance releases
    • 6.0.x
    • Any minor version of CDH 5.7 - 5.16

Current Cloudera Manager version: 6.0
    Supported Cloudera Manager upgrade paths:
    • 6.3.x
    • 6.2.x
    • 6.1.x
    • 6.0.x maintenance releases
    Supported CDH upgrade paths:
    • 6.0.x maintenance releases
    • Any minor version of CDH 5.7 - 5.14

Current Cloudera Manager version: 6.0 Beta release
    Supported Cloudera Manager upgrade paths: No supported upgrades.
    Supported CDH upgrade paths: No supported upgrades.

Current Cloudera Manager version: 5.16
    Supported Cloudera Manager upgrade paths:
    • 6.3.x
    • 6.2.x
    • 6.1.x
    • 5.16.x maintenance releases
    Supported CDH upgrade paths:
    • 5.16.x maintenance releases
    • Any minor version of CDH 5 that is lower than or equal to the Cloudera Manager version

Current Cloudera Manager version: 5.15
    Supported Cloudera Manager upgrade paths:
    • 6.3.x
    • 6.2.x
    • 6.1.x
    • 5.16.x
    • 5.15.x maintenance releases
    Supported CDH upgrade paths:
    • 5.15.x maintenance releases
    • Any minor version of CDH 5 that is lower than or equal to the Cloudera Manager version

Current Cloudera Manager version: 5.7 - 5.14
    Supported Cloudera Manager upgrade paths:
    • 6.3.x
    • 6.2.x
    • 6.1.x
    • 6.0
    • 5.7.x - 5.16
    Supported CDH upgrade paths:
    • Any minor version of CDH that is lower than or equal to the Cloudera Manager version

Current Cloudera Manager version: 5.0 - 5.6
    Supported Cloudera Manager upgrade paths:
    • 5.16 or lower
    Supported CDH upgrade paths:
    • Any minor version of CDH 5.0 - 5.6 that is lower than or equal to the Cloudera Manager version

The versions of Cloudera Runtime and CDH clusters that can be managed by Cloudera Manager are limited to the following:

Cloudera Manager support for CDH

Cloudera Manager 7.1.4 supports: Cloudera Runtime 7.1.4, 7.1.3, 7.1.2, 7.1.1, 7.0.3; CDH 6.3, 6.2, 6.1, 6.0; CDH 5.13 - 5.16
Cloudera Manager 7.1.3 supports: Cloudera Runtime 7.1.3, 7.1.2, 7.1.1, 7.0.3; CDH 6.3, 6.2, 6.1, 6.0; CDH 5.13 - 5.16
Cloudera Manager 7.1.2 supports: Cloudera Runtime 7.1.2, 7.1.1, 7.0.3; CDH 6.3, 6.2, 6.1, 6.0; CDH 5.13 - 5.16
Cloudera Manager 7.1.1 supports: Cloudera Runtime 7.1.1, 7.0.3; CDH 6.3, 6.2, 6.1, 6.0; CDH 5.13 - 5.16
Cloudera Manager 7.0.3 supports: Cloudera Runtime 7.0.3
Cloudera Manager 6.3 supports: CDH 6.3, 6.2, 6.1, 6.0; CDH 5.7 - 5.16
Cloudera Manager 6.2 supports: CDH 6.2, 6.1, 6.0; CDH 5.7 - 5.16
Cloudera Manager 6.1 supports: CDH 6.1, 6.0; CDH 5.7 - 5.16
Cloudera Manager 6.0 supports: CDH 6.0; CDH 5.7 - 5.14
Cloudera Manager 5.16.x or lower supports: Any minor version of CDH 5.0 - 5.16 that is lower than or equal to the Cloudera Manager version

Cloudera Director

Upgrades from Cloudera Director are not supported.

Cloudera Manager support for Cloudera Runtime and CDH

Describes which versions of CDH or Cloudera Runtime are supported by Cloudera Manager.

The versions of Cloudera Runtime and CDH clusters that can be managed by Cloudera Manager are limited to the following:

Cloudera Manager 7.3.1 supports: Cloudera Runtime 7.1.6, 7.1.5, 7.1.4, 7.1.3, 7.1.2, 7.1.1, 7.0.3; CDH 6.3, 6.2, 6.1, 6.0; CDH 5.13 - 5.16
Cloudera Manager 7.2.4 supports: Cloudera Runtime 7.1.5, 7.1.4, 7.1.3, 7.1.2, 7.1.1, 7.0.3; CDH 6.3, 6.2, 6.1, 6.0; CDH 5.13 - 5.16
Cloudera Manager 7.1.4 supports: Cloudera Runtime 7.1.4, 7.1.3, 7.1.2, 7.1.1, 7.0.3; CDH 6.3, 6.2, 6.1, 6.0; CDH 5.13 - 5.16
Cloudera Manager 7.1.3 supports: Cloudera Runtime 7.1.3, 7.1.2, 7.1.1, 7.0.3; CDH 6.3, 6.2, 6.1, 6.0; CDH 5.13 - 5.16
Cloudera Manager 7.1.2 supports: Cloudera Runtime 7.1.2, 7.1.1, 7.0.3; CDH 6.3, 6.2, 6.1, 6.0; CDH 5.13 - 5.16
Cloudera Manager 7.1.1 supports: Cloudera Runtime 7.1.1, 7.0.3; CDH 6.3, 6.2, 6.1, 6.0; CDH 5.13 - 5.16
Cloudera Manager 7.0.3 supports: Cloudera Runtime 7.0.3

Assessing the Impact of Apache Hive

Apache Hive features

Major changes between Apache Hive 2.x and Apache Hive 3.x improve transactions and security. Knowing the major differences between these versions is critical for SQL users, including those who use Apache Spark and Apache Impala.

Hive is a data warehouse system for summarizing, querying, and analyzing huge, disparate data sets.

ACID transaction processing

Hive 3 tables are ACID (Atomicity, Consistency, Isolation, and Durability)-compliant. Hive 3 write and read operations improve the performance of transactional tables. Atomic operations include simple writes and inserts, writes to multiple partitions, and multiple inserts in a single SELECT statement. A read operation is not affected by changes that occur during the operation. You can insert or delete data, and it remains consistent throughout software and hardware crashes. Creation and maintenance of Hive tables is simplified because there is no longer any need to bucket tables.

Materialized views

Because multiple queries frequently need the same intermediate roll-up or joined table, you can avoid costly, repetitious query execution by precomputing and caching intermediate tables into materialized views.
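As a sketch of what this looks like in practice (the table, column, and view names here are invented for illustration), you precompute the join once and rebuild it as needed:

CREATE MATERIALIZED VIEW mv_emp_dept AS
  SELECT e.empid, e.name, d.deptname
  FROM emps e JOIN depts d ON (e.deptno = d.deptno);

-- Refresh the precomputed results after the base tables change
ALTER MATERIALIZED VIEW mv_emp_dept REBUILD;

The optimizer can then rewrite matching queries to read from the view instead of recomputing the join.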

Query results cache

Hive filters and caches similar or identical queries. Hive does not recompute the data that has not changed. Caching repetitive queries can reduce the load substantially when hundreds or thousands of users of BI tools and web services query Hive.
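The cache is governed by configuration properties; for example, the following settings (the values are illustrative, and you should verify the property names and defaults against your Hive release) enable the cache and bound a cached entry's lifetime:

SET hive.query.results.cache.enabled=true;
-- Maximum lifetime of a cached entry, in milliseconds
SET hive.query.results.cache.max.entry.lifetime=3600000;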

Scheduled Queries

Using SQL statements, you can schedule Hive queries to run on a recurring basis, monitor query progress, temporarily ignore a query schedule, and limit the number of queries running in parallel. You can use scheduled queries to start compaction and to periodically rebuild materialized views, for example.
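As a hedged sketch (the names and interval are invented, and exact syntax can vary by release, so check the Hive scheduled queries documentation):

-- Rebuild a materialized view on a recurring basis
CREATE SCHEDULED QUERY rebuild_mv
EVERY 10 MINUTES
AS ALTER MATERIALIZED VIEW mv_emp_dept REBUILD;

-- Temporarily ignore the schedule
ALTER SCHEDULED QUERY rebuild_mv DISABLE;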

Spark integration with Hive

Spark and Hive tables interoperate using the Hive Warehouse Connector and Spark Direct Reader to access ACID managed tables. You can access external tables from Spark directly using SparkSQL.

You do not need HWC to read or write Hive external tables. Spark users just read from or write to Hive directly. You can read Hive external tables in ORC or Parquet formats. You can write Hive external tables in ORC format only.

Security improvements

Apache Ranger secures Hive data by default. To meet demands for concurrency improvements, ACID support, security, and other features, Hive tightly controls the location of the warehouse on a file system or object store, and memory resources.

With Apache Ranger and Apache Hive ACID support, your organization will be ready to support and implement GDPR (General Data Protection Regulation).

Workload management at the query level

You can configure who uses query resources, how much can be used, and how fast Hive responds to resource requests. Workload management can improve parallel query execution, cluster sharing for queries, and query performance. Although the names are similar, Hive workload management queries are unrelated to Cloudera Workload XM for reporting and viewing millions of queries and hundreds of databases.

Connection Pooling

Hive supports HikariCP JDBC connection pooling.

Unsupported features

CDP does not support the following features that were available in the HDP and CDH platforms:

• CREATE TABLE that specifies a managed table location

  Do not use the LOCATION clause to create a managed table. Hive assigns a default location in the warehouse to managed tables.
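  For example, where a pre-upgrade script created a managed table at a custom path, the CDP-era pattern is (the paths and names are illustrative):

  -- No longer supported: managed table with a custom LOCATION
  -- CREATE TABLE sales (id INT, amount DECIMAL(10,2)) LOCATION '/data/sales';

  -- Either let Hive place the managed table in the warehouse:
  CREATE TABLE sales (id INT, amount DECIMAL(10,2));

  -- or create an external table if you need the custom path:
  CREATE EXTERNAL TABLE sales_ext (id INT, amount DECIMAL(10,2)) LOCATION '/data/sales';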


• CREATE INDEX

  Hive automatically builds and stores indexes in ORC or Parquet within the main table, instead of in a separate table. Set hive.optimize.index.filter to enable use of the indexes (not recommended; use materialized views instead). Existing indexes are preserved and migrated in Parquet or ORC to CDP during the upgrade.

Related Information
Blog: Enabling high-speed Spark direct reader for Apache Hive ACID tables

Apache Hive 3 architectural overview

Understanding Apache Hive 3 major design features, such as default ACID transaction processing, can help you use Hive to address the growing needs of enterprise data warehouse systems.

Data storage and access control

One of the major architectural changes to support Hive 3 design gives Hive much more control over metadata memory resources and the file system, or object store. The following architectural changes from Hive 2 to Hive 3 provide improved security:

• Tightly controlled file system and computer memory resources, replacing flexible boundaries: definitive boundaries increase predictability. Greater file system control improves security.

• Optimized workloads in shared files and YARN containers

Hive 3 is optimized for object stores in the following ways:

• Hive uses ACID to determine which files to read rather than relying on the storage system.
• In Hive 3, file movement is reduced from that in Hive 2.
• Hive caches metadata and data aggressively to reduce file system operations.

The major authorization model for Hive is Ranger. Hive enforces access controls specified in Ranger. This model offers stronger security than other security schemes and more flexibility in managing policies.

This model permits only Hive to access the Hive warehouse.

Transaction processing

You can deploy new Hive application types by taking advantage of the following transaction processing characteristics:

• Mature versions of ACID transaction processing:

  ACID tables are the default table type.

  ACID enabled by default causes no performance or operational overhead.
• Simplified application development, operations with strong transactional guarantees, and simple semantics for SQL commands

  You do not need to bucket ACID tables.
• Materialized view rewrites
• Automatic query cache
• Advanced optimizations

Hive client changes

You can use the thin client Beeline for querying Hive from the command line. You can run Hive administrative commands from the command line. Beeline uses a JDBC connection to Hive to execute commands. Hive parses, compiles, and executes operations. Beeline supports many of the command-line options that the Hive CLI supported. Beeline does not support hive -e set key=value to configure the Hive Metastore.


You enter supported Hive CLI commands by invoking Beeline using the hive keyword, command option, and command. For example, hive -e set. Using Beeline instead of the thick-client Hive CLI, which is no longer supported, has several advantages, including low overhead. Beeline does not use the entire Hive code base. A small number of daemons required to run queries simplifies monitoring and debugging.
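For example, assuming a HiveServer host named hs2host and the default port 10000 (both illustrative), the two styles of invocation look like this:

# Interactive JDBC connection through Beeline
beeline -u jdbc:hive2://hs2host:10000 -n hive

# Hive CLI-style invocation; the hive keyword launches Beeline
hive -e "SELECT 1"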

Hive enforces allowlist and denylist settings that you can change using SET commands. Using the denylist, you can restrict memory configuration changes to prevent instability. You can configure different Hive instances with different allowlists and denylists to establish different levels of stability.

Apache Hive Metastore sharing

Hive, Impala, and other components can share a remote Hive metastore.

Spark integration

Spark and Hive tables interoperate using the Hive Warehouse Connector.

You can access ACID and external tables from Spark using the Hive Warehouse Connector. You do not need HWC to read or write Hive external tables; Spark users just read from or write to Hive directly. You can read Hive external tables in ORC or Parquet formats. You can write Hive external tables in ORC format only.

Query execution of batch and interactive workloads

You can connect to Hive using a JDBC command-line tool, such as Beeline, or using a JDBC/ODBC driver with a BI tool, such as Tableau. You configure the settings file for each instance to perform either batch or interactive processing.

Related Information
Blog: Enabling high-speed Spark direct reader for Apache Hive ACID tables

Identifying semantic changes and workarounds

As a SQL developer, analyst, or other Hive user, you need to know about potential problems with queries due to semantic changes. Some of the operations that changed were not widely used, so you might not encounter any of the problems associated with the changes.

Over the years, Apache Hive committers enhanced versions of Hive supported in legacy releases of CDH and HDP, with users in mind. Changes were designed to maintain compatibility with Hive applications. Consequently, few syntax changes occurred over the years. A number of semantic changes, described in this section, did occur, however. Workarounds are described for these semantic changes.

Casting timestamps

Results of applications that cast numerics to timestamps differ from Hive 2 to Hive 3. Apache Hive changed the behavior of CAST to comply with the SQL Standard, which does not associate a time zone with the TIMESTAMP type.

Before Upgrade to CDP

Casting a numeric type value into a timestamp could be used to produce a result that reflected the time zone of the cluster. For example, 1597217764557 is 2020-08-12 00:36:04 PDT. Running the following query casts the numeric value to a timestamp in PDT:

> SELECT CAST(1597217764557 AS TIMESTAMP);
| 2020-08-12 00:36:04 |

After Upgrade to CDP


Casting a numeric type value into a timestamp produces a result that reflects UTC instead of the time zone of the cluster. Running the following query casts the numeric value to a timestamp in UTC:

> SELECT CAST(1597217764557 AS TIMESTAMP);
| 2020-08-12 07:36:04.557 |

Action Required

Change applications. Do not cast from a numeric value to obtain a local time zone. The built-in functions from_utc_timestamp and to_utc_timestamp can be used to mimic the behavior before the upgrade.
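For example, to reproduce the pre-upgrade, cluster-local rendering of the value above (the time zone is illustrative):

> SELECT from_utc_timestamp(CAST(1597217764557 AS TIMESTAMP), 'America/Los_Angeles');
| 2020-08-12 00:36:04.557 |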

Related Information
Apache Hive web site summary of timestamp semantics

Changing incompatible column types

A default configuration change can cause applications that change column types to fail.

Before Upgrade to CDP

In HDP 2.x and CDH 5.x, hive.metastore.disallow.incompatible.col.type.changes is false by default to allow changes to incompatible column types. For example, you can change a STRING column to a column of an incompatible type, such as MAP<STRING, STRING>. No error occurs.

After Upgrade to CDP

In CDP, hive.metastore.disallow.incompatible.col.type.changes is true by default. Hive prevents changes to incompatible column types. Compatible column type changes, such as INT, STRING, BIGINT, are not blocked.

Action Required

Change applications so they do not make incompatible column type changes, to prevent possible data corruption. Check ALTER TABLE statements and change those that would fail due to incompatible column types.
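For example, a statement like the following (the table and column names are invented) succeeded on CDH 5.x but fails under the CDP default:

-- Fails in CDP: STRING to MAP<STRING,STRING> is an incompatible change
ALTER TABLE t1 CHANGE COLUMN c1 c1 MAP<STRING,STRING>;

-- Still allowed: compatible widening, such as INT to BIGINT
ALTER TABLE t1 CHANGE COLUMN c2 c2 BIGINT;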

Related Information
HIVE-12320

Creating tables

To improve usability and functionality, Hive 3 significantly changed table creation.

Hive has changed table creation in the following ways:

• Creates an ACID-compliant table, which is the default in CDP
• Supports simple writes and inserts
• Writes to multiple partitions
• Inserts multiple data updates in a single SELECT statement
• Eliminates the need for bucketing

If you have an ETL pipeline that creates tables in Hive, the tables will be created as ACID. Hive now tightly controls access and performs compaction periodically on the tables. The way you access managed Hive tables from Spark and other clients changes. In CDP, access to external tables requires you to set up security access permissions.

Before Upgrade to CDP

In CDH and HDP 2.x, by default CREATE TABLE created a non-ACID table.

After Upgrade to CDP

In CDP, by default CREATE TABLE creates a full, ACID transactional table in ORC format.

Action Required

Perform one or more of the following actions:


• Configure legacy CREATE TABLE behavior to create external tables by default, as sketched after this list.
• To read Hive ACID tables from Spark, you connect to Hive using the Hive Warehouse Connector (HWC) or the HWC Spark Direct Reader. To write ACID tables to Hive from Spark, you use the HWC and HWC API. Spark creates an external table with the purge property when you do not use the HWC API. For more information, see HWC Spark Direct Reader and Hive Warehouse Connector.
• Set up Ranger policies and HDFS ACLs for tables. For more information, see HDFS ACLs and HDFS ACL Permissions.
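A minimal sketch of the first option follows; the session-level property names below are those commonly cited for CDP Hive 3, so verify them against your release, and the table definition is illustrative:

-- Revert to legacy CREATE TABLE behavior for the session
SET hive.create.as.insert.only=false;
SET hive.create.as.acid=false;

-- Or state the intent explicitly in the DDL
CREATE EXTERNAL TABLE staging_events (id BIGINT, payload STRING);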

Handling table reference syntax

For ANSI SQL compliance, Hive 3.x rejects `db.table` in SQL queries, as described by the HIVE-16907 bug fix. A dot (.) is not allowed in table names. As a data engineer, you need to ensure that Hive tables do not contain these references before migrating the tables to CDP, that scripts are changed to comply with the SQL standard references, and that users are aware of the requirement.

About this task

To change queries that use such `db.table` references, and thereby prevent Hive from interpreting the entire db.table string incorrectly as the table name, enclose the database name and the table name in backticks, as follows.

Procedure

1. Find a table having the problematic table reference.

   For example, math.students appears in a CREATE TABLE statement.

2. Enclose the database name and the table name in backticks.

CREATE TABLE `math`.`students` (name VARCHAR(64), age INT, gpa DECIMAL(3,2));

Add Backticks to Table References

CDP includes the HIVE-16907 bug fix, which rejects `db.table` in SQL queries. A dot (.) is not allowed in table names. You need to change queries that use such references to prevent Hive from interpreting the entire db.table string as the table name.

Procedure

1. Find a table having the problematic table reference.

   For example, math.students appears in a CREATE TABLE statement.

2. Enclose the database name and the table name in backticks.

CREATE TABLE `math`.`students` (name VARCHAR(64), age INT, gpa DECIMAL(3,2));

Handling the keyword APPLICATION

If you use the keyword APPLICATION in your queries, you might need to modify the queries to prevent failure.

To prevent a query that uses a keyword from failing, enclose the keyword in backticks.

Before Upgrade to CDP


In CDH releases, such as CDH 5.13, queries that use the word APPLICATION execute successfully. For example, you could use this word as a table name.

> select f1, f2 from application

After Upgrade to CDP

A query that uses the keyword APPLICATION fails.

Action Required

Change applications. Enclose the keyword in backticks. For example:

SELECT field1, field2 FROM `application`;

Disabling partition type checking

An enhancement in Hive 3 checks the types of partitions. This feature can be disabled by setting a property. For more information, see the ASF Apache Hive Language Manual.

Before Upgrade to CDP

In CDH 5.x, partition values are not type checked.

After Upgrade to CDP

Partition values specified in the partition specification are type checked, converted, and normalized to conform to their column types if the property hive.typecheck.on.insert is set to true (the default). The values can be numbers.

Action Required

If type checking of partitions causes problems, disable the feature. To disable partition type checking, set hive.typecheck.on.insert to false. For example:

SET hive.typecheck.on.insert=false;

Related Information
Hive Language Manual: Alter Partition

Dropping partitions

The OFFLINE and NO_DROP keywords in the CASCADE clause for dropping partitions cause performance problems and are no longer supported.

Before Upgrade to CDP

You could use the OFFLINE and NO_DROP keywords in the DROP CASCADE clause to prevent partitions from being read or dropped.

After Upgrade to CDP

OFFLINE and NO_DROP are not supported in the DROP CASCADE clause.

Action Required

Change applications to remove OFFLINE and NO_DROP from the DROP CASCADE clause. Use an authorization scheme, such as Ranger, to prevent partitions from being dropped or read.
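For example (the table and partition are illustrative), the unsupported keywords are simply dropped from the statement:

-- Drop a partition without the OFFLINE or NO_DROP keywords
ALTER TABLE sales DROP IF EXISTS PARTITION (ds='2020-08-01');
-- Use Ranger policies, rather than keywords, to protect partitions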

Granting Permissions to Roles

In CDH, the ROLE/GROUP semantics are different from those semantics in CDP. Hive 3 requires a tightly controlled file system and computer memory resources, replacing the flexible boundaries that earlier Hive versions allowed.

Definitive boundaries increase predictability. Greater file system control improves security. This model offers stronger security than other security schemes and better policy management.

Before Upgrade to CDP


In CDH, Sentry was recommended for CDH policy management. CDH supported GRANT ON ROLE semantics.

After Upgrade to CDP

The major authorization model in Hive 3 is Ranger, not Sentry. If migrating from CDH, move away from Sentry toward Apache Ranger. GRANT ON ROLE semantics are not supported.

Action Required

Use GRANT semantics supported in CDP, for example, to set up file system permissions:

GRANT <permissions> ON TABLE <table> TO USER <user or group>;

Use the semantics described in Configuring a resource-based policy: Hive.

Related Information
Sentry to Ranger Permissions

Apache Software Foundation HDFS Permissions Guide

Apache Hive 3 Architectural Overview

Configure a Resource-based Policy: Hive

Handling output of greatest and least functions

To calculate the greatest (or least) value in a column, you need to work around a problem that occurs when the column has a NULL value.

Before Upgrade to CDP

The greatest function returned the highest value of the list of values. The least function returned the lowest value of the list of values.

After Upgrade to CDP

These functions return NULL when one or more arguments are NULL.

Action Required

Use NULL filters or the nvl function on the columns you use as arguments to the greatest or least functions.

SELECT greatest(nvl(col1, <default value in case of NULL>), nvl(col2, <default value in case of NULL>));

Renaming tables

To harden the system, Hive data can be stored in HDFS encryption zones. RENAME has been changed to prevent moving a table outside the same encryption zone or into a no-encryption zone.

Before Upgrade to CDP

In CDH and HDP, renaming a managed table moves its HDFS location.

After Upgrade to CDP

Renaming a managed table moves its location only if the table was created without a LOCATION clause and is under its database directory.

Action Required

None

Hive unsupported interfaces and features

You need to know the interfaces available in HDP or CDH platforms that are not supported.

The following interfaces are not supported in CDP Private Cloud Base:


• Druid
• Hcat CLI
• Hive CLI (replaced by Beeline)
• Hive View
• LLAP
• MapReduce execution engine (replaced by Tez)
• Pig
• S3 for storing tables
• Spark execution engine (replaced by Tez)
• Spark Thrift server

  Spark and Hive tables interoperate using the Hive Warehouse Connector.
• SQL Standard Authorization
• Tez View
• WebHCat

  You can use Hue in lieu of Hive View.

Partially unsupported interfaces

Apache Hadoop Distributed Copy (DistCp) is not supported for copying Hive ACID tables.

Unsupported features

HiveServer (HS2) high availability (HA) is provided through Knox in CDP Private Cloud Base 7.1.6 or later. In any release of CDP Private Cloud Base, if you don't use Knox, you can use the ZooKeeper Discovery Service.

Supplemental Upgrade Topics

Additional topics to help with special situations.

Configuring a Local Package Repository

You can create a package repository for Cloudera Manager either by hosting an internal web repository or by manually copying the repository files to the Cloudera Manager Server host for distribution to Cloudera Manager Agent hosts.


Important: Select a supported operating system for the versions of Cloudera Manager or CDH that you are downloading. See CDH and Cloudera Manager Supported Operating Systems.


22

Page 23: Getting Started with CDP Private Cloud Base Upgrade and

Cloudera Manager Supplemental Upgrade Topics


Warning: Upgrades from Cloudera Manager 5.12 and lower to Cloudera Manager 7.1.1 or higher are not supported.

Warning: You cannot upgrade from a cluster that uses Oracle 12.

Warning: You cannot upgrade from a cluster that uses Oracle 19.

Warning: For upgrades from CDH 5 clusters with Cloudera Navigator to Cloudera Runtime 7.1.1 (or higher) clusters where Navigator is to be migrated to Apache Atlas, the cluster must have Kerberos enabled before upgrading.

Warning: For upgrades from CDH 5 clusters with Sentry to Cloudera Runtime 7.1.1 clusters where Sentry privileges are to be transitioned to Apache Ranger, the cluster must have Kerberos enabled before upgrading.

Note: If the cluster you are upgrading will include Atlas, Ranger, or both, the upgrade wizard deploys one infrastructure Solr service to provide a search capability for the audit logs through the Ranger Admin UI and/or to store and serve Atlas metadata. Cloudera recommends that you do not use this service for customer workloads to avoid interference with audit and timeline performance.

Creating a Permanent Internal Repository

The following sections describe how to create a permanent internal repository using Apache HTTP Server:

Setting Up a Web Server

To host an internal repository, you must install or use an existing Web server on an internal host that is reachable by the Cloudera Manager host, and then download the repository files to the Web server host. The examples in this section use Apache HTTP Server as the Web server. If you already have a Web server in your organization, you can skip to Downloading and Publishing the Package Repository for Cloudera Manager.

1. Install Apache HTTP Server:

   RHEL / CentOS

   sudo yum install httpd

   SLES

   sudo zypper install apache2

   Ubuntu

   sudo apt-get install apache2

2. Start Apache HTTP Server:

   RHEL 7

   sudo systemctl start httpd

   SLES 12, Ubuntu 16 or later

   sudo systemctl start apache2


Downloading and Publishing the Package Repository for Cloudera Manager

1. Download the package repository for the product you want to install:

   Cloudera Manager 7

   To download the files for a Cloudera Manager release, download the repository tarball for your operating system. Then unpack the tarball, move the files to the Web server directory, and modify file permissions. For example:

   sudo mkdir -p /var/www/html/cloudera-repos/cm7
   wget https://[username]:[password]@archive.cloudera.com/p/cm7/7.0.3/repo-as-tarball/cm7.0.3-redhat7.tar.gz
   tar xvfz cm7.0.3-redhat7.tar.gz -C /var/www/html/cloudera-repos/cm7 --strip-components=1
   sudo chmod -R ugo+rX /var/www/html/cloudera-repos/cm7

2. Visit the Repository URL http://<web_server>/cloudera-repos/ in your browser and verify the files you downloaded are present. If you do not see anything, your Web server may have been configured to not show indexes.
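   You can also check from the command line; for example, with a hypothetical host name repo.example.com:

   curl http://repo.example.com/cloudera-repos/cm7/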

Creating a Temporary Internal Repository

You can quickly create a temporary remote repository to deploy packages on a one-time basis. Cloudera recommends using the same host that runs Cloudera Manager, or a gateway host. This example uses Python SimpleHTTPServer as the Web server to host the /var/www/html directory, but you can use a different directory.

1. Download the repository you need following the instructions in Downloading and Publishing the Package Repository for Cloudera Manager.

2. Determine a port that your system is not listening on. This example uses port 8900.

3. Start a Python SimpleHTTPServer in the /var/www/html directory:

   cd /var/www/html
   python -m SimpleHTTPServer 8900

   Serving HTTP on 0.0.0.0 port 8900 ...

4. Visit the Repository URL http://<web_server>:8900/cloudera-repos/ in your browser and verify the files you downloaded are present.

Configuring Hosts to Use the Internal Repository

After establishing the repository, modify the client configuration to use it:

RHEL compatible
    Create /etc/yum.repos.d/cloudera-repo.repo files on cluster hosts with the following content, where <web_server> is the hostname of the Web server:

    [cloudera-repo]
    name=cloudera-repo
    baseurl=http://<web_server>/cm/5
    enabled=1
    gpgcheck=0

SLES
    Use the zypper utility to update client system repository information by issuing the following command:

    zypper addrepo http://<web_server>/cm <alias>

Ubuntu
    Create /etc/apt/sources.list.d/cloudera-repo.list files on all cluster hosts with the following content, where <web_server> is the hostname of the Web server:

    deb http://<web_server>/cm <codename> <components>

    You can find the <codename> and <components> variables in the ./conf/distributions file in the repository.

    After creating the .list file, run the following command:

    sudo apt-get update

Configuring a Local Parcel Repository

You can create a parcel repository for Cloudera Manager either by hosting an internal Web repository or by manually copying the repository files to the Cloudera Manager Server host for distribution to Cloudera Manager Agent hosts.


Using an Internally Hosted Remote Parcel Repository

The following sections describe how to use an internal Web server to host a parcel repository:

Setting Up a Web Server

To host an internal repository, you must install or use an existing Web server on an internal host that is reachable by the Cloudera Manager host, and then download the repository files to the Web server host. The examples on this page use Apache HTTP Server as the Web server. If you already have a Web server in your organization, you can skip to Downloading and Publishing the Parcel Repository.

1. Install Apache HTTP Server:

   RHEL / CentOS

   sudo yum install httpd

   SLES

   sudo zypper install apache2

   Ubuntu

   sudo apt-get install apache2


2. Warning: Skipping this step could result in a "Hash verification failed" error when trying to download the parcel from a local repository, especially in Cloudera Manager 6 and higher.

   Edit the Apache HTTP Server configuration file (/etc/httpd/conf/httpd.conf by default) to add or edit the following line in the <IfModule mime_module> section:

AddType application/x-gzip .gz .tgz .parcel

   If the <IfModule mime_module> section does not exist, you can add it in its entirety as follows:

   Note: This example configuration was modified from the default configuration provided after installing Apache HTTP Server on RHEL 7.

<IfModule mime_module>
    #
    # TypesConfig points to the file containing the list of mappings from
    # filename extension to MIME-type.
    #
    TypesConfig /etc/mime.types
    #
    # AddType allows you to add to or override the MIME configuration
    # file specified in TypesConfig for specific file types.
    #
    #AddType application/x-gzip .tgz
    #
    # AddEncoding allows you to have certain browsers uncompress
    # information on the fly. Note: Not all browsers support this.
    #
    #AddEncoding x-compress .Z
    #AddEncoding x-gzip .gz .tgz
    #
    # If the AddEncoding directives above are commented-out, then you
    # probably should define those extensions to indicate media types:
    #
    AddType application/x-compress .Z
    AddType application/x-gzip .gz .tgz .parcel
    #
    # AddHandler allows you to map certain file extensions to "handlers":
    # actions unrelated to filetype. These can be either built into the server
    # or added with the Action directive (see below)
    #
    # To use CGI scripts outside of ScriptAliased directories:
    # (You will also need to add "ExecCGI" to the "Options" directive.)
    #
    #AddHandler cgi-script .cgi

    # For type maps (negotiated resources):
    #AddHandler type-map var

    #
    # Filters allow you to process content before it is sent to the client.
    #
    # To parse .shtml files for server-side includes (SSI):
    # (You will also need to add "Includes" to the "Options" directive.)
    #
    AddType text/html .shtml
    AddOutputFilter INCLUDES .shtml
</IfModule>


3. Start Apache HTTP Server:

   RHEL 7

   sudo systemctl start httpd

   SLES 12, Ubuntu 16 or later

   sudo systemctl start apache2

Downloading and Publishing the Parcel Repository

1. Download manifest.json and the parcel files for the product you want to install:

   To download the files for the latest Runtime 7 release, run the following commands on the Web server host:

   sudo mkdir -p /var/www/html/cloudera-repos
   sudo wget --recursive --no-parent --no-host-directories https://[username]:[password]@archive.cloudera.com/p/cdh7/7.0.3.0/parcels/ -P /var/www/html/cloudera-repos
   sudo chmod -R ugo+rX /var/www/html/cloudera-repos/cdh7

2. Visit the Repository URL http://<web_server>/cloudera-repos/ in your browser and verify the files you downloaded are present. If you do not see anything, your Web server may have been configured to not show indexes.

Configuring Cloudera Manager to Use an Internal Remote Parcel Repository

1. Use one of the following methods to open the parcel settings page:

   • Navigation bar

     a. Click the parcel icon in the top navigation bar, or click Hosts and click the Parcels tab.
     b. Click the Configuration button.

   • Menu

     a. Select Administration > Settings.
     b. Select Category > Parcels.

2. In the Remote Parcel Repository URLs list, click the addition symbol to open an additional row.

3. Enter the path to the parcel. For example: http://<web_server>/cloudera-parcels/cdh7/7.2.8.0/

4. Enter a Reason for change, and then click Save Changes to commit the changes.

Using a Local Parcel Repository

To use a local parcel repository, complete the following steps:

1. Open the Cloudera Manager Admin Console and navigate to the Parcels page.

2. Select Configuration and verify that you have a Local Parcel Repository path set. By default, the directory is /opt/cloudera/parcel-repo.

3. Remove any Remote Parcel Repository URLs you are not using, including ones that point to Cloudera archives.

4. Add the parcel you want to use to the local parcel repository directory that you specified. For instructions on downloading parcels, see Downloading and Publishing the Parcel Repository above.

5. In the command line, navigate to the local parcel repository directory.


6. Create a SHA1 hash for the parcel you added and save it to a file named parcel_name.parcel.sha.

   For example, the following command generates a SHA1 hash for the parcel CDH-6.1.0-1.cdh6.1.0.p0.770702-el7.parcel:

   sha1sum CDH-6.1.0-1.cdh6.1.0.p0.770702-el7.parcel | awk '{ print $1 }' > CDH-6.1.0-1.cdh6.1.0.p0.770702-el7.parcel.sha

7. Change the ownership of the parcel and hash files to cloudera-scm:

sudo chown -R cloudera-scm:cloudera-scm /opt/cloudera/parcel-repo/*
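If you stage several parcels at once, a short loop can generate the hash files and fix ownership in one pass. This sketch assumes the default /opt/cloudera/parcel-repo directory used above:

# Hash every staged parcel (step 6) and re-own the results (step 7)
cd /opt/cloudera/parcel-repo
for p in *.parcel; do
    sha1sum "$p" | awk '{ print $1 }' > "$p.sha"
done
sudo chown -R cloudera-scm:cloudera-scm /opt/cloudera/parcel-repo/*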

8. In the Cloudera Manager Admin Console, navigate to the Parcels page.

9. Click Check for New Parcels and verify that the new parcel appears.

10. Download, distribute, and activate the parcel.

Transitioning from the Cloudera Manager Embedded PostgreSQL Database Server to an External PostgreSQL Database

Cloudera Manager provides an embedded PostgreSQL database server for demonstration and proof of concept deployments when creating a cluster. To remind users that this embedded database is not suitable for production, Cloudera Manager displays the banner text: "You are running Cloudera Manager in non-production mode, which uses an embedded PostgreSQL database. Switch to using a supported external database before moving into production."

If, however, you have already used the embedded database, and you are unable to redeploy a fresh cluster, then you must migrate to an external PostgreSQL database.

Note: This procedure does not describe how to migrate to a database server other than PostgreSQL. Moving databases from one database server to a different type of database server is a complex process that requires modification of the schema and matching the data in the database tables to the new schema. It is strongly recommended that you engage Cloudera Professional Services if you wish to perform a transition to an external database server other than PostgreSQL.

Prerequisites

Before migrating the Cloudera Manager embedded PostgreSQL database to an external PostgreSQL database, ensure that your setup meets the following conditions:

• The external PostgreSQL database server is running.
• The database server is configured to accept remote connections.
• The database server is configured to accept user logins using md5.
• No one has manually created any databases in the external database server for roles that will be migrated.

Note: To view a list of databases in the external database server (requires default superuser permission):

sudo -u postgres psql -l

• All health issues with your cluster have been resolved.

For details about configuring the database server, see Configuring and Starting the PostgreSQL Server.

Important: Only perform the steps in Configuring and Starting the PostgreSQL Server. Do not proceed with the creation of databases as described in the subsequent section.
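As an illustration only, the remote-connection and md5 prerequisites above typically correspond to entries like the following. File locations vary by PostgreSQL version and operating system, and the network range shown is a placeholder you must adapt:

# postgresql.conf -- listen beyond localhost (restrict to specific interfaces if possible)
listen_addresses = '*'

# pg_hba.conf -- accept md5 password authentication from cluster hosts (placeholder range)
host    all    all    10.0.0.0/8    md5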

For large clusters, Cloudera recommends running your database server on a dedicated host. Engage Cloudera Professional Services or a certified database administrator to correctly tune your external database server.


Identify Roles that Use the Embedded Database Server

Before you can migrate to another database server, you must first identify the databases using the embedded database server. When the Cloudera Manager embedded database server is initialized, it creates the Cloudera Manager database and databases for roles in the Management Services. The Installation Wizard (which runs automatically the first time you log in to Cloudera Manager) or the Add Service action for a cluster creates additional databases for roles when run. In this procedure, you identify which of those roles use the embedded database server.

To identify which roles are using the Cloudera Manager embedded database server:

1. Obtain and save the cloudera-scm superuser password from the embedded database server. You will need this password in subsequent steps:

head -1 /var/lib/cloudera-scm-server-db/data/generated_password.txt

2. Make a list of all services that are using the embedded database server. Then, after determining which services are not using the embedded database server, remove those services from the list. The scm database must remain on your list. Use the following table as a guide:

Table 1: Cloudera Manager Embedded Database Server Databases

Service                       Role                        Default Database Name   Default Username
Cloudera Manager Server                                   scm                     scm
Cloudera Management Service   Activity Monitor            amon                    amon
Hive                          Hive Metastore Server       hive                    hive
Hue                           Hue Server                  hue                     hue
Cloudera Management Service   Navigator Audit Server      nav                     nav
Cloudera Management Service   Navigator Metadata Server   navms                   navms
Oozie                         Oozie Server                oozie_oozie_server      oozie_oozie_server
Cloudera Management Service   Reports Manager             rman                    rman
Sentry                        Sentry Server               sentry                  sentry


3. Verify which roles are using the embedded database. Roles using the embedded database server always use port 7432 (the default port for the embedded database) on the Cloudera Manager Server host. A command-line cross-check appears after the per-service checks below.

For Cloudera Management Services:

a. Select Cloudera Management Service > Configuration, and type "7432" in the Search field.
b. Confirm that the hostname for the services being used is the same hostname used by the Cloudera Manager Server.

Note:

If any of the following fields contain the value "7432", then the service is using the embedded database:

• Activity Monitor
• Navigator Audit Server
• Navigator Metadata Server
• Reports Manager

For the Oozie Service:

a. Select Oozie service > Configuration, and type "7432" in the Search field.
b. Confirm that the hostname is the Cloudera Manager Server.

For Hive, Hue, and Sentry Services:

a. Select the specific service > Configuration, and type "database host" in the Search field.
b. Confirm that the hostname is the Cloudera Manager Server.
c. In the Search field, type "database port" and confirm that the port is 7432.
d. Repeat these steps for each of the services (Hive, Hue, and Sentry).
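As a quick command-line cross-check of the per-service checks above (optional; run on the Cloudera Manager Server host), you can list connections to the embedded database port directly:

# Established connections to port 7432 indicate roles still using the embedded database
sudo netstat -atnp | grep 7432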

4. Verify that the database names in the embedded database server match the database names on your list (Step 2). Databases that exist on the database server but are not used by their roles do not need to be migrated. This step confirms that your list is correct.

Note: Do not add the postgres, template0, or template1 databases to your list. These are used only by the PostgreSQL server.

psql -h localhost -p 7432 -U cloudera-scm -l

Password for user cloudera-scm: <password>

                                      List of databases
        Name        |       Owner        | Encoding |  Collate   |   Ctype    |      Access
--------------------+--------------------+----------+------------+------------+-------------------
 amon               | amon               | UTF8     | en_US.UTF8 | en_US.UTF8 |
 hive               | hive               | UTF8     | en_US.UTF8 | en_US.UTF8 |
 hue                | hue                | UTF8     | en_US.UTF8 | en_US.UTF8 |
 nav                | nav                | UTF8     | en_US.UTF8 | en_US.UTF8 |
 navms              | navms              | UTF8     | en_US.UTF8 | en_US.UTF8 |
 oozie_oozie_server | oozie_oozie_server | UTF8     | en_US.UTF8 | en_US.UTF8 |
 postgres           | cloudera-scm       | UTF8     | en_US.UTF8 | en_US.UTF8 |
 rman               | rman               | UTF8     | en_US.UTF8 | en_US.UTF8 |
 scm                | scm                | UTF8     | en_US.UTF8 | en_US.UTF8 |
 sentry             | sentry             | UTF8     | en_US.UTF8 | en_US.UTF8 |
 template0          | cloudera-scm       | UTF8     | en_US.UTF8 | en_US.UTF8 | =c/"cloudera-scm"
 template1          | cloudera-scm       | UTF8     | en_US.UTF8 | en_US.UTF8 | =c/"cloudera-scm"
(12 rows)

You should now have a list of all roles and database names that use the embedded database server, and you are ready to proceed with the transition of databases from the embedded database server to the external PostgreSQL database server.

Migrate Databases from the Embedded Database Server to the External PostgreSQL Database Server

While performing this procedure, ensure that the Cloudera Manager Agents remain running on all hosts. Unless otherwise specified, when prompted for a password use the cloudera-scm password.

Note: After completing this transition, you cannot delete the cloudera-scm postgres superuser unless you remove the access privileges for the migrated databases. At a minimum, you should change the cloudera-scm postgres superuser password.

1. In Cloudera Manager, stop the cluster services identified as using the embedded database server (see Identify Roles that Use the Embedded Database Server above). Be sure to stop the Cloudera Management Service as well. Also be sure to stop any services with dependencies on these services. The remaining services will continue to run without downtime.

Note: If you do not stop the services from within Cloudera Manager before stopping Cloudera Manager Server from the command line, they will continue to run and maintain a network connection to the embedded database server. If this occurs, the embedded database server will ignore any command-line stop commands (Step 2) and require that you manually kill the process, which in turn causes the services to crash instead of stopping cleanly.

2. Navigate to Hosts > All Hosts, and note the number of roles assigned to hosts and whether they are in a commissioned state. You will need this information later to validate that your scm database was migrated correctly.

3. Stop the Cloudera Manager Server. To stop the server:

sudo service cloudera-scm-server stop

4. Obtain and save the embedded database superuser password (you will need this password in subsequent steps) from the generated_password.txt file:

head -1 /var/lib/cloudera-scm-server-db/data/generated_password.txt

5. Export the PostgreSQL user roles from the embedded database server to ensure the correct users, permissions, and passwords are preserved for database access. Passwords are exported as an md5sum and are not visible in plain text. To export the database user roles (you will need the cloudera-scm user password):

pg_dumpall -h localhost -p 7432 -U cloudera-scm -v --roles-only -f "/var/tmp/cloudera_user_roles.sql"

6. Edit /var/tmp/cloudera_user_roles.sql to remove any CREATE ROLE and ALTER ROLE commands for databases not in your list. Leave the entries for cloudera-scm untouched, because this user role is used during the database import.
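Before editing, it can help to list the role statements the dump contains. This grep is a convenience sketch, not part of the documented steps:

# Show CREATE ROLE / ALTER ROLE statements with line numbers to guide the edit
grep -nE '^(CREATE|ALTER) ROLE' /var/tmp/cloudera_user_roles.sql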


7. Export the data from each of the databases on the list you created in Identify Roles that Use the Embedded Database Server:

pg_dump -F c -h localhost -p 7432 -U cloudera-scm [database_name] > /var/tmp/[database_name]_db_backup-$(date +%m-%d-%Y).dump

Following is a sample data export command for the scm database:

pg_dump -F c -h localhost -p 7432 -U cloudera-scm scm > /var/tmp/scm_db_backup-$(date +%m-%d-%Y).dump

Password:
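If your list contains several databases, a loop along the following lines avoids repeating the command by hand. This is a sketch: edit the database list to match the one you built earlier, and note that you are prompted for the cloudera-scm password on each iteration:

# Export each database on the list to /var/tmp with a dated file name
for db in scm amon hive hue nav navms oozie_oozie_server rman sentry; do
    pg_dump -F c -h localhost -p 7432 -U cloudera-scm "$db" \
        > "/var/tmp/${db}_db_backup-$(date +%m-%d-%Y).dump"
done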

8. Stop and disable the embedded database server:

service cloudera-scm-server-db stop
chkconfig cloudera-scm-server-db off

Confirm that the embedded database server is stopped:

netstat -at | grep 7432

9. Back up the Cloudera Manager Server database configuration file:

cp /etc/cloudera-scm-server/db.properties /etc/cloudera-scm-server/db.properties.embedded

10. Copy the file /var/tmp/cloudera_user_roles.sql and the database dump files from the embedded database server host to /var/tmp on the external database server host:

cd /var/tmp
scp cloudera_user_roles.sql *.dump <user>@<postgres-server>:/var/tmp

11. Import the PostgreSQL user roles into the external database server.

The external PostgreSQL database server superuser password is required to import the user roles. If the superuser role has been changed, you will be prompted for the username and password.

Note: Only run the command that applies to your context; do not execute both commands.

• To import users when using the default PostgreSQL superuser role:

sudo -u postgres psql -f /var/tmp/cloudera_user_roles.sql

• To import users when the superuser role has been changed:

psql -h <database-hostname> -p <database-port> -U <superuser> -f /var/tmp/cloudera_user_roles.sql

For example:

psql -h pg-server.example.com -p 5432 -U postgres -f /var/tmp/cloudera_user_roles.sql

Password for user postgres


12. Import the Cloudera Manager database on the external server. First copy the database dump files from the Cloudera Manager Server host to your external PostgreSQL database server (if you have not already done so in Step 10), and then import the database data:

Note: To successfully run the pg_restore command, there must be an existing database on the database server to complete the connection; the existing database will not be modified. If the -d <existing-database> option is not included, the pg_restore command will fail.

pg_restore -C -h <database-hostname> -p <database-port> -d <existing-database> -U cloudera-scm -v <data-file>

Repeat this import for each database.

The following example is for the scm database:

pg_restore -C -h pg-server.example.com -p 5432 -d postgres -U cloudera-scm -v /var/tmp/scm_db_backup-03-12-2018.dump

pg_restore: connecting to database for restore
Password:
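To repeat the import for every dump without retyping the command, a loop like the following works if you kept the naming convention from the export step. The host and port are the example values used above; adjust them for your environment:

# Restore each dump; -C creates the database named inside the dump file
for f in /var/tmp/*_db_backup-*.dump; do
    pg_restore -C -h pg-server.example.com -p 5432 -d postgres -U cloudera-scm -v "$f"
done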

13. Update the Cloudera Manager Server database configuration file to use the external database server. Edit the /etc/cloudera-scm-server/db.properties file as follows:

a. Update the com.cloudera.cmf.db.host value with the hostname and port number of the external database server.

b. Change the com.cloudera.cmf.db.setupType value from "EMBEDDED" to "EXTERNAL". (A sketch of the resulting file follows step 14.)

14. Start the Cloudera Manager Server and confirm it is working:

service cloudera-scm-server start

Note that if you start the Cloudera Manager GUI at this point, it may take up to five minutes after executing the start command before it becomes available.

In Cloudera Manager Server, navigate to Hosts > All Hosts and confirm the number of roles assigned to hosts (this number should match what you found in Step 2); also confirm that they are in a commissioned state that matches what you observed in Step 2.
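For reference, the edited /etc/cloudera-scm-server/db.properties from step 13 might look like the following sketch. The host name is illustrative, and properties not mentioned in step 13 (such as the database name, user, and password) keep their existing values:

# Illustrative values only -- keep your existing name/user/password entries
com.cloudera.cmf.db.type=postgresql
com.cloudera.cmf.db.host=pg-server.example.com:5432
com.cloudera.cmf.db.setupType=EXTERNAL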

15. Update the role configurations to use the external database hostname and port number. Only perform this task for services whose database has been migrated.

For Cloudera Management Services:

a. Select Cloudera Management Service > Configuration, and type "7432" in the Search field.
b. Change any database hostname properties from the embedded database to the external database hostname and port number.
c. Click Save Changes.

For the Oozie Service:

a. Select Oozie service > Configuration, and type "7432" in the Search field.
b. Change any database hostname properties from the embedded database to the external database hostname and port number.
c. Click Save Changes.

For Hive, Hue, and Sentry Services:

a. Select the specific service > Configuration, and type "database host" in the Search field.
b. Change the hostname from the embedded database name to the external database hostname.
c. Click Save Changes.

16. Start the Cloudera Management Service and confirm that all management services are up and no health tests are failing.


17. Start all services via the Cloudera Manager web UI. This should start all services that were stopped for the database transition. Confirm that all services are up and no health tests are failing.

18. On the embedded database server host, remove the embedded PostgreSQL database server:

a. Make a backup of the /var/lib/cloudera-scm-server-db/data directory:

tar czvf /var/tmp/embedded_db_data_backup-$(date +%m-%d-%Y).tgz /var/lib/cloudera-scm-server-db/data

b. Remove the embedded database package:

For RHEL/SLES:

rpm --erase cloudera-manager-server-db-2

For Ubuntu:

apt-get remove cloudera-manager-server-db-2

c. Delete the /var/lib/cloudera-scm-server-db/data directory.
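For example (destructive; run this only after verifying the backup archive created in step a):

sudo rm -rf /var/lib/cloudera-scm-server-db/data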

Using Replication Manager to migrate to CDP Private Cloud Base from CDH

When you use the Replication Manager service to migrate CDH clusters, follow these guidelines:

• Ensure that the supported source and target clusters, and their corresponding Cloudera Manager versions, are in sync with respect to the cluster configurations before starting the upgrade.

• Upgrade your target cluster to CDP Private Cloud Base first. Upgrading the target CDH cluster first ensures that your data on the source cluster is not corrupted or rendered invalid.

• Upgrade the source cluster to CDP Private Cloud Base after the data is migrated to the CDP Private Cloud Base (target) cluster. Then verify that both the source and target clusters are upgraded to CDP Private Cloud Base.

• For more information about upgrading CDH clusters, see Upgrading CDH to CDP Private Cloud Base.

Replication Manager considerations while using CDH clusters

Before you plan to upgrade your CDH clusters, make sure you are aware of the considerations pertaining to the Replication Manager service.

In a typical production environment where multiple replication schedules are underway, the cluster upgrade process does not interrupt data movement (migration). The only exception is when Hive replication policies are running or scheduled.

Attention: You cannot perform the cluster upgrade process while Hive replication policies are running or are scheduled to start shortly. Either abort the submitted Hive replication policies or wait for them to complete before you begin the cluster upgrade.

Cluster upgrade supported and unsupported features

After you upgrade your clusters, note the following supported and unsupported scenarios pertaining to the Replication Manager service.

Supported features

• For Sentry to Ranger replication, Cloudera Manager 6.3.1 is the supported version. Only non-high-availability (non-HA) NameNode configurations are supported.

• Replication works when the source cluster runs version 7.0.3 or 7.1.1.


Unsupported features

• With Cloudera Manager 6.3.1, high-availability (HA) NameNode configurations are not currently supported.
• Ranger to Ranger policy replication is not supported.
