DriveScale-CLOUDERA Reference Architecture


©2017 DriveScale Inc. All Rights Reserved.

Table of Contents

1. Executive Summary
2. Audience and Scope
3. DriveScale | Cloudera Enterprise Solution Overview
4. DriveScale Components Overview
4.1 Hardware: DriveScale Adapter Chassis with DriveScale Adapters
4.2 Software
5. Reference Architecture Details
5.1 Physical Cluster Component List
5.2 Logical Cluster Topology
5.3 Physical Cluster Topology
5.4 Cluster Management
5.5 Enabling Hadoop Virtualization Extensions
5.6 Disk and Filesystem Layout
5.7 OS Supportability/Compatibility Matrix
5.8 JBOD Supportability/Compatibility Matrix
6. Rack Scalability


7. References
8. Bill of Materials
9. Conclusion
10. Appendix A: Glossary of Terms
11. Appendix B: DriveScale Cluster Install
11.1 Configure Your Domains with DriveScale Central (DSC)
11.2 Set Up DMS Nodes
11.3 Set Up the DriveScale Adapter (DSA)
11.4 Start the DMS and Set Up Login to the DMS
11.5 Set Up Servers/DataNodes/MasterNodes
11.6 Tagging JBODs and Drives
11.7 Creating Server Nodes and Clusters from Templates
12. Appendix C: Cloudera Manager Install
12.1 Cloudera Manager Installation Procedure for Reference Architecture


1. Executive Summary

This document is a high-level design reference architecture guide for implementing Cloudera Enterprise on a DriveScale solution with industry standard servers and JBOD.

The reference architecture introduces the high-level hardware and software components included in the stack, then describes each component individually. It does not cover the Cloudera data components or their applications.

DriveScale Technology Overview

DriveScale is leading the charge in bringing hyperscale computing capabilities to mainstream enterprises. Its composable data center architecture transforms rigid data centers into flexible, responsive scale-out deployments. Using DriveScale, data center administrators can deploy independent pools of commodity compute and storage resources, automatically discover available assets, and combine and recombine these resources as needed. The solution is delivered through a set of on-premises and SaaS tools that coordinate across multiple levels of infrastructure. With DriveScale, Hadoop architects can more easily support Hadoop deployments of any size, as well as other modern application workloads.

DriveScale provides hardware and software that allow compute and storage to be deployed separately: commodity servers with minimal drives for the operating system, plus JBODs (Just a Bunch of Disks), with flexible binding of storage to compute in any ratio an application requires. As needs change, these bindings can be dissolved and reconfigured on demand, entirely under software control.

DriveScale builds a deep understanding of the data center's physical infrastructure and dynamics, and uses it to provide an integrated set of intelligence and automation tools that scale out the infrastructure while greatly simplifying and optimizing its operations.
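The bind-and-recombine model described above can be sketched in a few lines. This is purely illustrative Python, not the DriveScale API; the class and method names are hypothetical:

```python
# Illustrative sketch (not the DriveScale API): a composable pool that binds
# JBOD drives to compute nodes in software, and can dissolve the bindings.
class ResourcePool:
    def __init__(self, drives):
        self.free = set(drives)          # unbound JBOD drives
        self.bound = {}                  # node name -> list of bound drives

    def bind(self, node, count):
        """Attach `count` free drives to `node`; bindings are reversible."""
        if count > len(self.free):
            raise ValueError("not enough free drives in the pool")
        picked = [self.free.pop() for _ in range(count)]
        self.bound.setdefault(node, []).extend(picked)
        return picked

    def unbind(self, node):
        """Dissolve the node's bindings, returning its drives to the pool."""
        self.free.update(self.bound.pop(node, []))

pool = ResourcePool([f"jbod1-slot{i}" for i in range(60)])
pool.bind("datanode1", 16)    # one storage-to-compute ratio...
pool.bind("datanode2", 8)     # ...a different ratio, from the same pool
pool.unbind("datanode1")      # recombine on demand, under software control
```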

2. Audience and Scope

This reference architecture guide is intended for Hadoop and IT architects responsible for the design and deployment of on-premises Cloudera Enterprise solutions, as well as for Apache Hadoop administrators and architects and for data center architects and engineers who collaborate with specialists in that space.

3. DriveScale | Cloudera Enterprise Solution Overview

The DriveScale | Cloudera Enterprise solution addresses customers' ever-changing hardware requirements with a more flexible and dynamic hardware infrastructure that provides significant cost and operational benefits. It is designed with composability as the primary goal: saving money, improving utilization, and greatly simplifying the deployment of Hadoop clusters.

Hadoop is an Apache project developed in the Java programming language by a global community of contributors. Yahoo! has been the largest contributor to the project and uses Apache Hadoop extensively across its businesses. Core committers on the Hadoop project include employees from Cloudera, eBay, Facebook, Getopt, Hortonworks, Huawei, IBM, InMobi, INRIA, LinkedIn, MapR,


Microsoft, Pivotal, Twitter, UC Berkeley, VMware, WANdisco, and Yahoo!, with contributions from many more individuals and organizations.

Although Hadoop is popular and widely used, installing, configuring, and running a production Hadoop cluster involves many concerns, including:

• Choosing the appropriate Hadoop software distribution and extensions

• Installing monitoring and management software

• Allocating Hadoop services to physical nodes

• Selecting appropriate server hardware

• Rightsizing the storage configuration

• Implementing data locality

• Designing the network fabric

• Sizing and planning for system scalability

• Meeting overall performance goals

These concerns are complicated by the need to understand the workloads that will be running on the cluster, the fast-moving pace of the core Hadoop project, and the challenges to managing a system designed to scale to thousands of nodes in a single cluster.

The DriveScale | Cloudera Solution was designed by DriveScale in collaboration with Cloudera, and embodies all the hardware, software, resources and services needed to run Hadoop in a production environment. This end-to-end solution approach means that you can be in production with Hadoop in a shorter time than is typically possible with homegrown solutions. The solution is based on Cloudera Enterprise Data Hub 5.x (including Cloudera Distributed Hadoop), DriveScale hardware and software, industry standard servers, network switches and JBODs.

This solution includes components that span the entire solution stack:

• Reference architecture and best practices

• Optimized storage configurations

• Optimized network infrastructure

• Cloudera Enterprise Data Hub including Cloudera Distributed Hadoop

This solution is designed to address the clear majority of Apache Hadoop use cases including, but not limited to:

• Big data analytics

• ETL Offload

• Data Warehouse Optimization

• Batch processing of unstructured data

• Big data visualization

• Search and predictive analysis


4. DriveScale Components Overview

The DriveScale system is composed of one hardware component and four software components, described below.

4.1 Hardware: DriveScale Adapter Chassis with DriveScale Adapters

This is a 1U appliance with adapters that connect to servers via 10Gb Ethernet interfaces and to JBODs via SAS interfaces.

4.2 Software

There are four principal components of the DriveScale software:

a) DriveScale Management Server (DMS)

• The server running the DMS software bundle is called the DMS node.

• A typical deployment consists of three DMSs in a clustered configuration for high availability (HA).

• The software manages and configures resources, and contains the inventory/configuration repository and database:

  – Inventory: DMSs, DS Adapters, switches, JBOD chassis, disks, server nodes

  – Configuration: node templates, cluster templates, configured clusters

  – DMS database: used as a message bus to communicate with the endpoints

b) DriveScale Server Agent

• The DriveScale Server Agent's discovery action inventories server hardware and creates the mappings between server nodes and the disks they consume.

c) DriveScale Central (DSC)

Cloud-based software management portal that acts as the:

• software distribution repository for subscribers

• DriveScale key repository

• centralized log file repository

• user documentation repository

• license manager

d) DriveScale Adapter Firmware

• The DriveScale Adapter firmware enables JBOD drives to be mapped to servers over the network and used as if they were local drives.
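The node and cluster templates that the DMS manages (see the inventory/configuration list above) can be pictured as a small data model. The names and fields here are hypothetical illustrations, not the actual DMS schema:

```python
# Hypothetical sketch of the DMS template model: a node template fixes the
# drive allocation per node, and a cluster template fixes the node count.
from dataclasses import dataclass

@dataclass
class NodeTemplate:
    name: str
    drives_per_node: int
    drive_capacity_tb: int

@dataclass
class ClusterTemplate:
    name: str
    node_template: NodeTemplate
    node_count: int

    def raw_capacity_tb(self) -> int:
        # Total raw capacity the DMS would carve out of the JBOD pool.
        t = self.node_template
        return self.node_count * t.drives_per_node * t.drive_capacity_tb

worker = NodeTemplate("worker", drives_per_node=16, drive_capacity_tb=2)
cluster = ClusterTemplate("cdh-lab", worker, node_count=5)
print(cluster.raw_capacity_tb())  # 160 TB raw across five data nodes
```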

5. Reference Architecture Details

5.1 Physical Cluster Component List

The following table lists the physical components for the cluster.

| Component | Configuration | Description | Quantity |
|---|---|---|---|
| DriveScale Adapter Chassis | DHCP, jumbo frames enabled | 1U appliance with adapters that connect to servers via Ethernet and to JBODs via SAS. | 2 |
| DriveScale Adapter | DHCP, jumbo frames enabled | Provides the data network. | 4 per chassis |
| DriveScale Management Server (DMS) | DMS running as a VM | Manages and configures the nodes and cluster; stores the inventory/configuration repository for every piece of hardware in the cluster. | Min 1; for HA, 3 DMSs configured as master and slaves |
| Servers | 2-socket CPU and memory sized to the individual Hadoop cluster requirements | Commodity x86 servers that house the NodeManagers, compute instances, and DriveScale agents. | Min 3 master nodes + 5 data nodes |
| HDD for servers | 2 drives configured in RAID 1 | Internal drives used for the OS install. | 2 per server |
| NICs | Dual-port 10 Gbps Ethernet SFP+ | Provide the data network. | 1 per server |
| JBOD chassis | Default configuration | Houses the drives, with dual I/O controllers. | Min 2; Cloudera recommends 3 for production |
| HDD for JBOD | Default configuration | Drives that house the cluster data. | Depends on cluster requirements |
| ToR 10G switch | LLDP, MLAG, 9K jumbo frames configured | Provides data network connectivity. | 2 per rack |
| ToR 1G switch | Default configuration | Provides management network connectivity. | 1 per rack |

5.2 Logical Cluster Topology

The minimum requirements to build out the cluster are:

• 3 Master Nodes

• 5 Data Nodes

• 1 DriveScale Adapter Chassis

• 1 DriveScale Management Server

• 2 10G Switches

• 1 1G Switch

• 2 JBOD chassis with drives

This reference architecture is built on 3 master nodes and 5 data nodes, with 2 JBOD chassis and 126 HDDs of 1, 2, or 3 TB each. The following table lists the server configurations and the number of drives used.

For clusters that require maximum read bandwidth from every attached drive concurrently, we recommend configuring nodes with a maximum of 8 drives each, assuming 2 x 10 Gbps of Ethernet bandwidth per node. This is an extreme case, however. The general rule of thumb for the number of drives to allocate to each node depends on the application, but it is safe to allocate up to 16 drives per node, again assuming 2 x 10 Gbps of Ethernet bandwidth per node.

With the availability of quad-port 10Gbps Ethernet adapters, one can add significantly higher I/O per node and therefore greater numbers of drives as well.
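The arithmetic behind the 8- and 16-drive guidance can be checked back-of-envelope. The per-drive throughput and link-efficiency figures below are assumptions for illustration, not measured values from this reference architecture:

```python
# Back-of-envelope check of the drive-count guidance. Assumes ~250 MB/s
# sequential read per NL-SAS drive and ~80% usable NIC bandwidth after
# protocol overhead; both figures are illustrative assumptions.
def max_concurrent_drives(nic_gbps, drive_mbps=250, efficiency=0.8):
    """Drives a node can read at full speed before its NICs saturate."""
    usable_mbps = nic_gbps * 1000 / 8 * efficiency  # Gbps -> MB/s, minus overhead
    return int(usable_mbps // drive_mbps)

print(max_concurrent_drives(20))   # 2 x 10GbE  -> 8 drives at full read bandwidth
print(max_concurrent_drives(40))   # 4 x 10GbE  -> 16 drives
```

Under these assumptions, dual-port 10GbE supports about 8 concurrently saturated drives, and a quad-port adapter roughly doubles that, matching the quad-port note above.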

| Component | Configuration | Description | Quantity |
|---|---|---|---|
| Master nodes | 2-socket, 8-core CPU; 64 GB RAM; 10GbE Intel NIC; 2 internal HDDs for the OS and 4 high-capacity HDDs mounted from the JBOD | Master nodes host the Cloudera master services and DriveScale agents. | 3 |
| Worker nodes | 2-socket, 8-core CPU; 64 GB RAM; 10GbE Intel NIC; 2 internal HDDs for the OS and 16 high-capacity HDDs mounted from the JBOD; for Impala nodes, a minimum of 128 GB RAM | Data nodes house the HDFS DataNodes, YARN NodeManagers, any additional required services, and DriveScale agents. | 5 |

Notes:

- Customers with higher (or lower) compute needs can acquire bigger (or smaller) data nodes configured with CPU and memory that fits the specific requirements of their applications.

- Similarly, depending on the data requirements, customers can add or remove disk drives to match the specific needs of their applications.

The following table identifies the service roles on each node type.

| Service | Master Node 1 | Master Node 2 | Master Node 3 | Worker Node |
|---|---|---|---|---|
| ZooKeeper | ZooKeeper | ZooKeeper | ZooKeeper | ZooKeeper |
| YARN | ResourceManager | ResourceManager | History Server | NodeManager |
| Hive | MetaStore, WebHCat, HiveServer2 | | | |
| Management (misc.) | Cloudera Agent, Navigator | Cloudera Agent, Navigator, Key Management Services, HUE | Cloudera Agent, Oozie, Cloudera Manager, Management Services, HUE | Cloudera Agent |
| HBase | HMaster | HMaster | HMaster | RegionServer |
| Impala | StateStore, Catalog | | | Impala Daemon |
| Search | Solr | | | |
| Kafka | Broker | | | |
| Spark | History Server | | | Runs on YARN |
| HDFS | NameNode, QJN | NameNode, QJN | QJN | DataNode |

5.3 Physical Cluster Topology

Diagram 1: DriveScale lab Architecture with 2xDSA Chassis (8x Adapters in use), 2x JBOD, 3 Master Nodes and 5 Data Nodes


Diagram 2: DriveScale lab Architecture with 3xDSA Chassis (12x Adapters in use), 2x JBOD, 3 Master Nodes and 5 Data Nodes

Notes:

- The 1GbE management connections to the 10GbE switches, DSA chassis, and JBODs are omitted from the diagram for readability.

- The 1GbE connection is used only for server management via the BMC (Dell iDRAC or HPE iLO). It is not part of the Hadoop network. Note that multi-homed clusters are not supported by Cloudera.

- The SAS connections from DSA chassis 2 to JBOD 2 mirror those from DSA chassis 1 to JBOD 1; they are likewise omitted from the diagram to reduce congestion.

- The drives for the master and data nodes were distributed across the two JBOD chassis.


5.4 Cluster Management

This section details the steps for setting up a DriveScale enabled Hadoop cluster using Cloudera manager.

Setting up DriveScale cluster

Before installing Cloudera Manager or using an existing install of Cloudera Manager, you must complete the following tasks for setting up the DriveScale solution:

1. Rack and install the DriveScale Adapter chassis and controllers (DSAs) using the documentation provided by DriveScale.

2. Rack and install the JBODs using the documentation provided by the vendor.

3. Rack and install the servers using the documentation provided by the vendor.

4. Create a RAID 1 array from the internal HDDs and install the OS on all servers.

5. Install and configure the DriveScale Management Server (DMS), either as a VM or on a standalone server.

6. Set up the DSA configuration from the DMS.

7. Install and configure the DriveScale agents on the master and data nodes.

8. Create master/data node and cluster templates with the required drives using the DMS.

9. Create the cluster from the templates using the DMS.

10. Ensure that the DriveScale cluster is up and running before proceeding.


Setting up Cloudera cluster

1. After completing the steps above, install Cloudera Manager using the Cloudera CDH installation guide.

2. Ensure that the Cloudera HDFS cluster is set up in high-availability mode.

3. The following services were set up for this reference architecture.

• HDFS

• HBase

• Hive

• Hue

• Impala

• Kafka

• Oozie

• Solr

• Spark

• YARN

• ZooKeeper


4. Ensure that the master and data nodes are up and running with the right assigned roles and storage.

5.5 Enabling Hadoop Virtualization Extensions

The DriveScale solution enables configuration of a highly available Hadoop cluster, including rack awareness. Hadoop Virtualization Extensions (HVE) give customers additional capabilities for failure mitigation and rack awareness, enabling the cluster to survive the worst-case scenario of a total power or hardware failure of any component, including extended JBOD failures. HVE can be enabled in Cloudera Manager; to do so, follow Cloudera's HVE documentation. The steps we followed for this reference architecture are given below.

For this reference architecture, the master and data node hostnames are:

Node Types Server Names

Master Nodes u32.data1.r3.hq.drivescale.com

u33.data1.r3.hq.drivescale.com

u34.data1.r3.hq.drivescale.com

Data Nodes u27.data1.r3.hq.drivescale.com

u28.data1.r3.hq.drivescale.com

u29.data1.r3.hq.drivescale.com

u30.data1.r3.hq.drivescale.com

u31.data1.r3.hq.drivescale.com


1. Go to the Cloudera Manager.

a) Configure the following safety valves based on your environment:

o HDFS

In the HDFS core-site.xml safety valve, add:

<property>
  <name>net.topology.impl</name>
  <value>org.apache.hadoop.net.NetworkTopologyWithNodeGroup</value>
</property>
<property>
  <name>net.topology.nodegroup.aware</name>
  <value>true</value>
</property>
<property>
  <name>dfs.block.replicator.classname</name>
  <value>org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyWithNodeGroup</value>
</property>

o YARN

In the YARN Service MapReduce Advanced Configuration Snippet (Safety Valve), add the following properties and values:

<property>
  <name>mapred.jobtracker.nodegroup.aware</name>
  <value>true</value>
</property>
<property>
  <name>mapred.task.cache.levels</name>
  <value>3</value>
</property>


b) Based on the number of JBODs and data nodes, create a minimum of 3 zones with at least 1 or 2 data nodes in each zone.

Notes:

- If the replication factor required in your environment is 3, a minimum of 3 zones is required when setting up HVE, because only one copy of the data is stored in each HVE zone.


c) For this reference architecture, we created 4 zones with a minimum of 1 or 2 data nodes in each zone. Refer to the notes section below for detailed reasoning.

Notes:

- For this reference architecture, each of the JBODs has two drawers of drives, and there are two expanders in each of the four drawers across the 2 JBODs. All the drives in each drawer were tagged using the DMS UI.

- Nodes belonging to the same zone were created from drives in the same JBOD drawer.

- The table below lists all the nodes with the JBOD and drawer ID along with the zone ID.

| Node Type | Server Name | JBOD/Drawer ID | Zone ID |
|---|---|---|---|
| Master | u32.data1.r3.hq.drivescale.com | J1D1 | 1 |
| Master | u33.data1.r3.hq.drivescale.com | J1D2 | 2 |
| Master | u34.data1.r3.hq.drivescale.com | J2D1 | 3 |
| Data | u27.data1.r3.hq.drivescale.com | J1D1 | 1 |
| Data | u28.data1.r3.hq.drivescale.com | J1D2 | 2 |
| Data | u29.data1.r3.hq.drivescale.com | J2D1 | 3 |
| Data | u30.data1.r3.hq.drivescale.com | J2D2 | 4 |
| Data | u31.data1.r3.hq.drivescale.com | J2D2 | 4 |

d) Select Hosts -> All hosts.

e) Select hosts u27 and u32.

f) Click on Action: Assign Rack


g) Assign rack name /default/zone1

h) Select hosts u28 and u33

i) Click on Action: Assign Rack

j) Assign rack name /default/zone2

k) Select hosts u29 and u34

l) Click on Action: Assign Rack

m) Assign rack name /default/zone3

n) Select hosts u30 and u31


o) Click on Action: Assign Rack

p) Assign rack name /default/zone4

2. Go back to Cloudera Manager and Hosts to see the new HVE configuration changes.
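The zone assignment performed above through the Cloudera Manager UI is equivalent to what a Hadoop topology script would report. The sketch below is illustrative only (this reference architecture uses the UI, not a script): Hadoop's `net.topology.script.file.name` mechanism invokes such a script with hostnames and expects one rack path per argument on stdout.

```python
# Sketch of the zone assignment in section 5.5 expressed as a Hadoop
# topology script. Illustrative: this RA assigns racks via the Cloudera
# Manager UI rather than net.topology.script.file.name.
import sys

# Host-to-zone map taken from the JBOD/drawer table above (short hostnames).
ZONES = {
    "u27": 1, "u32": 1,
    "u28": 2, "u33": 2,
    "u29": 3, "u34": 3,
    "u30": 4, "u31": 4,
}

def rack_for(host):
    """Return the /default/zoneN rack path for a (possibly fully qualified) host."""
    short = host.split(".")[0]
    zone = ZONES.get(short)
    return f"/default/zone{zone}" if zone else "/default/unknown"

if __name__ == "__main__":
    # Hadoop passes one or more hosts and reads one rack path per argument.
    print(" ".join(rack_for(h) for h in sys.argv[1:]))
```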

5.6 Disk and Filesystem Layout

| Node/Role | Filesystem | Description |
|---|---|---|
| Management/Master | ext4 | 1/2/3 TB drives mounted from the JBODs |
| YARN NodeManager nodes | ext4 | 1/2/3 TB drives mounted from the JBODs |
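The JBOD-mounted ext4 data drives above end up as ordinary mounts on each node. A hedged sketch of the corresponding /etc/fstab entries follows; the labels and mount points are illustrative, and `noatime` is a common Hadoop mount option rather than a requirement of this reference architecture:

```shell
# /etc/fstab entries for JBOD-mounted ext4 data drives (illustrative labels
# and mount points; one entry per drive mounted from the JBOD)
LABEL=data1  /data/1  ext4  defaults,noatime  0 0
LABEL=data2  /data/2  ext4  defaults,noatime  0 0
```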

5.7 OS Supportability/Compatibility Matrix

| OS | DMS | Server Nodes |
|---|---|---|
| CentOS/RHEL 6.x | X | X |
| CentOS/RHEL 7.x | X | X |
| Ubuntu 14.04 | X | X |


5.8 JBOD Supportability/Compatibility Matrix

With the DriveScale solution, we recommend that customers use high-capacity JBODs with dual hot-pluggable I/O controllers (expanders) and sufficient upstream bandwidth. The JBODs should also have dual hot-pluggable redundant power supplies. DriveScale has evaluated and tested several vendor offerings for redundancy, management functionality, and performance. The table below lists the JBOD vendors and model numbers certified by DriveScale.

| JBOD Vendor | Model Number |
|---|---|
| Dell | PowerVault MD3060e: 3.5" and 2.5", 60 bays, 4U, redundant expanders, 2 x 3 x mini-SAS 6G |
| Hewlett Packard Enterprise | D6020: 3.5", 70 bays, 5U, quad expanders, 4 x 2 x mini-SAS 12G |
| | D6000: 3.5", 70 bays, 5U, quad expanders, 4 x 2 x mini-SAS 6G |
| RAID Inc./Newisys | NDS-4600/4603: 3.5", 60 bays, 4U, redundant expanders, 2 x 4 x mini-SAS 6G |
| | NDS-2241: 2.5", 24 bays, 2U, redundant expanders, 2 x 3 x mini-SAS 6G |
| | NDS-4900: 3.5", 90/96 bays, 4U, redundant expanders, 2 x 6 x mini-SAS-HD 12G |
| | NDS-4900: 3.5", 84 bays, 4U, redundant expanders, 2 x 5 x mini-SAS-HD 12G |
| Quanta (QCT) | M6400H: 3.5", 60 bays, 4U, redundant expanders, 2 x 4 x mini-SAS 6G |
| | JB4602: 3.5", 60 bays, 4U, redundant expanders, 2 x 4 x mini-SAS 12G |
| Promise Inc. | J5300s: 3.5", 12 bays, 2U, redundant expanders, 2 x 2 x mini-SAS-HD 12G |
| | J5320s: 2.5", 24 bays, 2U, redundant expanders, 2 x 2 x mini-SAS-HD 12G |
| | J5600: 3.5", 16 bays, 3U, redundant expanders, 2 x 2 x mini-SAS-HD 12G |
| | J5800: 3.5", 24 bays, 4U, redundant expanders, 2 x 2 x mini-SAS-HD 12G |


6. Rack Scalability

Customers can scale beyond one rack in a straightforward manner to expand compute and storage resources as application needs grow, and can change or maintain the compute-to-storage ratio for new or existing racks. For every JBOD added, a DriveScale Adapter chassis with four adapters must be added as well. Since drives are assigned to servers within the same rack, scaling is achieved simply by adding more racks with servers, DriveScale Adapters, switches, and JBODs.
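The per-rack scaling rule above can be sketched as arithmetic: because each rack carries its own servers and JBODs, adding racks preserves the compute-to-storage ratio. The per-rack counts and 60-bay JBOD figure below are illustrative assumptions, not prescriptions from this reference architecture:

```python
# Illustrative check that rack-at-a-time scaling preserves the
# compute-to-storage ratio (figures are assumptions, not requirements).
def rack_totals(racks, servers_per_rack, jbods_per_rack, drives_per_jbod=60):
    """Total servers, total drives, and drives-per-server for N racks."""
    servers = racks * servers_per_rack
    drives = racks * jbods_per_rack * drives_per_jbod
    return servers, drives, drives / servers

print(rack_totals(1, 8, 2))   # one rack
print(rack_totals(3, 8, 2))   # three racks: same drives-per-server ratio
```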

Diagram 3: DriveScale Rack Scalability

7. References

1. Cloudera Manager Installation Guide

https://www.cloudera.com/documentation/enterprise/latest/topics/installation.html

2. Cloudera High Availability documentation

https://www.cloudera.com/documentation/enterprise/5-9-x/topics/admin_ha.html

3. High Availability for Other CDH components

https://www.cloudera.com/documentation/enterprise/latest/topics/cdh_hag_cdh_other_ha.html#xd_583c10bfdbd326ba--6eed2fb8-14349d04bee--7d18

4. Cloudera Multihoming support documentation

https://www.cloudera.com/documentation/enterprise/release-notes/topics/rn_consolidated_pcm.html#cdh_cm_network_security


8. Bill of Materials

| Server Components | Quantity |
|---|---|
| Intel Xeon processor-based servers with dual- or quad-port 10GbE SFP+ NICs; exact CPU models, socket counts, and memory are based on customer application needs | Depends on customer application needs |

| JBOD Components | Quantity |
|---|---|
| DriveScale-certified JBODs | Depends on customer application needs |
| NL-SAS HDDs | Depends on customer application needs |

| Switches | Quantity |
|---|---|
| DriveScale-certified 10GbE SFP+ switches | An even number of switches for a redundant switch fabric |
| 1GBase-T switch | Based on the number of servers and JBODs in the configuration |

| DriveScale Components | Quantity |
|---|---|
| DriveScale Adapter Chassis | One for each JBOD |
| DriveScale Adapter | Four for each DSA chassis |

| Software | Version |
|---|---|
| CentOS | See section 5.7 |
| DriveScale Adapter | 1.3 |
| CDH | 5.10.0 |


9. Conclusion

The DriveScale-Cloudera reference architecture guide provides an overview of the combined solution and its key components, along with details on how to install and set up clusters using these technologies.

10. Appendix A: Glossary of Terms

Term Description

Data Node Worker nodes of the cluster to which the HDFS data is written.

DSA DriveScale Adapter. A 1RU Ethernet-to-SAS adapter that bridges 10 Gbps Ethernet-connected compute resources to JBODs full of commodity disks.

DSC DriveScale Central. A web-based user interface to the DriveScale cloud that performs DriveScale account management. DSC is where you download the keys that enable installation of the DriveScale software, and where you set up your DriveScale Management Domain(s) (DMDs): create a domain, select and configure the DMS nodes for the domain, and select a chassis (with its associated DriveScale Adapters, DSAs) for the domain.

DMS DriveScale Management Server. DriveScale Management Server is the server that runs the bundle of software (service) that manages a set of Physical Resources to enable the DriveScale services. DriveScale Manager is the web-based user interface to the DMS.

HBA Host bus adapter. An I/O controller that is used to interface a host with storage devices.

HDD Hard disk drive.

HDFS Hadoop Distributed File System.

High Availability Configuration that addresses availability issues in a cluster. In a standard configuration, the Name Node is a single point of failure (SPOF). Each cluster has a single Name Node, and if that machine or process becomes unavailable, the cluster is unavailable until the Name Node is either restarted or brought up on a new host. The secondary Name Node does not provide failover capability. High availability enables running two Name Nodes in the same cluster: the active Name Node and the standby Name Node. The standby Name Node allows a fast failover to a new Name Node in case of machine crash or planned maintenance.

JBOD Just a bunch of disks. A JBOD chassis hosts many HDDs and two redundant SAS switches (also called controllers). The SAS switches provide dual-path access to each of the HDDs in the chassis through multiple Mini-SAS HD interface connectors.


Job History Server Process that archives job metrics and metadata. One per cluster.

MLAG Multi-chassis Link Aggregation. MLAG is the ability of two or more switches to act like a single switch when forming link bundles.

Name Node The metadata master of HDFS essential for the integrity and proper functioning of the distributed filesystem.

NIC Network interface card.

Node Manager The process that starts application processes and manages resources on the Data Nodes.

NUMA Non-uniform memory access. Addresses variable memory access latency in multi-socket servers. This is typical of SMP (symmetric multiprocessing) systems, and there are several strategies to optimize applications and operating systems. vSphere ESXi can be optimized for NUMA. It can also present the NUMA architecture to the virtualized guest OS, which can then leverage it to optimize memory access. This is called vNUMA.

PDU Power distribution unit.

QJM Quorum Journal Manager. Provides a fencing mechanism for high availability in a Hadoop cluster. This service is used to distribute HDFS edit logs to multiple hosts (at least three are required) from the active Name Node. The standby Name Node reads the edits from the Journal Nodes and constantly applies them to its own namespace. In case of a failover, the standby Name Node applies all the edits from the Journal Nodes before promoting itself to the active state.

QJN Quorum Journal Nodes. Nodes on which the journal services are installed.

RM Resource Manager. The resource management component of YARN. This initiates application startup and controls scheduling on the Data Nodes of the cluster (one instance per cluster).

ToR Top of rack.

ZK ZooKeeper. A centralized service for maintaining configuration information, naming, and providing distributed synchronization and group services.


11. Appendix B: DriveScale Cluster Install

Notes:

- You must complete the racking and cabling of the servers, DSAs, JBODs, and switches per the details in the installation guide.

- You must obtain the DSC credentials, which are shipped with the hardware.

- You must decide whether to set up your DMS as one standalone server or as a high-availability cluster with three servers. When three are used, the Management Domain can survive the failure of any one of the DMS machines.

- The DMS servers should be configured with at least 32-64 GB of memory.

- A DHCP server must be present on the 10G/1G network.

- Access to the DHCP administrator/server to get the IP address(es) of the DSA(s) based on the MAC address(es) of the DSA(s).

- The network address(es) of your DMS server(s).

- The network address(es) of your DSA(s).

- The network addresses of your compute servers.
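Given the MAC addresses, the DSA management IPs can be recovered from the DHCP server's lease file. A minimal sketch, assuming ISC dhcpd lease syntax; the MAC addresses, IPs, and file path below are illustrative only:

```shell
#!/bin/sh
# Sketch: look up the IP a DHCP server leased to a DSA, given its MAC.
# Assumes ISC dhcpd lease-file syntax; the MACs and IPs below are made up.
find_dsa_ip() {
    # $1 = MAC address, $2 = path to dhcpd.leases
    awk -v mac="$1" '
        /^lease /           { ip = $2 }            # remember the lease IP
        /hardware ethernet/ { gsub(/;/, "", $3)    # strip trailing ";"
                              if ($3 == mac) print ip }
    ' "$2" | tail -n 1                             # most recent lease wins
}

# Demo against a fabricated leases fragment:
cat > /tmp/demo.leases <<'EOF'
lease 192.168.10.41 {
  hardware ethernet 00:11:22:33:44:55;
}
lease 192.168.10.42 {
  hardware ethernet 00:11:22:33:44:66;
}
EOF
find_dsa_ip 00:11:22:33:44:66 /tmp/demo.leases
```

On a real DMS node, point the function at the DHCP server's actual leases file instead of the demo fragment.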

11.1 Configure Your Domains with DriveScale Central (DSC)

1. Log in to DSC with the credentials obtained from DS.

a) Go to https://central.drivescale.com/

b) Log in using the credentials provided to you by DriveScale.

c) A checklist of the tasks that need to be accomplished appears on the main DSC page.


2. Go to the Domains link in the left navigation panel and click on Create Domain.

3. Fill in the name, FQDN, and any notes for the domain. Click on Create.

4. Go to the Downloads link in the left navigation panel and download the config.Training, ds-dms-keys-xxx.rpm, and ds-repo-xxx.rpm files to your local machine.


11.2 Set up DMS nodes

1. On each of the DMS machines, copy and then install the repo and keys RPM packages. Download the files using the WinSCP tool, or use the scp command if you are on a Linux machine.

scp ds-* [email protected]:/tmp

rpm -ivh ds-repo-*

rpm -ivh ds-dms*

2. On each of the DMS machines, copy the config.Training file to /etc/drivescale/conf. Download the file using the WinSCP tool, or use the scp command if you are on a Linux machine.

scp config.Training [email protected]:/tmp

cp /tmp/config.Training /etc/drivescale/conf

3. Install the DMS server on all the DMS machines using the yum install command. Yum automatically installs the DMS server from the repo.

yum -y install dms-ds

11.3 Set up DriveScale Adapter (DSA)

1. Log into one of the DMS machines using ssh.

2. Set up the DriveScale Management Domain configuration for each DSA using the same config file (config.Training in this example) that was used on the DMSes.

3. This is done via the /opt/drivescale/bin/dsa command installed on each DMS. The default DSA username and password are admin/admin.

4. Run the command listed below to check the current configuration and settings of DSA.

/opt/drivescale/bin/dsa --username admin --password admin --adapter <management IP of DSA adapter> service showconf

5. To verify the 10 Gbps interface IP addresses, run the following command:

/opt/drivescale/bin/dsa --username admin --password admin --adapter <management IP of the DSA> net show --interface 10gBond

6. To verify the management IP, gateway, etc. of the DSA when only the FQDN is known, run:

/opt/drivescale/bin/dsa --username admin --password admin --adapter <DSA Adapter Management FQDN > net show --interface mgmt.

7. Push the config from the DMS to the DSA

/opt/drivescale/bin/dsa --username admin --password admin --adapter <management IP of the DSA> service config --file /tmp/config.Training

8. Restart DSA

/opt/drivescale/bin/dsa --username admin --password admin --adapter <management IP of the DSA> service restart
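Steps 4-8 are repeated once per DSA, so a hypothetical wrapper (not part of the DriveScale tooling) can run the sequence across all adapters. The IP list, credentials, and config path are placeholders for your environment; with DRY_RUN=1 (the default here) it only prints the commands for review:

```shell
#!/bin/sh
# Hypothetical wrapper: repeat the showconf / config / restart sequence of
# steps 4-8 for every DSA. IPs, credentials, and config path are placeholders.
configure_dsas() {
    DSA=/opt/drivescale/bin/dsa
    CONF=/tmp/config.Training
    DRY_RUN=${DRY_RUN:-1}    # set DRY_RUN=0 to actually execute
    for ip in 192.168.10.21 192.168.10.22; do    # your DSA management IPs
        for action in "service showconf" "service config --file $CONF" "service restart"; do
            cmd="$DSA --username admin --password admin --adapter $ip $action"
            if [ "$DRY_RUN" = 1 ]; then echo "$cmd"; else $cmd; fi
        done
    done
}
configure_dsas
```

Review the printed commands, then re-run with DRY_RUN=0 on a DMS node to apply them.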


11.4 Start the DMS and set up login to the DMS

1. Log into one of the DMS machines using ssh.

2. Start the drivescale service by running the following command:

service drivescale start

3. Set up the DMS in SET_UP_MODE by entering the following command:

/opt/drivescale/bin/setup-mode

4. Log in to the DMS UI using the FQDN or IP address of the DMS.

5. For first-time use, create a username and password.

6. To use internal authentication, provide the username for the initial admin (superuser), then enter and confirm the password. The first name and last name are optional.

7. Click Configure DMS to create the Admin user on the DMS and to configure the authentication method.

11.5 Set up Servers/DataNodes/MasterNodes

Notes:

• The two 10GbE ports are bonded into a single logical port on the DSA. You need to bond the two 10GbE ports on the server nodes as well.

• Create the interface configuration file for the bond on the server. (The example below uses Debian-style /etc/network/interfaces syntax; adapt it for your distribution.)

auto bond0

iface bond0 inet static

slaves eth2 eth3

bond_miimon 100

bond_mode 802.3ad

bond_xmit_hash_policy layer2

pre-up ifconfig eth2 mtu 9000 && ifconfig eth3 mtu 9000

mtu 9000

auto vmbr1

iface vmbr1 inet static

address <IP for the 20G Bond interface>

netmask 255.255.240.0

bridge_ports bond0

bridge_stp off

bridge_fd 0

pre-up ifconfig eth2 mtu 9000

pre-up ifconfig eth3 mtu 9000

pre-up ifconfig bond0 mtu 9000

mtu 9000
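Once the bond is up, its negotiated mode and link state can be sanity-checked from the kernel's bonding status file. A sketch that parses a captured snippet here; on a live node, point it at /proc/net/bonding/bond0 instead:

```shell
#!/bin/sh
# Sketch: report bonding mode and overall MII status from a bonding status
# file (normally /proc/net/bonding/bond0; a captured snippet is used here).
check_bond() {
    awk -F': ' '
        /^Bonding Mode/ { print "mode=" $2 }
        /^MII Status/   { print "mii=" $2; exit }  # first MII Status line is the bond itself
    ' "$1"
}

# Demo against a captured status snippet:
cat > /tmp/bond0.sample <<'EOF'
Ethernet Channel Bonding Driver: v3.7.1
Bonding Mode: IEEE 802.3ad Dynamic link aggregation
MII Status: up
MII Polling Interval (ms): 100
EOF
check_bond /tmp/bond0.sample
# On a server node: check_bond /proc/net/bonding/bond0
```

For the configuration above you would expect the 802.3ad mode line and an "up" MII status.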


1. On each of the server machines, copy and then install the repo RPM package. Download the file using the WinSCP tool, or use the scp command if you are on a Linux machine.

scp ds-* [email protected]:/tmp

rpm -ivh ds-repo-*

2. Install the server software on all the server machines using the yum install command. Yum automatically installs the server software from the repo.

yum -y install ds-server

3. Start the drivescale service.

service drivescale start

4. Remove the temporary ds-repo*, ds-dms*, and /tmp/config.Training files from each of the server and DMS machines.

11.6 Tagging JBOD and drives

1. Use the API documentation to build the JBOD tagging script.

2. Create a script to tag drives from each JBOD with a different tag.

3. Below is an example of the script. You can run the script from the DMS server.

#!/bin/bash

list='5000c500560df547
5000c50003353ef7
5000c500033108ff
5000c50040b201db
5000c5000336dc2f
5000c50040b20167
5000c50034ee5887
5000c500035d7eb3
5000c500033aa843
5000c500033aaf9b
5000c500350a41bb
5000c50055a592db
5000c500033ae8f3
5000c50040e9fcbf
5000c50040ab7d53
5000c5003503a267
5000c50040b230fb
5000c50040a6cf3b
5000c50040aa27f3
5000c50040b2aa77'

for api in $list
do
  curl -u [email protected]:Admin11! -k -X PATCH \
    --header 'Content-Type: application/vnd.drivescale.v2+json' \
    --header 'Accept: application/json' \
    -d '{ "tags": ["6000_1"] }' \
    'https://192.168.10.35/ds/entity/Drive/iqn.2013-04.com.drivescale%3Awwn%3A0x'$api
  sleep 1
done
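The script above hard-codes one tag and one WWN list. A slightly generalized sketch reads the WWNs from a file and takes the tag as an argument, reusing this document's example DMS address and credentials; with DRY_RUN=1 (the default) it prints the curl commands instead of calling the API:

```shell
#!/bin/sh
# Sketch: tag every drive WWN listed in a file with a given tag, using the
# same DMS endpoint and example credentials as the script above.
tag_drives() {
    tag=$1; wwn_file=$2
    DRY_RUN=${DRY_RUN:-1}    # set DRY_RUN=0 to actually call the API
    while read -r wwn; do
        [ -z "$wwn" ] && continue
        # Build the curl command as positional parameters:
        set -- curl -u '[email protected]:Admin11!' -k -X PATCH \
            --header 'Content-Type: application/vnd.drivescale.v2+json' \
            --header 'Accept: application/json' \
            -d "{\"tags\": [\"$tag\"]}" \
            "https://192.168.10.35/ds/entity/Drive/iqn.2013-04.com.drivescale%3Awwn%3A0x$wwn"
        if [ "$DRY_RUN" = 1 ]; then echo "$@"; else "$@"; sleep 1; fi
    done < "$wwn_file"
}

# Demo with two of the WWNs from the list above:
printf '5000c500560df547\n5000c50003353ef7\n' > /tmp/wwns.txt
tag_drives 6000_1 /tmp/wwns.txt
```

This makes it easy to keep one WWN file per JBOD and apply a different tag to each, as step 2 requires.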

11.7 Creating Server Nodes and Clusters from templates

11.7.1 Creating Node and Cluster Template

1. Connect to the DMS UI, navigate to the Composer section in the left-hand panel, and click on the Node template.

2. From the top right corner, select Create template, fill in the template name, and click Save.

3. Create 3 templates for MasterNodes and 5 for DataNodes.

Notes:

• Create a DataNode or MasterNode template for each node with the following minimum requirements:

Data Nodes: Minimum requirements

d1: Drives: 16; use drives with all these tags: 6000_1; exclude drives with any of these tags: 6000_2 6020_1 6020_2

d2: Drives: 16; use drives with all these tags: 6000_2; exclude drives with any of these tags: 6000_1 6020_1 6020_2

d3: Drives: 16; use drives with all these tags: 6020_1; exclude drives with any of these tags: 6000_1 6000_2 6020_2

d4/d5: Drives: 16; use drives with all these tags: 6020_2; exclude drives with any of these tags: 6000_1 6000_2 6020_1

m1: Drives: 16; use drives with all these tags: 6000_1; exclude drives with any of these tags: 6000_2 6020_1 6020_2

m2: Drives: 16; use drives with all these tags: 6000_2

m3: Drives: 16; use drives with all these tags: 6020_1

• Customers can change the disk count, drive RPM, CPU, or RAM according to their cluster requirements.


4. From the left-hand panel, navigate to the Composer section and click on the Cluster template.

5. From the top right corner, select Create template and fill in the template name as "CDH_CERT_TEMPLATE". Click on "Add new node type(s)" and select all the data and master node templates.

6. For the newly created cluster template, click on "Edit" and set Min/Max instances to 1 for all the previously created data node and master node templates.


7. Select "Save" after editing the Min/Max instances for all the node templates.

11.7.2 Creating Cluster Template

1. From the left-hand panel, navigate to the Composer section and click on the Cluster template.

2. From the top right corner, select Create template and fill in the template name as "CDH_CERT_TEMPLATE", with Data Nodes (mounted with 16 disks) and 3 Master Nodes (mounted with 4 disks) based on the Cluster Template "CD".

3. Select Create.


11.7.3 DriveScale Cluster Verification

1. From the left-hand panel navigate to the Explorer section and click on the Logical section. Verify the cluster status.

2. Click on the Details tab on top right corner.

3. Select the Cluster_CD cluster and check for the details.


12. Appendix C: Cloudera Manager Install

Notes:

- Cloudera Manager install was performed on Master Node 3 (u34.data1.r3.hq.drivescale.com).

- Make sure NTP is installed and configured, and that time is in sync on all nodes.

- Make sure iptables is stopped and SELinux is disabled.

- Make sure the OS has access to the Internet to download the Cloudera packages.

- Make sure all nodes have their correct FQDN HOSTNAME in /etc/sysconfig/network.

- Make sure that /etc/hosts on the Cloudera Manager host has an entry as follows:

192.168.69.16 u34.data1.r3.hq.drivescale.com
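The pre-install notes above can be checked mechanically before launching the installer. A hedged sketch; the individual check commands are CentOS 6-era assumptions, so adjust them for your distribution (a missing tool is reported as FAIL rather than aborting):

```shell
#!/bin/sh
# Sketch: turn the pre-install notes into PASS/FAIL checks.
check() {  # usage: check "label" command...
    label=$1; shift
    if "$@" >/dev/null 2>&1; then echo "PASS: $label"; else echo "FAIL: $label"; fi
}

preflight() {
    check "hostname resolves to an FQDN" sh -c 'hostname -f 2>/dev/null | grep -q "\."'
    check "SELinux not enforcing"        sh -c '[ "$(getenforce 2>/dev/null)" != Enforcing ]'
    check "ntpd running"                 sh -c 'pgrep ntpd >/dev/null'
    check "iptables has no rules loaded" sh -c '! iptables -S 2>/dev/null | grep -qv "^-P"'
    check "Cloudera archive reachable"   sh -c 'ping -c 1 -W 2 archive.cloudera.com'
}
preflight
```

Run it on every node before step 1; any FAIL line points back at the corresponding note above.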

12.1 Cloudera Manager Installation Procedure for Reference Architecture

1. SSH to Master Node 3 and start the Cloudera Manager installation.

# ssh to u34:

[root@u34 ~]# yum install wget

[root@u34 ~]# wget http://archive.cloudera.com/cm5/installer/latest/cloudera-manager-installer.bin

[root@u34 ~]# chmod u+x cloudera-manager-installer.bin

[root@u34 ~]# sudo ./cloudera-manager-installer.bin

Accept the install and licensing prompts.

2. Open Web Browser: http://u34.data1.r3.hq.drivescale.com:7180.

Log in with the default username/password: admin/admin.


3. Accept the license agreement.

4. Select Cloudera Enterprise (60 days trial) or upload the Cloudera Enterprise license.


5. Click on Continue.

6. Discover all nodes via FQDN or IP address by adding the names or IPs and clicking on Search.


7. Select all the hosts for the cluster installation and click on Continue.

8. Select the preferred method of repository installation, the version of CDH, any additional parcels, etc., and click on Continue.


9. Select the Install Oracle Java Developer Kit option and click on Continue.

10. For this cluster, Single User Mode was not enabled. Click on Continue.


11. Enter the correct root credentials to connect to the nodes and click on Continue.

12. Wait for the cluster installation to complete.


Notes:

• In case of package install failures, remove the following packages on the nodes:

[root@dn ~]# rpm -e --nodeps --justdb glibc-common-2.12-1.192.el6.x86_64 --allmatches
[root@dn ~]# rpm -e --nodeps --justdb glibc-2.12-1.192.el6.x86_64 --allmatches
[root@dn ~]# rpm -e --nodeps --justdb gdbm-1.8.0-39.el6.x86_64 --allmatches

• In case the Java version running on the nodes is older, move Java from version 1.6 to 1.7:

[root@dn ~]# java -version
[root@dn ~]# cd /usr/java ; rm -f latest ; ln -s /usr/java/jdk1.7.0_67-cloudera latest

• Transparent Huge Page compaction can cause significant performance problems on all nodes. To disable it, run the following (and add the same commands to an init script such as /etc/rc.local so they are applied again on reboot):

[root@dn ~]# echo never > /sys/kernel/mm/transparent_hugepage/defrag
[root@dn ~]# echo never > /sys/kernel/mm/transparent_hugepage/enabled

13. Wait for the Parcels installation to complete.


14. Run the host inspector to verify the correctness of the packages. Click on Continue after the inspector runs successfully.


15. Select the services you would like to install on the nodes. For this setup, custom services were selected; the details are shown below.


16. Select the hosts that will run the various HBase, HDFS, Hue, Hive, and other services.


17. For this reference architecture, we are using the "Embedded Database". Copy the usernames and passwords for all the different databases. Click on Test Connection and ensure that it is successful, then click on Continue. Note that Cloudera recommends using an external database for production environments.


18. Review all the changes for the setup and ensure all the details are correct. Click on Continue.


19. After the installation completes, go to the main page and enable the services. The steps are listed below; all actions are performed in Cloudera Manager (CM) > Cluster1.

a) Enable ZooKeeper HA (select the 3 Master Nodes) and start the roles.

b) Enable HDFS HA and select 2 Master Nodes for the NameNode and JournalNode roles:


Notes:

• Open HDFS permissions for HBase on all MASTER nodes:

[root@u34 ~]# sudo -u hdfs /opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/bin/hadoop fs -chmod 777 /

• Initialize the Postgres DB:

[root@u34 ~]# service postgresql initdb

[root@u34 ~]# /etc/init.d/postgresql start

• PGSQL policy:

[root@u34 ~]# cat /var/lib/pgsql/data/postgresql.conf | grep -e listen -e standard

[root@u34 ~]# vi /var/lib/pgsql/data/postgresql.conf

listen_addresses = '*'

standard_conforming_strings = off


[root@u34 ~]# vi /var/lib/pgsql/data/pg_hba.conf

host all all 0.0.0.0 0.0.0.0 md5
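The same postgresql.conf and pg_hba.conf edits can be applied non-interactively with sed instead of vi. A sketch that demonstrates against stand-in files; on the real node, set PGDATA=/var/lib/pgsql/data, drop the demo setup, and restart postgresql afterward:

```shell
#!/bin/sh
# Sketch: apply the postgresql.conf / pg_hba.conf edits above with sed.
# PGDATA defaults to a demo directory with stand-in files here; on the
# real node set PGDATA=/var/lib/pgsql/data and remove the demo setup.
PGDATA=${PGDATA:-/tmp/pg-demo}
mkdir -p "$PGDATA"
# Demo stand-ins for the real config files:
printf "#listen_addresses = 'localhost'\n#standard_conforming_strings = on\n" > "$PGDATA/postgresql.conf"
: > "$PGDATA/pg_hba.conf"

# Uncomment-and-set both parameters, keeping a .bak copy:
sed -i.bak \
    -e "s/^#*listen_addresses *=.*/listen_addresses = '*'/" \
    -e "s/^#*standard_conforming_strings *=.*/standard_conforming_strings = off/" \
    "$PGDATA/postgresql.conf"
echo "host all all 0.0.0.0 0.0.0.0 md5" >> "$PGDATA/pg_hba.conf"
grep -v '^#' "$PGDATA/postgresql.conf"
```

This keeps the change repeatable across the master nodes instead of relying on manual edits.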

• Install JDBC:

[root@u34 ~]# yum install postgresql-jdbc

[root@u34 ~]# mkdir /usr/lib/hive/lib/

[root@u34 ~]# ln -s /usr/share/java/postgresql-jdbc.jar /usr/lib/hive/lib/postgresql-jdbc.jar

[root@u34 ~]# /etc/init.d/postgresql restart

c) HIVE

- Create the PGSQL user for Hive (user: hive, pass: hive).

Connect to the PostgreSQL DB:

[root@u34 ~]# sudo -u postgres psql

postgres=# CREATE USER hive WITH PASSWORD 'hive';

postgres=# CREATE DATABASE hive;

postgres=# GRANT ALL PRIVILEGES ON DATABASE hive TO hive;

Test connection:

[root@u34 ~]# psql -h u34.data1.r3.hq.drivescale.com -U hive -d hive
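The same CREATE USER / CREATE DATABASE / GRANT pattern is used here for Hive and again below for Oozie and Hue. A small sketch that just emits the SQL for all three services; pipe its output to sudo -u postgres psql on the node:

```shell
#!/bin/sh
# Sketch: emit the CREATE USER / CREATE DATABASE / GRANT statements used
# for Hive here (and for Oozie and Hue in the following sections).
emit_service_sql() {
    for svc in "$@"; do
        printf "CREATE USER %s WITH PASSWORD '%s';\n" "$svc" "$svc"
        printf "CREATE DATABASE %s;\n" "$svc"
        printf "GRANT ALL PRIVILEGES ON DATABASE %s TO %s;\n" "$svc" "$svc"
    done
}
emit_service_sql hive oozie hue
# On the node: emit_service_sql hive oozie hue | sudo -u postgres psql
```

This avoids re-typing the three statements interactively for each service.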

- Go to CM -> Cluster1 -> HIVE -> Configuration -> Metastore DB.

Change the default port from 7423 to 5432.

DB user: hive

DB pass: hive


- Go to CM -> Cluster1-> Hive

Action: Create MetaStore Database

Action: Start all Hive services

d) OOZIE:

- Create the PGSQL user for Oozie (user: oozie, pass: oozie).

Connect to the PostgreSQL DB:

[root@u34 ~]# sudo -u postgres psql

postgres=# CREATE USER oozie WITH PASSWORD 'oozie';

postgres=# CREATE DATABASE oozie;

postgres=# GRANT ALL PRIVILEGES ON DATABASE oozie TO oozie;

Test connection:

[root@u34 ~]# psql -h u34.data1.r3.hq.drivescale.com -U oozie -d oozie

- Go to CM -> Cluster1 -> OOZIE -> Configuration -> OOZIE DB.

Change the default port from 7423 to 5432.

DB user: oozie

DB pass: oozie


- Go to CM -> Cluster1-> OOZIE

Action: Create OOZIE Database

Action: Start all OOZIE services

e) HUE

- Create the PGSQL user for Hue (user: hue, pass: hue).

Connect to the PostgreSQL DB:

[root@u34 ~]# sudo -u postgres psql

postgres=# CREATE USER hue WITH PASSWORD 'hue';

postgres=# CREATE DATABASE hue;

postgres=# GRANT ALL PRIVILEGES ON DATABASE hue TO hue;

Test connection:

[root@u34 ~]# psql -h u34.data1.r3.hq.drivescale.com -U hue -d hue

- Go to CM -> Cluster1 -> HUE -> Configuration -> HUE DB.

Change the default port from 7423 to 5432.

DB user: hue

DB pass: hue

- Go to CM -> Cluster1-> HUE

Action: Sync Database

Action: Start all HUE services


f) SOLR

- Go to CM -> Cluster1-> SOLR

Action: Initialize SOLR

Action: Start all SOLR services

g) SPARK

- Go to CM -> Cluster1-> SPARK

Action: Install SPARK Jar

Action: Create SPARK History Log Directory

Action: Start all SPARK services

h) Ensure all the services are up and running for the cluster.

- Go to CM -> Cluster1 -> Services and Hosts Status

DriveScale, Inc. 1230 Midas Way, Suite 210, Sunnyvale, CA 94085

Main: +1 (408) 849-4651 www.drivescale.com

©2017 DriveScale Inc. All Rights Reserved.

WP.201703.02.01
