Reference Architecture for IBM Cloud Pak for Data with Lenovo ThinkSystem Servers and Storage

Xiaotong Jiang

Xifa Chen

Weixu Yang

Lin Xu

Last update: 14 October 2019 Version 1.0

Solution based on ThinkSystem SR650 and SR630 servers and ThinkSystem DE6000F and DM5000F storage

Provides an overview of Cloud Pak for Data on top of OpenShift

Describes the reference architecture for Cloud Pak for Data

Describes a modern end-to-end data and analytics platform

Table of Contents

1 Introduction
2 Business problem and business value
    2.1 Business Problem
    2.2 Business Value
3 Requirements
    3.1 Functional requirements
    3.2 Non-functional requirements
4 Architectural overview
    4.1 IBM Cloud Pak for Data
    4.2 Red Hat OpenShift
5 Component model
    5.1 Base & Core
    5.2 Infra & Admin
    5.3 Add-ons and Db2 warehouse
    5.4 Dashboard
6 Operational model
    6.1 Networking
        6.1.1 Cluster network
        6.1.2 Hardware management network
    6.2 Systems management
    6.3 Cloud Pak for Data on OpenShift deployment
        6.3.1 Pre-requisites
        6.3.2 Deployment Example
7 Deployment considerations
    7.1 Hardware description
        7.1.1 Lenovo ThinkSystem SR650 Server
        7.1.2 Lenovo ThinkSystem SR630 Server
        7.1.3 Lenovo ThinkSystem DE6000F All Flash Storage Array
        7.1.4 Lenovo ThinkSystem DM5000F Unified Flash Storage Array
        7.1.5 Lenovo RackSwitch G8052
        7.1.6 Lenovo ThinkSystem NE1032/NE1032T Rack Switch
        7.1.7 Lenovo RackSwitch NE10032 - Cross-Rack Switch
    7.2 Performance considerations
8 Appendix: Lenovo Bill of materials
    8.1 Server BOM
    8.2 Networking BOM
    8.3 Storage BOM
Resources
Document history

1 Introduction

This document describes the reference architecture for IBM Cloud Pak for Data on top of Red Hat OpenShift on Lenovo ThinkSystem servers and storage. It provides a multi-cloud, end-to-end (E2E) data and AI infrastructure for customers, along with an integrated and flexible workflow for processing data that helps customers integrate and unlock the value of all their data. This reference architecture provides planning, design, and deployment considerations for implementing Cloud Pak for Data with Lenovo products.

With the ever-increasing volume, variety and velocity of data available to an enterprise comes the challenge of deriving the most value from it. This task requires data collection from multiple sources, suitable data management, flexible and extensible data processing, and easy deployment of data models for inference. Cloud Pak for Data brings the power of AI to the enterprise. Cloud Pak for Data is an all-in-one multi-cloud data and AI platform that is containerized and deployed on top of OpenShift, on either on-premises or public cloud infrastructure, to provide a secure environment for data collection, organization, and analysis. Cloud Pak for Data expands and enhances this technology to withstand the demands of your enterprise, adding management, security, governance, and analytics features. The result is a more enterprise-ready solution for complex, large-scale analytics.

OpenShift brings a containerized platform to Cloud Pak for Data with many benefits that cannot be obtained on physical infrastructure or in the cloud. Containerization simplifies the management of your big data and AI infrastructure, enables faster time to results and makes it more cost effective. It is a proven software technology that makes it possible to run multiple isolated applications on the same server at the same time, with the containers sharing the kernel of the host operating system. Containerization can increase IT agility, flexibility, and scalability while creating significant cost savings. Workloads are deployed faster, performance and availability increase, and operations become automated, resulting in IT that is simpler to manage and less costly to own and operate.

This reference architecture is intended for IT professionals, technical architects, sales engineers, and consultants to assist in planning, designing, and implementing the Cloud Pak for Data solution with Lenovo hardware. Knowledge of common big data processing, containers, OpenShift and cloud computing is helpful. For more information about Cloud Pak for Data, see “Resources”.

2 Business problem and business value

2.1 Business Problem

The world is well on its way to generating more than 40 million TB of data by 2020. Businesses must be able to keep pace with the demand for resources in order to benefit. This data comes from everywhere, including sensors that are used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone global positioning system (GPS) signals. Research agency Burning Glass Technologies, in association with IBM, predicts that demand for data scientists will grow by 28% by 2020 and that academic institutions will not be able to fulfill the demand [1]. Meanwhile, data scientists are busy juggling multiple tools and spend much of their time managing, protecting, and collecting multi-source data. According to Forbes, data scientists spend approximately 80% of their time preparing data for analysis. The gap between the huge volume of big data to prepare and manage and the limited time and energy of data scientists calls for a more efficient integrated data and analytics platform.

Per M&I sources, 85% of organizations are committed to a multi-cloud strategy. Investments in cloud technology and resources are also on the rise, according to the same sources. A majority of respondents also said they plan to maintain or increase their investment in cloud over the next two years, including both internal private and external public cloud.

Decision points for moving applications to a multi-cloud environment include:

• performance of the application
• compliance and security regulations
• availability/resiliency requirements
• total cost savings
• faster delivery time

A containerized data and analytics platform that can be deployed in an enterprise multi-cloud environment helps meet these business requirements.

2.2 Business Value

IBM Cloud Pak for Data is a modern data and analytics platform with built-in governance. Cloud Pak for Data enables data analytics and AI capability in customers’ own on-premises (on-prem) environment. It can be deployed on top of an OpenShift container platform. Customers have the flexibility to collect, process and analyze data on-prem and then decide whether to deploy an AI model on-prem or in the public cloud. Cloud Pak for Data simplifies and unifies how you collect, organize and analyze data to accelerate the value of data science and AI. This multi-cloud platform delivers a broad range of core data micro-services, with the option to add more from a growing services catalog. It lets users experience greater flexibility, security and control, and the benefits of the cloud without having to move data.

[1] The Quant Crunch - How the Demand for Data Science Skills is Disrupting the Job Market, Burning Glass Technologies, 2017

One of the key on-prem components required to deploy Cloud Pak for Data is the infrastructure platform. Lenovo has partnered with IBM to verify the Lenovo ThinkSystem platform for Cloud Pak for Data. Lenovo ThinkSystem servers integrated with OpenShift provide a flexible, secure and scalable container platform. ThinkSystem servers have a rich set of configurable options depending upon the data workload and business needs. Together with Lenovo ThinkSystem infrastructure, Cloud Pak for Data provides an end-to-end (E2E) on-prem cloud solution for data analytics and speeds up the extraction of business value from multi-source enterprise data.

3 Requirements

The functional and non-functional requirements for this reference architecture are described in this section.

3.1 Functional requirements

A modern data and analytics solution supports the following key functional requirements:

• Unified platform
  o A single platform that integrates data management, data governance and analysis for greater efficiency and improved use of resources. Enable self-service collaboration across teams.

• AI-ready
  o Manage end-to-end data workflows to help ensure that data is easily accessible for AI. Make sure that your data is high-quality to deliver accurate, automated insights and decisions. Seamlessly build and manage machine learning models across development and production in a collaborative environment.

• Cloud-native agility
  o Accelerate application development and deployment with a multicloud data platform that is agile, resilient and portable. Benefit from Kubernetes containerization to provision and scale services in minutes, instead of months, inside a more secure, governed environment.

• Data virtualization
  o Query data easily and more securely across multiple sources, on cloud or on premises. Exploit the combined processing power of those sources for massive query acceleration and achieve the speed and scalability your business needs for today's and tomorrow’s workloads.

• Extensible APIs
  o Use Cloud Pak for Data API in your applications to accelerate implementations and deliver significant business value.

• Customized workflow
  o Provision preferred data services flexibly and rapidly and customize data workflows to your individual needs.

• Continuous intelligence
  o Develop real-time streaming applications and deliver continuous intelligence across your business. With IBM Streams on IBM Cloud Pak for Data, you can enable continuous and rapid analysis of massive volumes of data in motion or at rest.

3.2 Non-functional requirements

Customers require their big data solution to be easy, dependable, fast, and secure. The following non-functional requirements are key:

• Easy:
  o Ease of development
  o Easy management at scale
  o Advanced job management
  o Multi-tenancy
  o Easy access to data by various user types

• Dependable:
  o Data protection with snapshot and mirroring
  o Automated self-healing
  o Insight into software/hardware health and issues
  o High availability (HA) and business continuity

• Fast:
  o Superior performance
  o Scalability

• Secure and governed:
  o Strong authentication and authorization
  o Kerberos support
  o Data confidentiality and integrity
  o Efficiently respond to changing regulations with embedded, sophisticated governance capabilities; these include automated discovery and classification of data, masking of sensitive data, data zones and data lifecycle management

4 Architectural overview

4.1 IBM Cloud Pak for Data

This chapter gives an architectural overview of IBM Cloud Pak for Data. Figure 1 gives a high-level overview of the multi-cloud architecture of Cloud Pak for Data.

Figure 1 Multi-cloud Architecture of Cloud Pak for Data

In this architecture, the Cloud Pak for Data admin console provides a unified control plane for user management, data management, data governance, data analysis, and business analysis in multiple locations: on the public cloud and on premises in the data center. In this sense, the on-prem cluster is essentially an extension of the public cloud. The Cloud Pak for Data admin console provides centralized configuration and security management across the clusters. This centralized control provides a consistent mechanism to manage distributed data analytic clusters, configuration policy, and security.

The deployment of Cloud Pak for Data on-prem requires an OpenShift cluster, which provides containerized compute and storage. In this document, Cloud Pak for Data on-prem clusters are deployed as pods running on top of the OpenShift cluster. This simplifies Cloud Pak for Data deployments because you do not need dedicated hosts for implementing Cloud Pak for Data clusters. Instead, multiple applications can be installed on the same cluster. See Figure 2 for the architecture of the Cloud Pak for Data on-prem clusters when deployed on OpenShift.

Figure 2 Architecture of the Cloud Pak for Data on OpenShift

There are three main function planes:

• Admin/User control plane – This plane provides the overall administration functions. It allows customers to manage users and user permissions, create projects, govern data quality across the organization, monitor and manage active analytics environments and resources, deploy models and expose endpoints, and so on.

• Data/AI analytic plane – This plane is a powerful computational engine. Users can create a Python, R, or Scala notebook-based project, create a connection to a data source, and transform and analyze data by using this platform.

• Add-on plane – Cloud Pak for Data includes a catalog of add-ons that customers can use to extend the functionality of Cloud Pak for Data. The catalog includes the following types of add-ons: AI, Analytics, Dashboards, Data governance, Data sources, Developer tools, Industry accelerators, and Storage.

4.2 Red Hat OpenShift

The Red Hat OpenShift Container Platform is a complete container application platform that provides all aspects of the application development process in one consistent solution across multiple infrastructure footprints. OpenShift integrates all of the architecture, processes, platforms, and services needed to help development and operations teams traverse traditional siloed structures and produce applications that help businesses succeed. Cloud Pak for Data is natively built on the Red Hat OpenShift Container Platform. Containers built to run on any OpenShift environment can migrate seamlessly to Cloud Pak for Data, whether on any OpenShift cloud or in the Cloud Pak for Data System.

Figure 3 shows the high-level architecture of the Red Hat OpenShift Container Platform and its core building blocks. OpenShift is a platform designed to orchestrate containerized workloads across a cluster of nodes. The system uses Kubernetes as the core container orchestration engine, which manages the Docker container images and their lifecycle.

Figure 3 Red Hat OpenShift Container Platform Architecture

The physical configuration of the OpenShift platform is based on the Kubernetes cluster architecture. The master node is the primary node on which the Kubernetes scheduler, along with the distributed cluster data store (etcd), the REST API services, and other associated management services run. In a production environment, you need to ensure high availability of the master services by replicating the services to multiple physical servers and implementing monitoring and load-balancing services such as Keepalived and HAProxy. The infrastructure nodes can be used in a production setting to implement such services.

Application nodes (shown simply as Node in the diagram) run the users' containerized applications on top of the Docker container environment. With OpenShift, you can easily write and deploy applications knowing that they’ll run on a platform optimized for Red Hat OpenShift. When choosing to deploy a private cloud on-premises, Cloud Pak for Data System provides optimized hardware to increase the container performance of the Red Hat cluster while speeding the time to value of data workloads.

5 Component model

Cloud Pak for Data provides features and capabilities that meet the functional and non-functional requirements of customers. It supports the need for an end-to-end solution for data and analytics within an enterprise across different industries, such as financial services, retail, media, healthcare, manufacturing, telecommunications, and government organizations. One of its design principles was to help organizations access a vast array of data sources on-premises and in the cloud, all while applying deep data management and analytics within a private cloud setting.

Cloud Pak for Data enables users to connect to data (no matter where it lives), govern it, virtualize it, and use it for analysis. Cloud Pak for Data also enables all of your data users to collaborate from a single, unified interface, so your IT department doesn't need to deploy and connect multiple applications.

The cloud-native, hyper-converged architecture of Cloud Pak for Data consists of a set of core software components that can run on-prem in the data center or in the public cloud. Together, the components provide all the services required to explore, profile, transform, and analyze data from a single web application across the different clouds. They also provide the common policy framework and centralized configuration management needed to catalog, manage and govern users’ data.

Figure 4 Cloud Pak for Data Components

As shown in Figure 4, Cloud Pak for Data is composed of four main logical components. The following sections briefly describe these components.

5.1 Base & Core

Cloud Pak for Data features data collection, data virtualization, data governance, and data processing practices at its core, which can be applied to applications. It also supports auditing data access and verifying access privileges. Cloud Pak for Data allows administrators to configure, collect, and view audit events, and generate reports. Cloud Pak for Data tracks access permissions and actual accesses to all entities in the enterprise.

Cloud Pak for Data APIs facilitate programmatic management of users and their access control, along with user account management. They can be used to interact with your governance metadata to manage assets, custom asset types, and the associations between them. They provide the capability to manage analytics projects and the collaborative use of assets (notebooks, scripts, datasets), allowing users to quickly and repeatedly harvest insight from the data, including through automated job scheduling. The APIs also automate deployment (from development to production) and help maintain machine learning models, making them accessible through HTTP endpoints on the platform.

5.2 Infra & Admin

Cloud Pak for Data enables administrators to create and delete users, modify user profiles and grant privileges to users. Cloud Pak for Data can connect to an LDAP server for admission control, using a custom SSL or TLS certificate for HTTPS connections to the web client.

Cloud Pak for Data Infra & Admin sets the standard for enterprise deployment by delivering granular visibility into and control over every part of the data and AI jobs, which empowers operators to improve performance, enhance quality of service, increase compliance, and reduce administrative costs. Cloud Pak for Data makes administration of your enterprise data processing and AI jobs simple and straightforward, at any scale.

Cloud Pak for Data monitors a number of performance and health metrics for services and role instances that are running on your clusters. These metrics are monitored against configurable thresholds and can be used to indicate whether a cluster is functioning as expected. You can view these metrics in the web client, which displays metrics about jobs, pods, services, clusters and so on.
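As a rough illustration of the threshold-based monitoring described above, the following Python sketch compares sampled metrics with configurable thresholds. The metric names, the threshold values and the function `evaluate_metrics` are hypothetical; they are not part of Cloud Pak for Data or its API and only show the general idea.

```python
from typing import Dict, List

# Hypothetical thresholds; real values would come from the cluster configuration.
THRESHOLDS: Dict[str, float] = {
    "cpu_utilization_percent": 85.0,
    "memory_utilization_percent": 90.0,
    "pod_restart_count": 5.0,
}


def evaluate_metrics(samples: Dict[str, float],
                     thresholds: Dict[str, float]) -> List[str]:
    """Return the sorted names of metrics whose sampled value exceeds its threshold."""
    breached = []
    for name, limit in thresholds.items():
        value = samples.get(name)
        # A metric without a sample is skipped; it is neither flagged nor confirmed healthy.
        if value is not None and value > limit:
            breached.append(name)
    return sorted(breached)


if __name__ == "__main__":
    sample = {
        "cpu_utilization_percent": 91.2,
        "memory_utilization_percent": 40.0,
        "pod_restart_count": 7,
    }
    print(evaluate_metrics(sample, THRESHOLDS))
```

An empty result would indicate that every sampled metric is within its configured limit.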

Cloud Pak for Data deploys and integrates several types of database and message systems. Cloudant is a distributed database that is optimized for handling heavy workloads that are typical of large, fast-growing web and mobile apps. Available as an SLA-backed, fully managed cloud and on-prem service, Cloudant elastically scales throughput and storage independently. Kafka is a distributed commit log service. Kafka functions much like a publish/subscribe messaging system, but with better throughput, built-in partitioning, replication, and fault tolerance. Kafka is a good solution for large-scale message processing applications. InfluxDB is a time series database that is used to store log, sensor and other data over a period of time. InfluxDB has seen significant traction and is known for its simplicity and ease of use, along with its ability to perform at scale.

5.3 Add-ons and Db2 warehouse

Cloud Pak for Data allows customers to extend its functionality with add-ons and integrations. Add-ons are services that are deployed in your Cloud Pak for Data cluster. Integrations are connections to applications that run outside of your Cloud Pak for Data cluster.

Cloud Pak for Data includes a catalog of add-ons: AI, Analytics, Dashboards, Data governance, Data sources, Developer tools, Industry accelerators, and Storage. For more information about add-ons, see this website: https://docs-icpdata.mybluemix.net/docs/.../com.ibm.icpdata.doc/zen/admin/add-ons.html

Cloud Pak for Data has two integrations. Customers can audit sensitive data and synchronize data by integrating IBM Guardium and StoredIQ with Cloud Pak for Data. For more information about the integrations, see this website:

https://docs-icpdata.mybluemix.net/docs/.../com.ibm.icpdata.doc/zen/admin/integrations.html

IBM Db2 Warehouse is a software-defined data warehouse that supports Docker container technology, and it can be deployed as an add-on database in the cluster. This data warehousing approach is client-managed and optimized for fast and flexible deployment, with automated scaling to meet agile analytic workloads. With IBM Db2 Warehouse, you have control over your data and applications without having to handle complex database deployment and management tasks. Based on the number of worker nodes selected, IBM Cloud Pak for Data automatically creates the appropriate data warehouse environment. For more information about IBM Db2 Warehouse as an add-on, see this website:

https://www.ibm.com/support/knowledgecenter/SSQNUZ_2.1.0/com.ibm.icpdata.doc/zen/admin/work-with-db-db2wh.html#work-with-db-db2wh

5.4 Dashboard

Cloud Pak for Data has a web-based dashboard. From the dashboard, users can perform administration, data management, data governance and analysis in a unified web-based UI. Figure 5 shows the web client dashboard of Cloud Pak for Data.

Figure 5 Dashboard of Cloud Pak for Data

6 Operational model

This section describes the operational model for the Cloud Pak for Data reference architecture. As described in the previous chapters, Cloud Pak for Data is deployed on top of OpenShift. In this reference architecture, we deployed Cloud Pak for Data in an OpenShift cluster on top of the Lenovo ThinkSystem platform. Figure 6 shows the Cloud Pak for Data deployment architecture with the ThinkSystem certified nodes.

The on-prem deployment consists of a single OpenShift cluster with ThinkSystem servers and storage. Each host system runs the OpenShift components on the Red Hat Enterprise Linux operating system. The OpenShift cluster provides persistent storage via persistent volume claims (PVCs) with local or external storage.

An OpenShift cluster consists of two types of nodes: Master and Node. An OpenShift Master provides the core API and management services for the Kubernetes cluster. An OpenShift Node provides the container runtime environment and the compute, storage, and network services for the Kubernetes cluster. OpenShift Master and Node use ThinkSystem SR630/SR650 servers. Worker nodes use ThinkSystem SR650 servers with locally attached storage or external ThinkSystem DE6000F/DM5000F storage arrays.

Figure 6 Cloud Pak for Data deployment Architecture with the ThinkSystem Nodes
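The persistent volume claim (PVC) mechanism mentioned above can be illustrated with a minimal Python sketch that builds a Kubernetes PersistentVolumeClaim manifest and prints it as JSON. The claim name, the size and the storage class name are hypothetical placeholders, not values from this reference architecture; a real storage class depends on how the local or external ThinkSystem storage is exposed to the cluster.

```python
import json
from typing import Any, Dict


def build_pvc(name: str, size: str, storage_class: str,
              access_mode: str = "ReadWriteOnce") -> Dict[str, Any]:
    """Build a Kubernetes PersistentVolumeClaim manifest as a plain dict."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            "accessModes": [access_mode],
            "storageClassName": storage_class,
            "resources": {"requests": {"storage": size}},
        },
    }


if __name__ == "__main__":
    # Hypothetical claim name, size and storage class, for illustration only.
    pvc = build_pvc("cpd-example-data", "100Gi", "example-external-storage")
    print(json.dumps(pvc, indent=2))
```

Because Kubernetes accepts manifests in JSON as well as YAML, the printed output could be saved to a file and submitted to a cluster with a command such as `oc apply -f`, assuming a matching storage class exists.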

This Cloud Pak for Data reference architecture is implemented on a set of pods on OpenShift that make up a cluster. A Cloud Pak for Data cluster consists of two types of logical nodes: Master and Worker.

Master nodes and Worker nodes run the following types of services:

• Management control services for coordinating and managing the cluster
• Miscellaneous and optional services for file and web serving
• Data services for collecting, organizing, analyzing and infusing data

A Cloud Pak for Data deployment consists of two Master nodes for high availability, and three or more Worker nodes.

Cloud Pak for Data is deployed as a set of microservice applications on the OpenShift platform. Figure 7 shows some of the Cloud Pak for Data component applications.

Figure 7 Cloud Pak for Data Components Applications on OpenShift platform

IBM Db2 Warehouse is deployed as an add-on database in the cluster. Db2 Warehouse is designed to provide organizations with the highly flexible architecture that is needed in the dynamic, fast-moving world of big data and cloud computing. Db2 Warehouse leverages external ThinkSystem DE6000F/DM5000F arrays as backend storage. For a single node, the warehouse uses a symmetric multiprocessing (SMP) architecture for cost-efficiency. For two or more nodes, the warehouse is deployed using a massively parallel processing (MPP) architecture for high availability and improved performance. Figure 8 shows a dashboard of Db2 Warehouse deployed as an add-on in a Cloud Pak for Data cluster.

Figure 8 Db2 Warehouse in Cloud Pak for Data Cluster

More details on the networking, hardware system management and deployment steps are described in the following sections.

6.1 Networking

The reference architecture specifies two networks: a high-speed cluster network and a management network. Two types of top-of-rack switches are required: one 1Gbps switch for out-of-band management, and a pair of 10Gbps switches that provide high availability for the cluster network. See Figure 9.

Figure 9 Cloud Pak for Data on OpenShift network

6.1.1 Cluster network

The cluster network (also referred to as the data network) creates a private cluster among multiple nodes. It is used for high-speed data transfer across nodes and for importing data into the cluster. The cluster network typically connects to the customer’s corporate data network. The recommended 10 GbE switch is the Lenovo RackSwitch™ NE1032.

The two 10GbE NIC ports of each node are aggregated into a single bonded network connection, giving 20Gbps of bandwidth. The two data switches are connected together as a Virtual Link Aggregation Group (vLAG) pair using LACP to provide switch redundancy. If one NE1032 switch drops out of the network, the other NE1032 continues transferring traffic. The switch pair is connected with dual 10Gbps links, called an inter-switch link (ISL), which maintains consistency between the two peer switches.

6.1.2 Hardware management network

The hardware management network is a 1GbE network for out-of-band hardware management. The recommended 1GbE switch is the Lenovo RackSwitch G8052 with 10Gbps SFP+ uplink ports. Through the XClarity™ Controller management module (XCC) within the ThinkSystem SR650 and SR630 servers, the out-of-band network enables hardware-level management of cluster nodes, such as node deployment, UEFI firmware configuration, hardware failure status and remote power control of the nodes.

6.2 Systems management

The Lenovo XClarity Administrator software provides centralized resource management that reduces management complexity, speeds up response, and enhances the availability of Lenovo® server systems and solutions. The Lenovo XClarity Administrator provides agent-free hardware management for Lenovo’s ThinkSystem® rack servers, System x® rack servers, and Flex System™ compute nodes and components, including the Chassis Management Module (CMM) and Flex System I/O modules.

Figure 10 shows the Lenovo XClarity™ Administrator interface in which servers, storage, switches and other rack components are managed and status is shown on the dashboard. Lenovo XClarity™ Administrator is a virtual appliance that is quickly imported into a virtualized server environment.

Figure 10 XClarity™ Administrator interface

In addition, xCAT is a scalable distributed computing management and provisioning tool that provides a unified interface for hardware control, discovery, and operating system deployment. It can be used to facilitate or automate the management of cluster nodes.

For more information, see: Lenovo XClarity Administrator Product Guide

6.3 Cloud Pak for Data on OpenShift deployment

6.3.1 Pre-requisites

To perform the initial configuration and installation of the Cloud Pak for Data clusters, you need a running OpenShift version 3.11 platform and Helm/Tiller version 2.9.1 with access to that OpenShift platform. You need to apply for an IBM Passport account and download the installation file for an IBM Cloud Private installation. You can then run the IBM Cloud Private installation file and download the Cloud Pak for Data installation file. In addition to obtaining the Cloud Pak for Data installation file, you also need to create a project in OpenShift and create the required security context constraint. Then, you need to create a cluster role binding and bind it to the default service account. Lastly, you need to log in to a Docker registry and create the Docker secret for an icp4d-anyuid service account. More detailed instructions to prepare a Cloud Pak for Data on OpenShift deployment can be found here:

https://www.ibm.com/support/knowledgecenter/SSQNUZ_current/com.ibm.icpdata.doc/zen/install/openshift-noicp.html
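
The version prerequisites above can be checked before starting an installation. The following is a minimal sketch in Python, not part of the IBM or Red Hat tooling: the function names are hypothetical, the version strings are assumed to have been collected by the deployer beforehand, and only plain dotted numeric versions (optionally prefixed with "v") are handled.

```python
# Versions named in the prerequisites above.
REQUIRED = {"openshift": "3.11", "helm": "2.9.1"}

def parse_version(text):
    """Turn a dotted version string such as 'v3.11.135' into a tuple of ints.

    Assumption: the string holds only digits and dots after an optional 'v'.
    """
    return tuple(int(part) for part in text.strip().lstrip("v").split("."))

def check_prerequisites(found):
    """Return a list of problems; an empty list means the versions match."""
    problems = []
    for component, required in REQUIRED.items():
        actual = found.get(component)
        if actual is None:
            problems.append(f"{component}: not found")
            continue
        want = parse_version(required)
        # Compare only as many fields as the requirement specifies,
        # so 3.11.135 satisfies a requirement of 3.11.
        if parse_version(actual)[:len(want)] != want:
            problems.append(f"{component}: found {actual}, expected {required}")
    return problems

# Hypothetical values as reported by an existing environment.
print(check_prerequisites({"openshift": "v3.11.135", "helm": "2.9.1"}))
print(check_prerequisites({"openshift": "4.2"}))
```

The check covers only the two version requirements; the project, security context constraint, role binding, and registry secret still have to be created by following the linked instructions.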

6.3.2 Deployment Example

To configure and install the Cloud Pak for Data platform on OpenShift, the cluster deployer first needs to deploy an OpenShift platform on top of the ThinkSystem servers, and on the ThinkSystem storage platform as well if the customer needs external storage for the storage service. The cluster deployer uses these servers and storage to create an OpenShift platform with external storage support. An OpenShift cluster with the high availability (HA) feature enabled is recommended for production environments to ensure that the cluster has no single point of failure. In this reference architecture, the cluster uses the native HA method and consists of three masters, one HAProxy load balancer, three worker nodes, and one storage node.

Table 1 Deployment Example

Node type | Quantity | Node role
Load Balancer | 1 | OpenShift HAProxy
Master | 3 | OpenShift API master, Kubernetes scheduler, etcd
Node | 3 | Runs the application containers
Storage | 1 | For external storage service

Note: The cluster deployer can deploy the Load Balancer and Master nodes as VMs to reduce cost.

Table 2 provides a minimum node configuration summary for a deployment example on Lenovo ThinkSystem Platform.

Table 2 Minimum Node Configuration

Node | CPU | Memory | Hard Disk | Network Adapter
Master | 4 CPU(s) | 32GB | 100GB | 2x 10G NIC
Load Balancer | 4 CPU(s) | 32GB | 100GB | 2x 10G NIC
Node | 32 CPU(s) | 256GB | 300GB | 2x 10G NIC

Note: The minimum configuration of the external storage node depends on the customer’s specific requirements.
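
As a worked example, the node counts in Table 1 and the per-node minimums in Table 2 can be combined into a cluster-wide minimum. The sketch below is illustrative only: the `NodeSpec` class and function name are hypothetical, and the storage node is left out because its configuration depends on customer requirements.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class NodeSpec:
    cpus: int
    memory_gb: int
    disk_gb: int

# Table 2: minimum configuration per node type.
MINIMUM = {
    "Master": NodeSpec(cpus=4, memory_gb=32, disk_gb=100),
    "Load Balancer": NodeSpec(cpus=4, memory_gb=32, disk_gb=100),
    "Node": NodeSpec(cpus=32, memory_gb=256, disk_gb=300),
}

# Table 1: node counts (the storage node is excluded).
QUANTITY = {"Master": 3, "Load Balancer": 1, "Node": 3}

def cluster_minimum(minimum, quantity):
    """Sum the per-node minimums, weighted by the number of nodes of each type."""
    return NodeSpec(
        cpus=sum(minimum[kind].cpus * n for kind, n in quantity.items()),
        memory_gb=sum(minimum[kind].memory_gb * n for kind, n in quantity.items()),
        disk_gb=sum(minimum[kind].disk_gb * n for kind, n in quantity.items()),
    )

total = cluster_minimum(MINIMUM, QUANTITY)
print(total)  # NodeSpec(cpus=112, memory_gb=896, disk_gb=1300)
```

For the example deployment this gives 112 CPUs, 896 GB of memory, and 1300 GB of disk across the masters, load balancer, and worker nodes.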

For the network, two Lenovo ThinkSystem NE1032 10Gbps switches and one Lenovo RackSwitch G8052 1Gbps switch are deployed in the cluster.

The cluster administrator can scale up the cluster later by adding more nodes, or create additional clusters, based on requirements. This reference architecture provides a recommended mid-sized, production-level hardware configuration based on a rough workload profile estimate. The bill of materials for this configuration is provided in section 8.

More detailed installation instructions for Cloud Pak for Data on OpenShift deployment can be found here:

https://www.ibm.com/support/knowledgecenter/en/SSQNUZ_2.1.0/com.ibm.icpdata.doc/zen/install/openshift-noicp.html

More detailed installation instructions for Db2 Warehouse as Cloud Pak for Data add-on can be found here:

https://www.ibm.com/support/knowledgecenter/en/SSQNUZ_2.1.0/com.ibm.icpdata.doc/zen/admin/install-data-source-add-ons.html

7 Deployment considerations

This section describes considerations for deploying the Cloud Pak for Data solution with ThinkSystem servers and storage.

7.1 Hardware description

This section describes the hardware components and options used to implement the Cloud Pak for Data platform.

7.1.1 Lenovo ThinkSystem SR650 Server

Lenovo ThinkSystem SR650 is an ideal 2-socket 2U rack server for small businesses up to large enterprises that need industry-leading reliability, management, and security, as well as maximum performance and flexibility for future growth. The SR650 server is particularly suited for data processing, AI applications and business analytics due to its selection of high performance Intel processors, large internal memory and rich internal data storage. It is also designed to handle general workloads, such as databases, virtualization and cloud computing, virtual desktop infrastructure (VDI), enterprise applications, and collaboration/email.

The SR650 server supports:

• Up to two second-generation Intel® Xeon® Scalable Processors
• Up to 3.0 TB TruDDR4 memory
• Up to 24x 2.5-inch or 14x 3.5-inch drive bays with an extensive choice of NVMe PCIe SSDs, SAS/SATA SSDs, and SAS/SATA HDDs
• Flexible I/O Network expansion options with the LOM slot, the dedicated storage controller slot, and up to 6x PCIe slots

The Lenovo ThinkSystem SR650 Server is shown in the following figure.

Figure 11 Lenovo ThinkSystem SR650

Combined with second-generation Intel® Xeon® Scalable Processors (Bronze, Silver, Gold, and Platinum), the Lenovo SR650 server offers an even higher density of workloads and performance that lowers the total cost of ownership (TCO). Its pay-as-you-grow flexible design and great expansion capabilities solidify dependability for any kind of workload with minimal downtime.

The SR650 server provides high internal storage density in a 2U form factor with its impressive array of workload-optimized storage configurations. It also offers easy management and saves floor space and power consumption for most demanding use cases by consolidating storage and server into one system.

For more information, see the Lenovo ThinkSystem SR650 Product Guide:

https://lenovopress.com/lp1050-thinksystem-sr650-server-xeon-sp-gen2

7.1.2 Lenovo ThinkSystem SR630 Server

Lenovo ThinkSystem SR630 is an ideal 2-socket 1U rack server for small businesses up to large enterprises that need industry-leading reliability, management, and security, as well as maximum performance and flexibility for future growth. The SR630 server is designed to handle a wide range of workloads, such as databases, virtualization and cloud computing, virtual desktop infrastructure (VDI), infrastructure security, systems management, enterprise applications, collaboration/email, streaming media, web, and HPC.

Combined with second-generation Intel Xeon Scalable Processors (Xeon SP Gen 2), the SR630 server offers scalable performance and storage capacity. The SR630 server supports up to two processors, up to 2933 MHz memory speed, up to 3 TB of memory capacity with TruDDR4 DIMMs, up to 12x 2.5-inch or 4x 3.5-inch drive bays with an extensive choice of NVMe PCIe SSDs, SAS/SATA SSDs, and SAS/SATA HDDs, and flexible I/O expansion options with a LOM slot, a dedicated storage controller slot, and up to 3x PCIe slots. In addition, the SR630 with Xeon SP Gen 2 supports up to 7.5 TB of memory capacity with a combination of TruDDR4 DIMMs and Intel DC persistent memory modules (DCPMMs).

The SR630 server offers basic or advanced hardware RAID protection and a wide range of networking options, including selectable LOM, ML2, and PCIe network adapters. The next-generation Lenovo XClarity Controller, which is built into the SR630 server, provides advanced service processor control, monitoring, and alerting functions.

The Lenovo ThinkSystem SR630 Server is shown in the following figure.

Figure 12 Lenovo ThinkSystem SR630

For more information, see the Lenovo ThinkSystem SR630 Product Guide:

https://lenovopress.com/lp1049-thinksystem-sr630-server-xeon-sp-gen2

7.1.3 Lenovo ThinkSystem DE6000F All Flash Storage Array

Lenovo ThinkSystem DE6000F is a scalable, all flash mid-range storage system that is designed to provide high performance, simplicity, capacity, security, and high availability for medium to large businesses. The ThinkSystem DE6000F delivers enterprise-class storage management capabilities in a performance-optimized system with a wide choice of host connectivity options and enhanced data management features. The ThinkSystem DE6000F is a perfect fit for a wide range of enterprise workloads, including big data and analytics, video surveillance, technical computing, and other storage I/O-intensive applications.

ThinkSystem DE6000F models are available in a 2U rack form-factor with 24 small form-factor (2.5-inch SFF) drives (2U24 SFF) and include two controllers, each with 64 GB cache for a system total of 128 GB. Universal 10 Gb iSCSI or 4/8/16 Gb Fibre Channel (FC) ports provide base host connectivity, and the host interface cards provide additional 12 Gb SAS, 10/25 Gb iSCSI, or 8/16/32 Gb FC connections.

The ThinkSystem DE6000F Storage Array scales up to 192 solid-state drives (SSDs) with the attachment of Lenovo ThinkSystem DE240S 2U24 SFF Expansion Enclosures.

The Lenovo ThinkSystem DE6000F 2U24 SFF enclosure is shown in the following figure.

Figure 13 Lenovo ThinkSystem DE6000F

For more information, see the Lenovo ThinkSystem DE6000F Product Guide: https://lenovopress.com/lp0910-lenovo-thinksystem-de6000f-all-flash-storage-array

7.1.4 Lenovo ThinkSystem DM5000F Unified Flash Storage Array

Lenovo ThinkSystem DM5000F is a unified, all flash storage system that is designed to provide performance, simplicity, capacity, security, and high availability for medium enterprises. Powered by the ONTAP software, ThinkSystem DM5000F delivers enterprise-class storage management capabilities with a wide choice of host connectivity options and enhanced data management features. The ThinkSystem DM5000F is a perfect fit for a wide range of enterprise workloads, including big data and analytics, artificial intelligence, engineering and design, enterprise applications, and other storage I/O-intensive applications.

ThinkSystem DM5000F models are 2U rack-mount controller enclosures that include two controllers, 64 GB RAM and 8 GB battery-backed NVRAM (32 GB RAM and 4 GB NVRAM per controller), and 24 SFF hot-swap drive bays (2U24 form factor). Controllers provide universal 1/10 GbE NAS/iSCSI or 8/16 Gb Fibre Channel (FC) ports, or 1/10 GbE RJ-45 ports for host connectivity.

A single ThinkSystem DM5000F Storage Array scales up to 144 solid-state drives (SSDs) with the attachment of Lenovo ThinkSystem DM240S 2U24 SFF Expansion Enclosures.

Figure 14 Lenovo ThinkSystem DM5000F

For more information on Lenovo ThinkSystem DM5000F, visit this link:

https://lenovopress.com/lp0911-lenovo-thinksystem-dm5000f-unified-flash-storage-array

7.1.5 Lenovo RackSwitch G8052

The Lenovo networking RackSwitch G8052 is an Ethernet switch that is designed for the data center and provides a simple network solution. The Lenovo RackSwitch G8052 offers up to 48x 1 GbE ports and up to 4x 10 GbE ports in a 1U footprint. The G8052 switch is always available for business-critical traffic by using redundant power supplies, fans, and numerous high-availability features.

Figure 15 Lenovo RackSwitch G8052

Lenovo RackSwitch G8052 has the following characteristics:

• A total of 48x 1 GbE RJ45 ports
• Four 10 GbE SFP+ ports
• Low 130W power rating and variable speed fans to reduce power consumption

For more information, see the Lenovo RackSwitch G8052 Product Guide: https://lenovopress.com/tips1270-lenovo-rackswitch-g8052

7.1.6 Lenovo ThinkSystem NE1032/NE1032T Rack Switch

The Lenovo ThinkSystem NE1032/NE1032T RackSwitch family is a 1U rack-mount 10 Gb Ethernet switch that delivers lossless, low-latency performance with feature-rich design that supports virtualization, Converged Enhanced Ethernet (CEE), high availability, and enterprise class Layer 2 and Layer 3 functionality. The hot-swap redundant power supplies and fans (along with numerous high-availability features) help provide high availability for business sensitive traffic. These switches deliver line-rate, high-bandwidth switching, filtering, and traffic queuing without delaying data.

The NE1032 RackSwitch has 32x SFP+ ports that support 1 GbE and 10 GbE optical transceivers, active optical cables (AOCs), and direct attach copper (DAC) cables.

Figure 16 Lenovo ThinkSystem NE1032 RackSwitch

For more information, see the ThinkSystem NE1032 Product Guide.

The NE1032T RackSwitch has 24x 1/10 Gb Ethernet (RJ-45) fixed ports and 8x SFP+ ports that support 1 GbE and 10 GbE optical transceivers, active optical cables (AOCs), and direct attach copper (DAC) cables.

Figure 17 Lenovo ThinkSystem NE1032T RackSwitch

For more information, see the ThinkSystem NE1032T Product Guide.

7.1.7 Lenovo RackSwitch NE10032 - Cross-Rack Switch

The Lenovo ThinkSystem NE10032 RackSwitch uses 100 Gb QSFP28 and 40 Gb QSFP+ Ethernet technology and is specifically designed for the data center. It is ideal for today's big data workload solutions and is an enterprise class Layer 2 and Layer 3 full featured switch that delivers line-rate, high-bandwidth switching, filtering and traffic queuing without delaying data. Large data center-grade buffers help keep traffic moving, while the hot-swap redundant power supplies and fans (along with numerous high-availability features) help provide high availability for business sensitive traffic.

The NE10032 RackSwitch has 32x QSFP+/QSFP28 ports that support 40 GbE and 100 GbE optical transceivers, active optical cables (AOCs), and direct attach copper (DAC) cables. It is an ideal cross-rack aggregation switch for use in a multi-rack cluster.

Figure 18 Lenovo ThinkSystem NE10032 cross-rack switch

For further information on the NE10032 switch, visit this link:

https://lenovopress.com/lp0609-lenovo-thinksystem-ne10032-rackswitch

7.2 Performance considerations

The hardware required for on-prem Cloud Pak for Data on OpenShift clusters depends on the specific workload and user requirements. The sizing of the cluster therefore varies with the types of data processing workloads, the performance and scalability requirements, the number of container images expected to run, and the deployment type (test and development, staging, or production).

Because Cloud Pak for Data runs as a cluster of pods, the container pods execute inside the OpenShift platform. You therefore need to determine the right size of the physical machines that OpenShift runs on: CPUs, RAM, disk, and so on. For example, if you anticipate a production deployment of OpenShift with enterprise-level, multi-tier data applications (AI inference, business logic, database, and so on), you may need worker nodes with a large number of CPU cores and a large amount of memory. By contrast, in a test/dev environment the worker node configuration can be small.

You also need to determine how many Cloud Pak for Data clusters you will install, and then aggregate the totals to determine the physical resources needed to implement the clusters. This translates to a number of ThinkSystem servers and storage systems with specific physical CPU cores, core speed, physical memory, and disk.
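
The aggregation step described above can be sketched as a small calculation. This is an illustrative estimate in Python, not a sizing tool from IBM or Lenovo: the workload figures, the 25% headroom, and the function name are assumptions chosen for the example. The per-node capacity uses the minimum Node configuration from Table 2, and the floor of three workers follows the "three or more Worker nodes" guidance in section 6.

```python
import math

def workers_needed(demand_cpus, demand_mem_gb, node_cpus, node_mem_gb,
                   headroom=0.25, minimum_workers=3):
    """Estimate worker nodes for one cluster from aggregate CPU and memory demand.

    headroom is the fraction of each node kept free (an assumed value).
    """
    usable_cpus = node_cpus * (1 - headroom)
    usable_mem_gb = node_mem_gb * (1 - headroom)
    by_cpu = math.ceil(demand_cpus / usable_cpus)
    by_mem = math.ceil(demand_mem_gb / usable_mem_gb)
    # The scarcer resource decides, but never go below the minimum cluster size.
    return max(by_cpu, by_mem, minimum_workers)

# Hypothetical demand per cluster: (name, CPU cores, memory in GB).
clusters = [("production", 100, 1000), ("test-dev", 10, 40)]

total = 0
for name, cpus, mem_gb in clusters:
    count = workers_needed(cpus, mem_gb, node_cpus=32, node_mem_gb=256)
    total += count
    print(f"{name}: {count} worker nodes")
print(f"total worker nodes: {total}")
```

With these assumed figures the production cluster is memory-bound (6 workers) and the test/dev cluster falls back to the three-worker minimum, for 9 worker nodes in total.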

Processor Selection

Cloud Pak for Data workload types may be skewed toward IO-bound workloads that create heavy network traffic, or toward CPU-bound workloads that stress the CPU itself. The Intel Gold processors in this reference architecture provide a ratio of 2 processor cores per drive, which gives the maximum drive throughput plus a full set of cores for additional data analytics. Intel processors in the Platinum class provide higher core counts to meet the most demanding CPU-bound workloads.

Below are several examples of IO-bound workloads:

• Sorting
• Indexing
• Grouping
• Data importing and exporting
• Data movement and transformation

Below are several examples of CPU-bound workloads:

• Clustering/Classification
• Complex text mining
• Natural-language processing
• Feature extraction

Memory Selection

For in-memory data processing engines such as Db2 Warehouse, memory size and memory performance have a larger impact on overall performance. For this reason, a larger memory capacity is recommended for in-memory data processing workloads.

Additional considerations for memory configuration include bandwidth and latency requirements.

Applications with high transactional memory usage should focus on DIMM configurations that are balanced across the CPU memory controllers and their memory channels.
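
A quick way to reason about balance is to check whether a DIMM count spreads evenly over every memory channel. The sketch below assumes six memory channels per processor (two memory controllers with three channels each); that figure is not stated in this document and should be verified against the processor documentation. The function name is hypothetical, and the DIMM counts are taken from the server BOMs in section 8.

```python
def dimm_layout(total_dimms, cpus=2, channels_per_cpu=6):
    """Describe how a DIMM count spreads over CPUs and memory channels.

    channels_per_cpu=6 is an assumption (2 controllers x 3 channels).
    """
    per_cpu, cpu_rest = divmod(total_dimms, cpus)
    per_channel, channel_rest = divmod(per_cpu, channels_per_cpu)
    return {
        "dimms_per_cpu": per_cpu,          # floored if the split is uneven
        "dimms_per_channel": per_channel,  # whole DIMMs on every channel
        # Even means every channel on every CPU holds the same number (>= 1).
        "even": cpu_rest == 0 and channel_rest == 0 and per_channel >= 1,
    }

# DIMM quantities from the BOMs: 6x 32GB (SR630) and 12x 32GB (SR650).
for label, dimms in (("SR630 BOM", 6), ("SR650 BOM", 12)):
    print(label, dimm_layout(dimms))
```

Under this assumption, the 12-DIMM SR650 worker configuration places one DIMM on every channel, while the 6-DIMM SR630 configuration populates only half of the channels per processor, which is acceptable for the lighter Master and Load Balancer roles.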

Persistent storage for containerized workloads

There are two types of storage consumed by containerized applications – ephemeral (non-persistent) and persistent. As the names suggest, non-persistent storage is created and destroyed along with the container and is only used by applications during their lifetime as a container. Hence, non-persistent storage is used for temporary data. When implementing the OpenShift Container Platform, local disk space on the application nodes can be configured and used for the non-persistent storage volumes.

Persistent storage, on the other hand, is used for data that needs to be persisted across container instantiations. An example is a 2 or 3-tier application that has separate containers for the web and business logic tier and the database tier. The web and business logic tier can be scaled out using multiple containers for high availability. The database that is used in the database tier requires persistent storage that is not destroyed.

OpenShift uses a persistent volume framework that operates on two concepts: persistent storage and the persistent volume claim (PVC). Persistent storage is the set of physical storage volumes that are created and managed by the OpenShift cluster administrator. When an application container requires persistent storage, it creates a PVC. The PVC is a unique handle to a persistent volume on the physical storage, although a PVC is not bound to a physical volume until the request is fulfilled. When a container makes a PVC request, OpenShift allocates the physical disk and binds it to the PVC. If the container relocates to another physical server in the cluster during its lifecycle, the PVC binding is maintained. When the container is destroyed, the PVC is released, but the volume bound to it is not deleted unless you explicitly delete that volume. The persistent storage policy for the volume determines when the volume gets deleted.
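
To make the claim concept concrete, the following Python sketch builds a PVC manifest as a dictionary and prints it as JSON. The field names follow the Kubernetes PersistentVolumeClaim API; the claim name `db-data`, the 100 GiB size, and the helper function are hypothetical values chosen for illustration, not settings used by Cloud Pak for Data.

```python
import json

def make_pvc(name, size_gi, storage_class=None, access_mode="ReadWriteOnce"):
    """Build a PersistentVolumeClaim manifest as a plain dictionary."""
    spec = {
        "accessModes": [access_mode],
        "resources": {"requests": {"storage": f"{size_gi}Gi"}},
    }
    # Leaving storageClassName out lets the cluster default apply.
    if storage_class is not None:
        spec["storageClassName"] = storage_class
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": spec,
    }

# Hypothetical claim for a database tier that needs 100 GiB.
pvc = make_pvc("db-data", 100)
print(json.dumps(pvc, indent=2))
```

The printed JSON could be saved to a file and submitted to the cluster; a pod then refers to the claim by name, and the platform binds the claim to a matching volume.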

A variety of persistent storage options are available for OpenShift, including NFS, OpenStack Cinder, Ceph RBD, iSCSI, Fibre Channel SAN, hyperconverged storage using Red Hat OpenShift Container Storage, AWS Elastic Block Store (EBS), and others. For a complete list of these choices and the corresponding requirements, see the link below:

access.redhat.com/documentation/en-us/openshift_container_platform/3.9/html-single/installation_and_configuration/#configuring-persistent-storage

Network Considerations

To support high availability in the data network, redundant switches should be specified for each tier of switches in the cluster. Section 6.1 describes the data network topology with the 10Gb cross-rack redundant switch pairs. These switch pairs should be configured for Virtual Link Aggregation Groups (vLAG) on Lenovo switches (or LACP), which provides coherency between the pairs so that traffic continues to flow when a single switch drops out.

The two server Ethernet ports used for NIC bonding or NIC teaming must also be configured for LACP (mode=4 or mode=802.3ad). This way, a single NIC, network cable, or switch can fail and the network connection continues on the remaining link. The bonded NIC interface also operates at twice the speed of a single port, or 20Gb/s in this configuration.
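
The bonding settings can be illustrated with a short Python sketch that renders a Red Hat style `ifcfg` file for the bond interface and computes the aggregate bandwidth. The interface name `bond0`, the `miimon` interval of 100 ms, and both function names are assumptions for illustration; IP addressing and the member-port files are omitted.

```python
def render_bond_ifcfg(bond="bond0", miimon_ms=100):
    """Render an ifcfg-style file for an LACP (802.3ad) bond interface.

    The bond name and miimon interval are assumed example values.
    """
    lines = [
        f"DEVICE={bond}",
        "TYPE=Bond",
        "BONDING_MASTER=yes",
        "BOOTPROTO=none",
        "ONBOOT=yes",
        # mode=802.3ad is the same bonding mode as mode=4 (LACP).
        f'BONDING_OPTS="mode=802.3ad miimon={miimon_ms}"',
    ]
    return "\n".join(lines) + "\n"

def bonded_bandwidth_gbps(link_gbps, links, failed=0):
    """Aggregate bandwidth of the bond with some member links failed."""
    return link_gbps * max(links - failed, 0)

print(render_bond_ifcfg())
print("healthy:", bonded_bandwidth_gbps(10, 2), "Gb/s")
print("one link down:", bonded_bandwidth_gbps(10, 2, failed=1), "Gb/s")
```

With both 10GbE ports up the bond offers 20 Gb/s in aggregate; after a NIC, cable, or switch failure it continues at 10 Gb/s on the surviving link.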

8 Appendix: Lenovo Bill of materials

This appendix contains the bills of materials (BOMs) for different configurations of hardware for Cloud Pak for Data on OpenShift deployments. There are sections for servers, storage and networking.

8.1 Server BOM

Lenovo ThinkSystem SR630 (Master, Load Balancer)

Code | Description | Quantity
7X02CTO1WW | Server : ThinkSystem SR630 - 3yr Warranty | 1
AUW0 | ThinkSystem SR630 2.5" Chassis with 8 Bays | 1
B4HN | Intel Xeon Gold 5215 10C 85W 2.5GHz Processor | 2
AUND | ThinkSystem 32GB TruDDR4 2666 MHz (2Rx4 1.2V) RDIMM | 6
AUWB | ThinkSystem SR530/SR630/SR570 2.5" SATA/SAS 8-Bay Backplane | 1
5977 | Select Storage devices - no configured RAID required | 1
AUNG | ThinkSystem RAID 530-8i PCIe 12Gb Adapter | 1
B0YM | ThinkSystem 2.5" 2TB 7.2K SAS 12Gb Hot Swap 512e HDD FIPS | 4
AUMV | ThinkSystem M.2 with Mirroring Enablement Kit | 1
B11V | ThinkSystem M.2 5100 480GB SATA 6Gbps Non-Hot Swap SSD | 2
AUKK | ThinkSystem 10Gb 4-port SFP+ LOM | 1
AVW8 | ThinkSystem 550W (230V/115V) Platinum Hot-Swap Power Supply | 2
6400 | 2.8m, 13A/100-250V, C13 to C14 Jumper Cord | 2
AUPW | ThinkSystem XClarity Controller Standard to Enterprise Upgrade | 1
AXCA | ThinkSystem Toolless Slide Rail | 1
B0MJ | Feature Enable TPM 1.2 | 1
B0ML | Feature Enable TPM on MB | 1
AWGE | ThinkSystem SR630 WW Lenovo LPK | 1
AUWX | 8x2.5" HDD BP Cable Kit | 1
AUTC | ThinkSystem SR630 Lenovo Agency Label | 1
AVEN | ThinkSystem 1x1 2.5" HDD Filler | 4
B4NK | ThinkSystem SR630 Refresh MB | 1
AUW7 | ThinkSystem SR630 4056 Fan Module | 2
AULP | ThinkSystem 1U CPU Heatsink | 2
AVJ2 | ThinkSystem 4R CPU HS Clip | 2
AUTJ | ThinkSystem common Intel Label | 1
AUTA | XCC Network Access Label | 1
AUTV | ThinkSystem large Label for non-24x2.5"/12x3.5"/10x2.5" | 1
AVWK | ThinkSystem EIA Plate with Lenovo Logo | 1
AUX4 | MS 1U Service Label LI | 1
AWF9 | ThinkSystem Response time Service Label LI | 1
AUX3 | ThinkSystem SR630 Model Number Label | 1
AUX0 | ThinkSystem Package for SR630 | 1
AVWH | ThinkSystem 550W RDN PSU Caution Label | 1
AUWM | Lenovo ThinkSystem 1U LP+LP BF Riser Dummy | 1
AUWL | Lenovo ThinkSystem 1U LP Riser Dummy | 1
AUWF | Lenovo ThinkSystem Super Cap Holder Dummy | 1
B173 | Companion Part for XClarity Controller Standard to Enterprise Upgrade in Factory | 1
AUWG | Lenovo ThinkSystem 1U VGA Filler | 1
ASFE | Notice for Advanced Format 512e Hard Disk Drives | 1
5PS7A01504 | Essential Service - 3Yr 24x7 4Hr Response + YourDrive YourData | 1
5AS7A02045 | Hardware Installation Server (Business Hours) | 1
7S0FCTO1WW | Red Hat Linux w/Lenovo Support | 1
S0N6 | RHEL Server Physical or Virtual Node, 2 Skt Standard Subscription w/Lenovo Support 3Yr | 1

Lenovo ThinkSystem SR650 (Node)

Code Description Quantity 7X06CTO1WW Server : ThinkSystem SR650 - 3yr Warranty 1 AUVV ThinkSystem SR650 2.5" Chassis with 8, 16 or 24 bays 1 B4HH Intel Xeon Gold 6240 18C 150W 2.6GHz Processor 2 B4H3 ThinkSystem 32GB TruDDR4 2933MHz (2Rx4 1.2V) RDIMM 12 AUR5 ThinkSystem 2U/Twr 2.5" AnyBay 8-Bay Backplane 1 5977 Select Storage devices - no configured RAID required 1 AUNL ThinkSystem 430-8i SAS/SATA 12Gb HBA 1

B589 ThinkSystem U.2 Intel P4610 1.6TB Mainstream NVMe PCIe3.0 x4 Hot Swap SSD

2

B49B ThinkSystem 2.5" Intel S4510 1.92TB Entry SATA 6Gb Hot Swap SSD 6 AUMV ThinkSystem M.2 with Mirroring Enablement Kit 1 B11V ThinkSystem M.2 5100 480GB SATA 6Gbps Non-Hot Swap SSD 2 AUKJ ThinkSystem 10Gb 2-port SFP+ LOM 1 AUKX ThinkSystem Intel X710-DA2 PCIe 10Gb 2-Port SFP+ Ethernet Adapter 1 AVWG ThinkSystem 1600W (230V) Platinum Hot-Swap Power Supply 2 6400 2.8m, 13A/100-250V, C13 to C14 Jumper Cord 2 AUPW ThinkSystem XClarity Controller Standard to Enterprise Upgrade 1 AXCA ThinkSystem Toolless Slide Rail 1 AURD ThinkSystem 2U left EIA Latch Standard 1 B0MJ Feature Enable TPM 1.2 1 B4NL ThinkSystem SR650 Refresh MB 1 AWFF ThinkSystem SR650 WW Lenovo LPK 1 B6ZQ ThinkSystem SR650 Agency Label Lenovo, No CCC 1


AUTY | ThinkSystem 12-15 sequence Label for 24x2.5" Chassis | 1
AUTU | ThinkSystem 4-7 NVMe sequence Label for 16x2.5" and 24x2.5" | 1
AVEQ | ThinkSystem 8x1 2.5" HDD Filler | 2
AUTA | XCC Network Access Label | 1
AVJ2 | ThinkSystem 4R CPU HS Clip | 2
AUSF | Lenovo ThinkSystem 2U MS CPU Performance Heatsink | 2
AUSG | ThinkSystem SR650 6038 Fan module | 1
B173 | Companion Part for XClarity Controller Standard to Enterprise Upgrade in Factory | 1
AUTJ | ThinkSystem common Intel Label | 1
B31F | ThinkSystem M.2 480GB SSD Thermal Kit | 1
AURS | Lenovo ThinkSystem Memory Dummy | 12
AUTQ | ThinkSystem small Lenovo Label for 24x2.5"/12x3.5"/10x2.5" | 1
AWF9 | ThinkSystem Response time Service Label LI | 1
AUSZ | ThinkSystem SR650 Service Label LI | 1
AVWK | ThinkSystem EIA Plate with Lenovo Logo | 1
AUTD | ThinkSystem SR650 model number Label | 1
AUT9 | ThinkSystem 1600W RDN PSU Caution Label | 1
AURT | Lenovo ThinkSystem 2U 3FH Riser Dummy | 1
AURF | Lenovo ThinkSystem 2U 2FH Riser Dummy | 1
AUSA | Lenovo ThinkSystem M3.5" Screw for EIA | 4
AUSU | ThinkSystem Package for SR650 | 1
AUSH | MS First 2U 8x2.5" HDD BP Cable Kit | 1
AUSQ | On Board to 2U 8x2.5" HDD BP NVME Cable | 1
B0ML | Feature Enable TPM on MB | 1
A2HP | Configuration ID 01 | 1
5374CM1 | Configuration Instruction | 1
AVE7 | ThinkSystem 430-8i SAS/SATA 12Gb HBA placement | 1
A2JX | Controller 01 | 1
A2HP | Configuration ID 01 | 1
5PS7A06897 | Premier with Essential - 3Yr 24x7 4Hr Response + YourDrive YourData | 1
5AS7A02045 | Hardware Installation Server (Business Hours) | 1
7S0FCTO1WW | Red Hat Linux w/Lenovo Support | 1
S0N6 | RHEL Server Physical or Virtual Node, 2 Skt Standard Subscription w/Lenovo Support 3Yr | 1
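As a quick sanity check on the SR650 node BOM, the capacity-bearing line items can be tallied per node. The following is a minimal sketch only: the tuple layout and the names `SR650_ITEMS` and `tally_per_node` are illustrative assumptions, not part of any Lenovo tool. Sizes are read from the part descriptions (decimal GB) and represent raw capacity, not usable capacity after mirroring or formatting.

```python
# Capacity-bearing SR650 line items, copied from the BOM descriptions.
# Tuple layout (hypothetical): (feature code, resource kind, size per unit, quantity).
SR650_ITEMS = [
    ("B4HH", "cpu_cores", 18, 2),       # Intel Xeon Gold 6240 18C
    ("B4H3", "memory_gb", 32, 12),      # 32GB TruDDR4 RDIMM
    ("B589", "nvme_gb", 1600, 2),       # 1.6TB U.2 NVMe SSD
    ("B49B", "sata_ssd_gb", 1920, 6),   # 1.92TB SATA SSD
    ("B11V", "m2_boot_gb", 480, 2),     # 480GB M.2 SSD (raw, before mirroring)
]


def tally_per_node(items):
    """Sum size * quantity for each kind of resource."""
    totals = {}
    for _code, kind, size, quantity in items:
        totals[kind] = totals.get(kind, 0) + size * quantity
    return totals


sr650_totals = tally_per_node(SR650_ITEMS)
for kind, total in sr650_totals.items():
    print(f"{kind}: {total}")
```

Under these assumptions each SR650 node carries 36 cores, 384 GB of memory, 3.2 TB of raw NVMe capacity, and 11.52 TB of raw SATA SSD capacity.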


8.2 Networking BOM

Lenovo ThinkSystem NE1032 Switch

Code | Description | Quantity
7159HD1 | Switch : Lenovo ThinkSystem NE1032 RackSwitch (Rear to Front) | 1
AU3A | Lenovo ThinkSystem NE1032 RackSwitch (Rear to Front) | 1
A1PH | 1m Passive DAC SFP+ Cable | 1
6204 | 2.8m, 10A/100-250V, C13 to IEC 320-C20 Rack Power Cable | 2

Lenovo RackSwitch G8052

Code | Description | Quantity
7159G52 | Lenovo System Networking RackSwitch G8052 (Rear to Front) | 1
6201 | 1.5m, 10A/100-250V, C13 to IEC 320-C14 Rack Power Cable | 2
3802 | 1.5m Blue Cat5e Cable | 3
A3KP | Lenovo System Networking Adjustable 19" 4 Post Rail Kit | 1

8.3 Storage BOM

Lenovo ThinkSystem DE6000F (Storage)

Code | Description | Quantity
7Y79CTO1WW | Storage : Lenovo ThinkSystem DE6000F All Flash Array SFF | 1
B38L | Lenovo ThinkSystem Storage 2U24 Chassis | 1
B4D9 | SAS | 1
B4J9 | Lenovo ThinkSystem DE6000 12Gb SAS 4-ports HIC | 2
B4JP | Lenovo ThinkSystem DE6000 Controller 64GB | 2
B4BT | Lenovo ThinkSystem DE Series 800GB 3DWD 2.5" SSD 2U24 | 1
B4RX | Lenovo DE Series 3.84TB 1DWD 2.5" SSD 2U24 | 12
B4BP | Lenovo ThinkSystem Storage USB Cable, Micro-USB | 1
6201 | 1.5m, 10A/100-250V, C13 to IEC 320-C14 Rack Power Cable | 2
B4JD | Lenovo ThinkSystem DE6000F Premium Bundle | 1
B38Y | Lenovo ThinkSystem Storage Rack Mount Kit 2U24/4U60 | 1
B4AR | Lenovo ThinkSystem DE Series Ship Kit (RoW), 2U | 1
B4M0 | Lenovo ThinkSystem DE6000H SMID Controller Base Setting | 1
B4AW | Lenovo ThinkSystem Storage Packaging 2U | 1
B38Z | Lenovo ThinkSystem Storage SFF Drive Filler | 11
B4JF | Lenovo ThinkSystem DE6000F Product Label | 1
B4AY | Lenovo ThinkSystem DE Series 2U24 End Cap Kit (Pair) | 1
B4BG | Lenovo ThinkSystem Storage 2U24 System Label | 1
B4JJ | Lenovo ThinkSystem DE6000H Controller Upgrade Key FC to DE6000F | 1


B4JL | Lenovo ThinkSystem DE6000 Add Snapshot 2048 PFK | 1
B4JH | Lenovo ThinkSystem DE6000H Add Synch Mirroring PFK | 1
B4JG | Lenovo ThinkSystem DE6000H Add Asynch Mirroring PFK | 1
5AS7A02067 | Hardware Installation Storage (Business Hours) | 1

Lenovo ThinkSystem DM5000F (Storage)

Code | Description | Quantity
7Y41CTO1WW | Controller : Lenovo ThinkSystem DM5000F All Flash Array | 1
B38L | Lenovo ThinkSystem Storage 2U24 Chassis | 1
B5RJ | DM Series Premium Offering | 1
B39F | Lenovo ThinkSystem DM Series DM3000/DM5000 Cntr, 16Gb FC/10Gb Opt | 2
B65R | Lenovo ThinkSystem 23TB (6x 3.84TB, 2.5", SSD) Drive Pack for DM5000F | 2
A3RG | 0.5m Passive DAC SFP+ Cable | 2
B4BP | Lenovo ThinkSystem Storage USB Cable, Micro-USB | 1
6311 | 2.8m, 10A/100-250V, C13 to C14 Jumper Cord | 2
B6KC | Lenovo ThinkSystem DM Series ONTAP 9.5 SW, Encryption | 1
B0W1 | 3 Years | 1
B46Y | Foundation Service | 1
B472 | Configured with Lenovo ThinkSystem DM5000F | 1
B38Y | Lenovo ThinkSystem Storage Rack Mount Kit 2U24/4U60 | 1
B4CX | Lenovo ThinkSystem DM Series 2U Accessory | 1
B39L | Lenovo ThinkSystem DM Series 2U24 Bezel | 1
B38Z | Lenovo ThinkSystem Storage SFF Drive Filler | 12
B4BG | Lenovo ThinkSystem Storage 2U24 System Label | 1
B396 | Lenovo ThinkSystem DM5000F Product Label | 1
B4AW | Lenovo ThinkSystem Storage Packaging 2U | 1
B39C | Lenovo ThinkSystem DM Series Ship Kit (RoW) | 1
B4SF | DM Series CIFS Protocol License | 2
B4SG | DM Series NFS Protocol License | 2
B4SH | DM Series iSCSI Protocol License | 2
B4SJ | DM Series FCP Protocol License | 2
B4SK | DM Series SnapMirror License | 2
B4SL | DM Series SnapRestore License | 2
B4SM | DM Series FlexClone License | 2
B4SN | DM Series Software Encryption License | 2
B4SP | DM Series SnapManager License | 2
B4SU | TPM | 2
B5AZ | DM Series SnapVault License | 2
B7AQ | SnapMirror Synchronous | 2


5WS7A18251 | Foundation- 3Y NBD ThinkSystem DM5000F AFA | 1
5WS7A18257 | Foundation- 3Y NBD DM5000F 46TB (12x 3.84TB SSD) Pack | 1
5AS7A02067 | Hardware Installation Storage (Business Hours) | 1

Auto-Derived Part Items
AU16 | 0.5m External MiniSAS HD 8644/MiniSAS HD 8644 Cable | 2
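The drive and filler quantities in the two storage BOMs can be cross-checked against the 24-bay chassis. The sketch below is illustrative only: the dictionary layout and the names `ARRAYS` and `summarize` are assumptions, and the 24-bay figure is inferred from the "2U24" chassis name. Capacities are raw decimal GB taken from the part descriptions.

```python
BAYS_PER_2U24_CHASSIS = 24  # assumption: inferred from the "2U24" chassis name
SSD_3_84TB_GB = 3840        # 3.84TB SSD, in decimal GB

# Per array: populated drives as (count, capacity in GB), plus drive filler count.
ARRAYS = {
    # B4BT: 1 x 800GB SSD, B4RX: 12 x 3.84TB SSD, B38Z: 11 fillers
    "DE6000F": {"drives": [(1, 800), (12, SSD_3_84TB_GB)], "fillers": 11},
    # B65R: 2 packs of 6 x 3.84TB SSD, B38Z: 12 fillers
    "DM5000F": {"drives": [(2 * 6, SSD_3_84TB_GB)], "fillers": 12},
}


def summarize(array):
    """Return (populated drives, bays accounted for, raw capacity in GB)."""
    drive_count = sum(count for count, _gb in array["drives"])
    raw_gb = sum(count * gb for count, gb in array["drives"])
    return drive_count, drive_count + array["fillers"], raw_gb


summary = {name: summarize(spec) for name, spec in ARRAYS.items()}
for name, (drives, bays, raw_gb) in summary.items():
    status = "ok" if bays == BAYS_PER_2U24_CHASSIS else "mismatch"
    print(f"{name}: {drives} drives, {bays} bays ({status}), {raw_gb} GB raw")
```

With these inputs both arrays account for all 24 bays, and the DM5000F total of 46,080 GB agrees with the "46TB (12x 3.84TB SSD)" service line item.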


Resources

For more information, see the following resources:

Lenovo ThinkSystem SR650 server:

• Product guide: https://lenovopress.com/lp1050-thinksystem-sr650-server-xeon-sp-gen2
• 3D Tour: https://lenovopress.com/lp0673-3d-tour-thinksystem-sr650

Lenovo ThinkSystem SR630 server:

• Product guide: https://lenovopress.com/lp1049-thinksystem-sr630-server-xeon-sp-gen2
• 3D Tour: https://lenovopress.com/lp0672-3d-tour-thinksystem-sr630

Lenovo ThinkSystem DE6000F storage:

• Product guide: https://lenovopress.com/lp0910-lenovo-thinksystem-de6000f-all-flash-array
• 3D Tour: https://lenovopress.com/lp0956-thinksystem-de-all-flash-interactive-3d-tour

Lenovo ThinkSystem DM5000F storage:

• Product guide: https://lenovopress.com/lp0911-thinksystem-dm5000f-unified-flash-array
• 3D Tour: https://lenovopress.com/lp0958-thinksystem-dm-all-flash-interactive-3d-tour

Lenovo RackSwitch G8052 (1GbE Switch):

• Product guide: https://lenovopress.com/tips1270-lenovo-rackswitch-g8052

Lenovo RackSwitch NE1032/NE1032T (10GbE Switch):

• Product guide: https://lenovopress.com/lp0605-thinksystem-ne1032-rackswitch
• Product guide: https://lenovopress.com/lp0606-thinksystem-ne1032t-rackswitch

Lenovo ThinkSystem NE10032 (40GbE/100GbE Switch):

• Product guide: https://lenovopress.com/lp0609-lenovo-thinksystem-ne10032-rackswitch

Lenovo XClarity Administrator:

• Product guide: https://lenovopress.com/tips1200-lenovo-xclarity-administrator

IBM Cloud Pak for Data:

• Product page: https://www.ibm.com/analytics/cloud-pak-for-data
• Manual guide: https://www.ibm.com/com.ibm.icpdata.doc/zen/overview/overview.html

IBM DB2 warehouse:

• Product page: https://www.ibm.com/products/db2-warehouse

Red Hat OpenShift:

• Product page: https://www.redhat.com/en/technologies/cloud-computing/openshift
• Version 3.11 manual: https://docs.openshift.com/container-platform/3.11/welcome/index.html


Document history

Version 1.0 | 14 Oct 2019 | First version


Trademarks and special notices

© Copyright Lenovo 2019.

Lenovo and the Lenovo logo are trademarks or registered trademarks of Lenovo in the United States, other countries, or both. These and other Lenovo trademarked terms are marked on their first occurrence in this information with the appropriate symbol (® or ™), indicating US registered or common law trademarks owned by Lenovo at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of Lenovo trademarks is available from https://www.lenovo.com/us/en/legal/copytrade/.

The following terms are trademarks of Lenovo in the United States, other countries, or both:

AnyBay™
Flex System™
Lenovo®
Lenovo XClarity™
RackSwitch™
Lenovo(logo)®
System x®
ThinkSystem™
TruDDR4™

The following terms are trademarks of other companies:

Intel, Xeon, and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.

Linux is a trademark of Linus Torvalds in the United States, other countries, or both.

Other company, product, or service names may be trademarks or service marks of others.

References in this document to Lenovo products or services do not imply that Lenovo intends to make them available in every country.

Information is provided "AS IS" without warranty of any kind.

All customer examples described are presented as illustrations of how those customers have used Lenovo products and the results they may have achieved. Actual environmental costs and performance characteristics may vary by customer.

Information concerning non-Lenovo products was obtained from a supplier of these products, published announcement material, or other publicly available sources and does not constitute an endorsement of such products by Lenovo. Sources for non-Lenovo list prices and performance numbers are taken from publicly available information, including vendor announcements and vendor worldwide homepages. Lenovo has not tested these products and cannot confirm the accuracy of performance, capability, or any other claims related to non-Lenovo products. Questions on the capability of non-Lenovo products should be addressed to the supplier of those products.

All statements regarding Lenovo future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only. Contact your local Lenovo office or Lenovo authorized reseller for the full text of the specific Statement of Direction.

Some information addresses anticipated future capabilities. Such information is not intended as a definitive statement of a commitment to specific levels of performance, function or delivery schedules with respect to any future products. Such commitments are only made in Lenovo product announcements. The information is presented here to communicate Lenovo’s current investment and development activities as a good faith effort to help with our customers' future planning.

Performance is based on measurements and projections using standard Lenovo benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon considerations such as the amount of multiprogramming in the user's job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve throughput or performance improvements equivalent to the ratios stated here.

Photographs shown are of engineering prototypes. Changes may be incorporated in production models.

Any references in this information to non-Lenovo websites are provided for convenience only and do not in any manner serve as an endorsement of those websites. The materials at those websites are not part of the materials for this Lenovo product and use of those websites is at your own risk.