
The Survival Handbook for Deploying to On-Premise, Air-Gapped, or Multi-Cloud Environments

BY GRAVITATIONAL


CONTENTS

INTRODUCTION

LIFECYCLE MANAGEMENT AND OPERATIONS
    Avoiding Deployment Fragmentation
    Simplifying Installation Complexity
    Handling Incompatible Resources
    Reducing Upgrade Complexity
    Handling Application Upgrade Failures
    Creating Consistent Application Environments

MANAGING RELEASE AND UPGRADE CYCLES
    Shortening Upgrade Cycle Times
    Managing On-Prem Releases and Versioning
    Upgrade Scheduling
    Managing External Dependencies
    Publishing Installable Software

THE ROAD TO PRODUCTION
    Managing Open Source Software Dependencies
    How to Pass Security Audits
    Managing Evaluations and Monitoring Usage
    Managing Database Deployments
    Simplifying Kubernetes Operations
    Recovering from Failures
    Monitoring and Troubleshooting Deployments
    Setting Up Data Storage
    Maintaining Data Integrity
    Accessing Deployments Remotely
    Network Configuration

MANAGING ORGANIZATIONAL ISSUES
    Setting Expectations with Customer IT Organizations

CONCLUSION


INTRODUCTION

Almost all successful B2B SaaS vendors eventually receive a request from a large, security-minded customer to deliver their solution on-premise, into multiple co-location centers, or across cloud providers. There are several reasons why customers want this, but they all fall into two major categories:

• Security/regulatory compliance: Sensitive data requires privileged access exclusively from within the company or through vetted service providers. This data cannot leave the premises.

• Data locality and latency: Either the components are intended to run in the data center (network and security monitors, load balancers, web application firewalls), or it is easier to run the application where the data is already located (data processing and machine learning services).

After several years of deploying and running complex applications in some of the most secure, air-gapped data centers in the world, we put together this survival handbook to help companies evaluate, prepare and survive going on-prem.

In this guide, we share some of the technical and organizational challenges we have seen when going on-prem. We also present some of the solutions we have researched, developed and productized through our Gravity platform. Because we believe that Kubernetes offers numerous advantages when delivering complex applications on-prem or into one or multiple cloud providers such as AWS, Azure, or GCP, many of our solutions will focus on how to leverage Kubernetes to overcome the challenges described.

In addition to this guide, we also offer a series of workshops that focus on the more technical aspects of the technologies we use, Docker and Kubernetes.

Before we dive in, we want to mention an important caveat: You should only offer on-prem installations if customers are ready to pay a large premium for them. Such installations require a significant investment on your part, so you should be sure there is significant and repeatable demand for your efforts. Of course, as the people behind Gravity, we also encourage you to explore how it can help with your multi-cloud, on-prem, or air-gapped deployment efforts.

If you are ready to forge ahead, the challenges that you will face fall into the following categories:

• Lifecycle management and operations: installing and upgrading applications.
• Release cycle management: packaging, publishing and versioning releases.
• Production readiness: security, licensing, monitoring and high availability.

Finally, there are some challenges not directly related to the technical implementation details, and we will touch upon these as well. Good luck. We hope this handbook helps.


LIFECYCLE MANAGEMENT AND OPERATIONS

Avoiding Deployment Fragmentation
Delivering an on-prem offering in addition to an existing, hosted offering can result in two different deployment modes. This leads to a bifurcation of team responsibilities and can double the amount of work.

By migrating to Kubernetes and its native package manager, Helm, you can unify both deployments. When used as the primary platform, Kubernetes abstracts away the details of the underlying infrastructure such as disks, load balancers and network security rules. Helm splits components into independent packages.

Once the migration to Kubernetes and Helm is complete, the on-prem edition becomes just another deployment target alongside cloud deployments.

Many of our customers use Gravity’s supported upstream Kubernetes as a deployment target for on-prem deployments and a managed Kubernetes service like GKE or AKS for their cloud deployments.
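For illustration, once both editions are Helm-based, a small deployment script can treat cloud and on-prem as interchangeable targets. This is a hedged sketch, not a prescribed workflow: the release name, chart path, kubeconfig contexts and values files are hypothetical.

    import subprocess

    # Hypothetical targets; the point is that cloud and on-prem become the same
    # Helm release with different kubeconfig contexts and values files.
    TARGETS = {
        "cloud": {"context": "gke-production", "values": "values-cloud.yaml"},
        "onprem": {"context": "gravity-customer-a", "values": "values-onprem.yaml"},
    }

    def deploy(target, release="myapp", chart="./charts/myapp"):
        cfg = TARGETS[target]
        # Install or upgrade the same chart, switching only context and values.
        subprocess.run(
            [
                "helm", "upgrade", "--install", release, chart,
                "--kube-context", cfg["context"],
                "--values", cfg["values"],
            ],
            check=True,
        )

    if __name__ == "__main__":
        deploy("onprem")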

Simplifying Installation Complexity
Installing a highly available, distributed system is difficult on an infrastructure you control, not to mention infrastructure that you don't. Setting up dozens of components and dependencies leads to a multi-step installation process that is very hard for untrained, on-site personnel to debug and execute.

Completing an installation may take days and numerous attempts to get right. Some installations will fail and it will take many hours of back-and-forth with the customer to troubleshoot the root cause. Eventually, customers may entirely abandon the idea in frustration, which can damage your reputation.

The key to success is automating as many steps in the installation process as possible, thereby removing the human factor. It can be difficult to automate the installation of complex applications for some environments. It's wise to limit the types of supported environments and components so that you can safely automate (see "Handling Incompatible Resources" below). You should also set up an easy way to log and share information externally for debugging purposes.

Support teams should have the ability to roll back the installation to the point of failure rather than restarting from scratch. This can save many hours, particularly when failures occur late in the process.

Gravity automates the installation of Kubernetes along with all dependencies and application containers, reducing the number of installation steps to one command. It lets users manually override the automated installation if a failure occurs, and includes a simple way to create comprehensive operational reports.


Handling Incompatible Resources
In an infrastructure you don't control, installation problems can be caused by a variety of factors such as slow disks, slow networks or an old OS distribution. These can lead to hours of troubleshooting, because it will be unclear why the installation failed to work with a seemingly correct setup and configuration.

To avoid these problems, always specify and enforce requirements for disk space and speed, network speed, and the requisite OS distribution. The system should refuse to install unless these requirements are met, and it should be able to indicate the requirements that are not met. Guidance provided in the form of documentation rarely works, because documentation is often ignored. We use a set of pre-checks that specify the acceptable operating systems, disk speed and capacity, network bandwidth, as well as the requisite open ports.

You should also equip your services teams with lightweight tools to pre-check system readiness, such as our gravity status tool. These should be run before the installation begins to make sure that basic requirements are satisfied.

Here is some advice on more specific requirements to consider:

• When using Kubernetes, require a separate disk for etcd (the internal Kubernetes database) and any other database that you ship. This requirement can be lifted for trial deployments but should definitely be included for production deployments.

• Isolate slow network-attached storage by enforcing minimum performance requirements for storage volumes. At a minimum, pick something as low as 20 MB/s just to eliminate completely incompatible or broken storage.

• Always set capacity requirements for temporary, root and database partitions. You will be surprised how often you will get VMs with minimal disk space available if you don't.

• Apply baseline network throughput requirements. Setting something as low as 5 MB/s will spare you from troubleshooting congested networks.

• Specify and encode all networking and port requirements needed for the application to run.

• Start with one or two of the most popular supported OS distributions. Typically, larger customers have RHEL available. This will spare you from troubleshooting a range of five different distros and kernels. Here are our guidelines on supported distributions, for reference.
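To make the idea concrete, here is a minimal sketch of the kind of pre-flight checks described above. It is not Gravity's actual pre-check logic; the thresholds, port list and supported distributions are illustrative only.

    #!/usr/bin/env python3
    """Illustrative pre-flight checks; thresholds and values are examples only."""
    import shutil
    import socket
    import sys

    MIN_FREE_DISK_GB = 50                              # room for images, etcd, logs
    REQUIRED_FREE_PORTS = [2379, 2380, 6443, 10250]    # etcd, API server, kubelet
    SUPPORTED_DISTROS = ("rhel", "centos", "ubuntu")   # start small, expand later

    def check_disk(path="/var/lib"):
        free_gb = shutil.disk_usage(path).free / 2**30
        return free_gb >= MIN_FREE_DISK_GB, f"{path}: {free_gb:.1f} GiB free (need >= {MIN_FREE_DISK_GB})"

    def check_port_free(port):
        # A port we can bind to is not already taken by conflicting software.
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            try:
                s.bind(("0.0.0.0", port))
                return True, f"port {port} is free"
            except OSError:
                return False, f"port {port} is already in use"

    def check_distro():
        try:
            with open("/etc/os-release") as f:
                info = dict(line.rstrip().split("=", 1) for line in f if "=" in line)
        except FileNotFoundError:
            return False, "cannot read /etc/os-release"
        distro = info.get("ID", "").strip('"')
        return distro in SUPPORTED_DISTROS, f"OS distribution: {distro or 'unknown'}"

    def main():
        results = [check_disk()] + [check_port_free(p) for p in REQUIRED_FREE_PORTS]
        results.append(check_distro())
        for ok, msg in results:
            print(("OK   " if ok else "FAIL ") + msg)
        # Refuse to proceed unless every requirement is met.
        sys.exit(0 if all(ok for ok, _ in results) else 1)

    if __name__ == "__main__":
        main()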

Reducing Upgrade Complexity
Installing distributed systems is difficult, but upgrading them is an order of magnitude more complex. Sometimes only certain components of the system need to be upgraded, but it may be difficult to upgrade only those components in a safe way.

Upgrade failures can turn into quagmires. During an upgrade operation in a local environment, there is no easy way to reinstall the OS or add new nodes to the rotation. Any part of the upgrade can fail at any time due to known or unknown circumstances, like power outages, systems running out of disk space or simply containers hanging because of older kernels. Complex updates will contribute to longer upgrade cycles, as customers will be wary of the risk of spending 2-3 days upgrading the system.

Our upgrade process consists of a single command that launches a full cluster and application upgrade. If the upgrade fails, it can be easily resumed from the explicit stage it last completed, instead of restarting from the beginning.

This approach lets users continue the upgrade in the face of unexpected failures and also keeps the cluster running during failures. It makes a good impression on the customer by letting them know at which stage the upgrade failed and providing insight as to why it happened. This is difficult with a black box upgrade procedure.
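For teams building their own upgrade tooling, the resume-from-last-completed-stage idea can be sketched as follows. This is a simplified illustration, not Gravity's implementation; the stage names and state file are hypothetical.

    import json
    import os

    # Hypothetical upgrade stages; a real plan would be generated per release.
    STAGES = ["preflight", "backup", "upgrade-control-plane", "upgrade-workers",
              "migrate-database", "upgrade-application", "health-check"]

    STATE_FILE = "/var/lib/upgrade/state.json"

    def load_completed():
        if os.path.exists(STATE_FILE):
            with open(STATE_FILE) as f:
                return json.load(f)
        return []

    def mark_completed(completed):
        os.makedirs(os.path.dirname(STATE_FILE), exist_ok=True)
        with open(STATE_FILE, "w") as f:
            json.dump(completed, f)

    def run_stage(name):
        # Placeholder: each stage would invoke real upgrade logic here.
        print(f"running stage: {name}")

    def upgrade():
        completed = load_completed()
        for stage in STAGES:
            if stage in completed:
                print(f"skipping already completed stage: {stage}")
                continue
            try:
                run_stage(stage)
            except Exception as err:
                # Surface exactly where the upgrade stopped; re-running the
                # command resumes from this stage instead of starting over.
                print(f"upgrade failed at stage '{stage}': {err}")
                raise
            completed.append(stage)
            mark_completed(completed)

    if __name__ == "__main__":
        upgrade()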

Handling Application Upgrade Failures
Having a platform like Gravity is helpful, but it's not a magic bullet. If the application is not architected correctly, an upgrade can lead to failed database migrations and lost data, which can lead to many hours of troubleshooting and rollbacks.

We offer guides and training on implementing proper application upgrade procedures. Here is just one example:

During the upgrade process, we strongly advise taking automatic backups of the system and draining off the write traffic to the database to avoid conflicts during migration. In addition, we recommend using a test suite like robotest to run automated regression and upgrade tests with every code and deployment change.

We also offer upgrade hooks that your application can use with Gravity. Here is a sample application upgrade process that can be automated with the upgrade hook:

• Run migrations as a separate process for the cluster instead of running them as part of an individual service startup.

• Switch the product landing page and API endpoints to show an "upgrade page." This prevents writes to the database during the migration process.

• Drain off the traffic to databases.

• Make a backup of the data.

• Run migrations on the database.

• Check that the migrations ran safely by using a simple sanity test.

• Upgrade services.

• Switch the traffic back to the services from the landing page.
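For illustration, the list above could be driven by an upgrade hook along these lines. The function bodies are placeholders for calls into your own application and database tooling; they are not Gravity APIs.

    def enable_maintenance_page():
        """Point the landing page and API endpoints at an 'upgrade in progress' page."""

    def drain_database_writes():
        """Stop accepting writes so the migration runs against a quiescent database."""

    def backup_data():
        """Take a backup that can be restored if the migration fails."""

    def run_migrations():
        """Run schema migrations as a dedicated job, not as part of service startup."""

    def sanity_check():
        """Cheap read/write test to confirm the migrated schema behaves as expected."""
        return True

    def upgrade_services():
        """Roll out the new service versions."""

    def restore_traffic():
        """Switch traffic back from the maintenance page to the services."""

    def upgrade_hook():
        enable_maintenance_page()
        drain_database_writes()
        backup_data()
        run_migrations()
        if not sanity_check():
            raise RuntimeError("post-migration sanity check failed; "
                               "keep the maintenance page up and investigate")
        upgrade_services()
        restore_traffic()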


Creating Consistent Application Environments
Even though Kubernetes and Docker can abstract away infrastructure differences, you must still ensure they are consistent across deployments. Each version (Kubernetes, Docker, your app and its dependencies, etc.) is slightly different, and you will encounter slightly different behavior with various combinations of software, OS distributions and storage engines. This introduces fragmentation, and your ops team will constantly need to question customers about component versions and configurations.

To avoid this, we create a “bubble of consistency” by using the following methodologies:

• We package Kubernetes and all of its dependencies, including etcd, docker, dnsmasq, system and others, and we test to make sure they are compatible before installation. This helps to ensure that conflicting software is not running on the host during the installation.

• We isolate the processes by running them in a special Linux container. This minimizes interaction with distribution packages.

• The runtime section of our application manifest sets up approved Docker storage drivers that are production-ready and can work reliably without losing data.

• We support the most popular OS distributions and we specify other requirements. Components are tested before each release.

Taking these steps means that your support and services teams will never have to ask questions about which Docker or dnsmasq version is installed, because the packages are predetermined and tested for reliability and supportability.


MANAGING RELEASE AND UPGRADE CYCLES

Shortening Upgrade Cycle Times
One of the biggest shocks SaaS companies encounter when delivering software on-prem is the length of release cycles. SaaS businesses are accustomed to releasing multiple times a day. Shipping versions on-prem, even with bi-weekly updates, poses a challenge for them, especially if the system is a mix of microservices with loosely coupled release cycles.

Many customers are simply wary of upgrading complex systems because they often break and require full reinstalls and/or lead to an outage. As a result, these customers often fail to keep versions up to date, with upgrade cycles as long as one year. This puts a lot of strain on the team that has to support older versions of the software. However, if you can provide a method for simple and stable upgrades, teams are usually open to more frequent upgrades, and it is possible to get down to bi-weekly upgrade frequency with most of your customer base.

Managing On-Prem Releases and Versioning
It is important to use the same platform for your on-prem and cloud deployments. We recommend using Kubernetes for both.

Picking the right versioning scheme is mission-critical for on-prem deployments. Unlike in SaaS deployments, versioning plays a very important role because it is used to inform customers about the frequency of software release cycles and the risks associated with upgrades.

For versioning, adopt semantic versioning and set up clear dependencies between components. Signal upgrade risk clearly to the customer through the major, minor and patch versions of the software. For example, with semantic versioning, customers would expect an upgrade between patch versions 2.5.1 and 2.5.2 to be trivial and backwards compatible, an upgrade from 2.6.3 to 2.7.3 to be possible but somewhat riskier and potentially involving migrations, and an upgrade from 2.0.0 to 3.0.0 to be a major undertaking.
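As a concrete illustration of this signaling, a small helper can classify upgrade risk from two semantic versions. The classification messages simply mirror the example above; pre-release tags and build metadata are ignored for brevity.

    def parse_semver(version):
        major, minor, patch = (int(part) for part in version.split("."))
        return major, minor, patch

    def upgrade_risk(current, target):
        """Classify the risk of upgrading between two semantic versions."""
        cur_major, cur_minor, _ = parse_semver(current)
        tgt_major, tgt_minor, _ = parse_semver(target)
        if tgt_major != cur_major:
            return "major: plan a dedicated upgrade project"
        if tgt_minor != cur_minor:
            return "minor: possible migrations, schedule a maintenance window"
        return "patch: trivial and backwards compatible"

    # Examples matching the text above:
    print(upgrade_risk("2.5.1", "2.5.2"))   # patch
    print(upgrade_risk("2.6.3", "2.7.3"))   # minor
    print(upgrade_risk("2.0.0", "3.0.0"))   # major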

For packaging, use Helm, the Kubernetes package manager, and its best practices to transition microservice releases to a package-style approach with clear dependencies. We have a first-class integration with Helm to simplify the build and deployment process.

Upgrade Scheduling
Establish a stable upgrade schedule for customers. This allows them to plan on their side and to include upgrades in their development milestones. Here are a couple of recommendations on upgrade schedules:

• Publish bi-weekly upgrades for stateless services, so that most of the customer's deploys are up to date. These upgrades should not run any migrations or perform any dangerous or risky operations.

• Provide more complex upgrades that involve database schema migrations on a monthly schedule.


The Golang programming language is a very good example of a team publishing upgrades on a predictable schedule – in this case, every 6 months. We upgrade the platform (Kubernetes and dependencies) approximately every two months using Gravity’s LTS release upgrade schedules.

Managing External Dependencies
Many on-prem deploys are air-gapped, which means they cannot make any outbound internet calls to function or update. This makes it impossible to provide installations, patches or updates if they pull dependencies from external resources.

To solve this problem, we designed our deployments to be entirely self-sufficient using the following methodologies:

• The build process scans all Kubernetes Docker image dependencies and packages them with every install. See the documentation on tele build.

• We ship a self-contained Docker registry that hosts the images, so a cluster remains highly available and can pull images from local registries instead of pulling from the internet.
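Gravity's tele build performs this image collection automatically. For teams assembling their own air-gapped installer, the underlying idea at build time is to mirror every required image into the registry that ships with the installer. A rough sketch follows; the image list and registry address are hypothetical.

    import subprocess

    # Hypothetical image list; a build tool would discover these from manifests.
    IMAGES = [
        "quay.io/example/api:2.7.3",
        "quay.io/example/worker:2.7.3",
        "postgres:11.5",
    ]

    LOCAL_REGISTRY = "registry.local:5000"  # the self-contained registry shipped with the installer

    def mirror(image):
        # Re-tag the public image so it resolves against the local registry,
        # then push it there; the cluster pulls only from the local registry.
        name = image.split("/")[-1]
        target = f"{LOCAL_REGISTRY}/{name}"
        subprocess.run(["docker", "pull", image], check=True)
        subprocess.run(["docker", "tag", image, target], check=True)
        subprocess.run(["docker", "push", target], check=True)
        return target

    if __name__ == "__main__":
        for img in IMAGES:
            print("mirrored to", mirror(img))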

Publishing Installable Software
SaaS companies are usually not familiar with the process of publishing downloadable software. Sending out binaries without an official, centralized process to download and validate software appears unprofessional and results in a bad user experience. Customers end up sharing password-protected FTP endpoints and sending passwords over email. In addition, there should be a seamless process for sending out updates and patches and for monitoring the status of each download.

To ensure a positive customer experience, our method enables customers to publish applications so their users can download, install and pull updates manually (for offline situations) or automatically, depending on their security and deployment practices.


THE ROAD TO PRODUCTION

Managing Open Source Software Dependencies
You will frequently be asked to provide a full list of the third-party software used, with all the versions and dependencies shipped with your product. This is to make sure there are no copyright infringements and to reduce the likelihood of vulnerabilities. Meeting this requirement involves scanning the product for licenses, collecting all the software versions and assessing the license dependencies, a time-consuming task that can block a deal until completed.

We recommend using Fossa to set up on-going scans for every pull request. If you do come across a restrictive license, we recommend checking in with a copyright lawyer. You can also reference TLDR to educate yourself on the most common licenses.

For Docker containers, use a private registry with security scanning capabilities that can show the software packages and any reported Common Vulnerabilities and Exposures (CVEs). Quay.io is a good example.

How to Pass Security Audits
Most likely, one of the major reasons your customer requested an on-prem install was to meet tight security requirements. A full security audit may be required to obtain a green light for the production deploy, especially if your customer is a regulated entity like a bank or government agency. You may need to redesign a deployment on short notice if vulnerabilities are discovered.

Infrastructure security audits vary in their level of thoroughness, but they usually consist of network security scans and application black box scans. Here are some important steps to take to make a Kubernetes application ready for a customer-driven external audit.

Network security. A network security scanner will identify any ports that respond with plain-text HTTP or that use weak ciphers or older protocols like SSLv2. An application security scanner will discover basic vulnerabilities, for example if the server discloses its version to non-authenticated clients or contains dependencies with versions known to be vulnerable to CSRF attacks. In addition, a security auditor may conduct a more advanced review by trying to find hard-coded secrets in the code or break into the application.

Application security. Important steps include:

• Set up mutual TLS in your application using sidecar patterns (see the sketch after this list). As a rule of thumb, there should be no unencrypted data flowing between servers.

• Do not use the same static passwords/API keys for every install. Generate them on the fly during the installation process.

• Disable weak ciphers, using Mozilla's recommendations as a starting point.

• If the web page or endpoint is external (customer-facing), make sure TLS ciphers and certificates are configurable, as all large customers have their own guidelines and requirements. This is a common gotcha with TLS.

• Focus on common web security issues by going through the OWASP Top 10.
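Here is a minimal sketch of several of the points above in a Python service: mutual TLS, TLS 1.2 and newer only, and configurable certificates and ciphers. The file paths, environment variable names and cipher string are illustrative defaults, not a compliance recommendation; start from Mozilla's current guidance.

    import os
    import ssl

    # Paths are illustrative; large customers will want to supply their own
    # certificates and cipher preferences, so keep these configurable.
    CERT_FILE = os.environ.get("APP_TLS_CERT", "/etc/app/tls/server.crt")
    KEY_FILE = os.environ.get("APP_TLS_KEY", "/etc/app/tls/server.key")
    CLIENT_CA = os.environ.get("APP_TLS_CLIENT_CA", "/etc/app/tls/clients-ca.crt")
    CIPHERS = os.environ.get("APP_TLS_CIPHERS", "ECDHE+AESGCM:ECDHE+CHACHA20")

    def build_server_context():
        ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
        ctx.minimum_version = ssl.TLSVersion.TLSv1_2   # rejects SSLv2/SSLv3/TLS 1.0/1.1
        ctx.set_ciphers(CIPHERS)                       # drop weak ciphers
        ctx.load_cert_chain(certfile=CERT_FILE, keyfile=KEY_FILE)
        # Mutual TLS: require and verify a client certificate on every connection.
        ctx.verify_mode = ssl.CERT_REQUIRED
        ctx.load_verify_locations(cafile=CLIENT_CA)
        return ctx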


Once your product is ready to be made available for wider distribution, it is helpful to engage a third-party security review agency to conduct an external review. We recommend Cure53. We have had positive experiences working with them over the past years, and they will publish their work upon request.

Kubernetes-specific security. Kubernetes has its own deployment gotchas that will be important at audit time.

Set up a restrictive Kubernetes deployment by following fine-grained security policies. For example, make sure that containers are not privileged and do not run as root unless they need to.

Use Kubernetes secrets to store infrastructure secrets like API keys and database passwords.
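Combining this with the earlier point about generating unique credentials per install, here is a hedged sketch using the official Kubernetes Python client. The secret name, keys and namespace are placeholders.

    import secrets

    from kubernetes import client, config

    def create_install_credentials(namespace="myapp"):
        # Generate credentials unique to this installation instead of shipping
        # the same static passwords or API keys with every install.
        creds = {
            "db-password": secrets.token_urlsafe(32),
            "api-key": secrets.token_urlsafe(32),
        }
        config.load_kube_config()  # or load_incluster_config() when running in a pod
        core = client.CoreV1Api()
        secret = client.V1Secret(
            metadata=client.V1ObjectMeta(name="myapp-credentials"),
            string_data=creds,   # the API server stores these base64-encoded
            type="Opaque",
        )
        core.create_namespaced_secret(namespace=namespace, body=secret)
        return creds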

If the application is not ready to set up and handle TLS in a scalable way on its own (for example, Python or Node.js services), it is helpful to set up a proxy sidecar container that terminates TLS and sends traffic to the local app. Read more on sidecar containers here.

Gravity itself is reasonably audit-ready. It uses mutual TLS on the control plane and follows the security best practices for Kubernetes deployments.


Managing Evaluations and Monitoring Usage
Evaluations. Selling downloadable software requires a certain level of trust. Our position is that if someone really wants to pirate your software, they will likely succeed. Instead of spending expensive engineering cycles creating "unhackable" software, we recommend limiting your dealings to reputable customers who would not risk their reputation by knowingly using your software illegally.

Initially, you may not need license enforcement to cover all use cases, but a time-based "reminder" flow for trials is a good minimal implementation. Be aware that without enforcement, evaluation or POC periods can end up extending beyond the intended time frame, resulting in longer sales cycles.

Gravity has a way to define a limited trial license in the application manifest. This will shut down the software or limit the number of servers it is used on during the trial period to motivate the customer to close faster.
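Outside of Gravity's manifest-based trial licenses, a minimal time-based reminder flow might look like the sketch below. The license payload, dates and grace period are illustrative; a real implementation would verify a signed license rather than a plain dictionary.

    import datetime

    # Illustrative license payload; in practice this would be signed and verified.
    LICENSE = {
        "customer": "Example Corp",
        "expires": "2019-12-31",
        "max_servers": 3,
    }

    GRACE_DAYS = 14  # keep reminding for a while before limiting functionality

    def check_license(today=None):
        today = today or datetime.date.today()
        expires = datetime.date.fromisoformat(LICENSE["expires"])
        if today <= expires:
            days_left = (expires - today).days
            if days_left <= GRACE_DAYS:
                print(f"Trial license expires in {days_left} days; contact sales to extend.")
            return True
        # Past expiration: degrade gracefully (read-only mode, banner, node limit)
        # rather than abruptly destroying customer data.
        print("Trial license has expired; running in limited mode.")
        return False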

Monitoring Usage. It’s much easier to monitor usage of hosted software than installable software. With installable software, you need to develop a way to monitor and enforce usage according to the license.

Many customers will not want to report usage back to you automatically, as one of the reasons for running the application on-prem may be data privacy. You'll therefore need to develop some other reporting mechanism and write it into the contract to monitor and enforce usage according to the license. Many customers will send quarterly summary reports. Usage is usually bucketed into tiers or plans, so fine-grained usage reporting is not necessary.

There are several third-party vendors that can take care of license enforcement. In our experience they are either too complex or designed for legacy software, making their adoption for SaaS offerings a challenge.

Gravity Clusters come with a fully configured and customizable monitoring/alerting system by default, consisting of Prometheus and Grafana.

Managing Database Deployments
It is very difficult to deploy a traditional database on-prem in a highly available manner without risking data loss. Unfortunately, Kubernetes does not bring an out-of-the-box solution to the problem.

There are entire books written on this subject. To keep it short, if you don’t have significant in-house expertise with a database, find a good partner that will provide and support a production-ready deployment of that database on Kubernetes. We partner with Citus Data to deliver production ready HA Postgres with on-prem deployments.


Simplifying Kubernetes Operations
Kubernetes is a complex system that consists of a distributed database (etcd), an overlay network (VXLAN), a container engine (Docker), a Docker registry and many other components, like iptables rules, that must be kept in mind. A successful install is just the beginning of the customer relationship. The platform will inevitably degrade over time. Here are just some of the problems we have encountered in the past:

• Security teams automated port blocking, stopping services without warning.
• A customer set up monitoring daemons that consumed all the RAM on the host.
• A system ran out of disk space.
• A customer's DNS server blocked queries.

What happens if an installation fails? How do you troubleshoot when you don't have access to the infrastructure? How does the customer even know if the platform is in a degraded state? There are no easy solutions for these problems, but they're not insoluble.

Our tool, gravity status, helps diagnose the most common reasons for cluster failure, reducing time to resolution. It provides fast checks for some common outages that we have seen in the past. Gravity uses our monitoring system, satellite, which constantly checks the parameters of the system, not only during the install but also after the platform has been set up. In addition, Gravity provides integrated alerting.

We offer training for field teams to help them understand Kubernetes and Docker architecture so they can become more efficient during troubleshooting sessions with customers.

Recovering from Failures
Recovering a partially failed system can be harder than setting up a new one, as you don't have fresh hardware to begin with and have to repair the system in place. In the absence of published runbooks, service teams will struggle to provide fast assistance to the customer.

To help, we have published a series of runbooks targeting the most common cluster failure and recovery scenarios. We review the runbooks with customers, breaking clusters and recovering them, so service teams are comfortable providing assistance on the spot.


Monitoring and Troubleshooting Deployments
Actively monitoring a multitude of on-prem deployments is difficult. You may not even have access to the deployments. In order to provide proper support, you need a consistent and scalable way to assess the situation and troubleshoot your deployments when issues arise.

Metrics and Alerts. These two are closely related, as anomalies in metrics trigger alerts. Gravity integrates with the TICK stack and Grafana to create built-in application dashboards and alerts. The Google SRE book has great advice on setting up proper alerting and monitoring in the application. Here are some tips on how to set it up with Gravity:

• Use the TICK stack integration to ship pre-built dashboards.

• Set up built-in alerts using the Kapacitor integration.

• Set up retention policies and rollups for application metrics, or use the ones shipped with Gravity by default.

Logging. The 12-Factor App manifesto provides good guidance on setting up logs as structured event streams. Docker and Kubernetes make it easy to collect logs for every application by capturing everything sent to stdout and stderr. We use a 12-factor setup when deploying applications with Kubernetes, so the logs can be captured later. We can forward logs to the endpoint of the customer's choice using a log forwarder configuration.

Status checks. Application metrics and alerts are great for debugging, but most of the time customers only need an answer to one question: "Is everything up and running?" That's why it's important to provide "self checkers" or "smoke tests." These are programs running in the cluster to make sure everything is in a good state. Once the checkers detect a failure, they make the customer aware that the system is in a degraded state via the UI and alerts.

Our customers write application-specific "smoke test" programs and integrate them with status hooks to give customers a clear, visual notification that the application has been degraded.
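Here is a minimal sketch of such a "smoke test" status check. The health-check endpoints are hypothetical and would be replaced by your application's own probes.

    import sys
    import urllib.request

    # Hypothetical in-cluster endpoints; replace with your application's own checks.
    CHECKS = {
        "api": "http://api.myapp.svc.cluster.local:8080/healthz",
        "web": "http://web.myapp.svc.cluster.local:8080/healthz",
    }

    def check(url, timeout=5):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.status == 200
        except OSError:
            return False

    def main():
        failures = [name for name, url in CHECKS.items() if not check(url)]
        if failures:
            # A non-zero exit code lets a status hook mark the cluster as degraded.
            print("DEGRADED: " + ", ".join(failures))
            sys.exit(1)
        print("OK: all checks passed")

    if __name__ == "__main__":
        main()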

Sending reports. In most cases there is no easy way to access on-prem deployments, so we use our cluster management tooling to take a snapshot of all system logs and metrics and ship it to the development and support teams for inspection.


Setting Up Data Storage
Your customer may not have any block storage available for use. Even if it is available, integrating with it in a timely manner may not be possible. Here are some general storage recommendations:

• You may need to rely on disk storage as the only available storage solution. Use Kubernetes host volumes for local disks and provide clear requirements for using disk storage with Gravity's application manifest.

• If the customer has an external NFS server, provide an integration with NFS endpoints powered by Kubernetes pluggable volumes to connect your application to it.

• Use clustered database deployments that are designed to work on bad hardware, like the Cassandra-powered S3 storage system, Pithos. Avoid unproven and experimental storage systems that are designed to work with Kubernetes. Also avoid systems with a large operational footprint, like Ceph, unless you have an in-house Ceph team to handle the support load.

• For services doing simple metadata storage, consider using custom resources provided by Kubernetes. Custom resources provide a powerful abstraction, generating a versioned, secure API with RBAC, using etcd as the backing storage.

• Avoid deploying risky storage methods unless you have seasoned data storage expertise on the team. As a rule of thumb, try not to experiment with data storage combinations as part of the on-prem release. Make sure that any deployment with mission-critical data is vetted with a storage expert.

• Consider the operational costs of any database. For example, the ELK stack is easy to deploy, but extremely difficult and expensive to manage.

Maintaining Data Integrity
Elastic block storage solutions at cloud providers hide the frequency of data corruption by using software- and hardware-powered data replication strategies. When going on-prem, this often won't be available. As a result, you will encounter data corruption much more often.

We use the Gravity backup subsystem to back up and restore the important application state. We set up alerts to detect the absence of backups for a period of time to make sure they are happening.

Backups should be external, i.e. stored outside of the cluster's storage. This makes it possible to quickly recover the system in case of data corruption. Solutions like ZFS snapshotting on the same disk won't work when the disk itself is corrupted.

We test backup and restore functionality for every release in an automated way with the robotest suite to make sure that backups work as intended. Otherwise, every release can introduce regressions.
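As an illustration of the backup-freshness alerts mentioned above, here is a minimal sketch. The backup directory, the 30-minute threshold (matching the production checklist later in this guide) and the alert delivery are placeholders for your own storage layout and alerting stack.

    import os
    import time

    BACKUP_DIR = "/mnt/external-backups"   # external storage, outside the cluster
    MAX_AGE_SECONDS = 30 * 60              # alert if no backup for 30 minutes

    def latest_backup_age(path=BACKUP_DIR):
        backups = [os.path.join(path, name) for name in os.listdir(path)]
        if not backups:
            return None
        newest = max(os.path.getmtime(p) for p in backups)
        return time.time() - newest

    def send_alert(message):
        # Placeholder: hand this off to your alerting system (email, Kapacitor, etc.)
        print("ALERT:", message)

    def check_backups():
        age = latest_backup_age()
        if age is None or age > MAX_AGE_SECONDS:
            send_alert(f"No recent backup found (age in seconds: {age})")

    if __name__ == "__main__":
        check_backups()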


Accessing Deployments Remotely
Many customers will not allow remote access to their infrastructure. The ones that do will require that robust security measures are in place.

In cases where remote access to the customer's infrastructure is not possible, we rely on automated cluster management tools to get a snapshot view of the customer's infrastructure. In addition, we provide them with training on how to use runbooks to solve problems, with escalation available.

In cases where access is possible, here are some guidelines for dealing with the restrictive requirements you're likely to encounter:

• Be able to segment the time and duration of the access.
• Use role-based access controls to limit access privileges.
• Never open any inbound internet-accessible ports.
• Audit and record every action performed.
• Use second-factor authentication and have the ability to revoke access completely.
• Use approved crypto standards and protocols and turn off weak ciphers.

We built Teleport to meet these requirements. Teleport is fully integrated with Gravity and adds the ability to fine-tune remote access management.


Network Configuration
Kubernetes ships with a specific set of requirements for overlay networks. Customers can face problems, especially with complex network topologies. For example, they could have trouble making custom subnet ranges routable within their data center.

Gravity uses the simplest possible overlay networking for Kubernetes, VXLAN. It encapsulates all traffic in UDP packets. It does not need any special routing and only needs the simplest connectivity between machines. You can read more about VXLAN here.


MANAGING ORGANIZATIONAL ISSUES

Setting Expectations with Customer IT Organizations
In on-prem deployments, the operation of the application is akin to a partnership between the vendor and the customer. Many times, the customer will attempt to solve initial issues before escalating. In addition, it may not be clear whether a problem is due to the customer's infrastructure or the application. Most customers' IT departments have been through frustrating episodes of trying to support other vendors' applications. There are some proactive steps you can take to alleviate their fears of supporting yours.

We highly recommend having a checklist that service teams can go over with the customer. Sharing this information will arm the customer’s IT team responsible for running the application with a clear production roadmap and give them peace of mind when going to production.

The exact content of the checklist will vary significantly depending on the service and access level involved. Here is a sample production checklist that you can use as a starting point:

System access

• The customer has set up external SSO in advance (if it's necessary/available). In practice, many customers delay this step and only contact support when something needs troubleshooting. You want to prevent this by having access set up beforehand.

Backups and alerting

• Platform alerting has been integrated with the customer's alerting system. Most customers have email integration that can trigger alerts for their team. Make sure the integration actually works by triggering an alert.

• Backups have been configured to external devices outside of the cluster. If backups are local, disk corruption will bring down both local data and backups.

• Alerts are sent if no backups have been made for 30 minutes or more. Many customers set up backups and forget about them until a problem occurs. Unfortunately, this is usually too late, as the backup script on the customer side could be broken. We recommend testing by breaking a backup script and waiting for the alert to occur on the customer site.

• Clear procedures for backing up and restoring the application from scratch are in place. Customers will have peace of mind if they know how to completely recover the platform in the worst-case scenario, so it is helpful to review this process with them. This will also reduce the support load, as they can recover the platform themselves when they detect a problem.


Monitoring and troubleshooting

• Logs are being forwarded to the customer's infrastructure logger of choice. Security and ops teams on the customer side will want to capture the most important logs from the application. Make sure they can do that by setting up logging and verifying that logs show up.

• The customer is aware of the simple recovery and troubleshooting runbooks for the cluster. Try out a couple of simple scenarios with the customer, e.g. how to check that everything is running and how to interpret the output of basic commands.

• The customer knows how to check that the application is up and running using a status hook. Show the ops teams where to look first to get application status in the user interface and console commands.

• The customer has been walked through all built-in dashboards and charts and knows how to interpret the health dashboard of the system. Explain the meaning and significance of the built-in monitoring dashboards, e.g. how to read the memory and CPU utilization of the cluster.

Availability expectations

• System resilience expectations have been communicated to the customer (e.g. the system can lose 1 node out of 3). As a rule of thumb, follow the optimal cluster size guidelines from the etcd admin guide. Customers usually don't have a clear understanding of HA concepts, so make sure to communicate that they can only lose 1 node out of a 3-node cluster to keep the system running and recoverable.

• Cases where a restore will be required have been communicated to the customer (e.g. the majority of the servers with a database are lost). As a follow-up to the previous point, make sure customers know when the system has to be recovered from backup, e.g. in case 2 out of 3 nodes are lost on a 3-node cluster. This will help to set the right expectations before the first support call happens.

• Application support and the EOL cycle for every version have been clearly communicated to the customer. Make sure customers know when to upgrade and when releases will no longer be supported. Publish a web or wiki page with the release schedule and EOL dates.


Advanced: Fire Drill Exercises
This is a more advanced, but highly recommended, section of the checklist. Service teams should conduct basic fire-drill exercises with the customer if they are sharing operational responsibility, showing basic failure/recovery scenarios:

• Perform a system reboot and check that the system is up and running. Check the system health after a reboot/OS upgrade.

• Recover a failed hardware node on an n >= 3 node cluster. Make sure the team knows how to remove the faulty node from the cluster and add a replacement.

• Troubleshoot a basic disk/CPU pressure scenario. It is very easy to simulate CPU pressure by running a CPU-consuming process as a container and on the host. Make sure the team can spot the process quickly.

• Show the customer how to troubleshoot basic networking problems by turning firewall rules on and walking through the basic diagnostic tooling output. Make sure the customer can run the simple gravity status tool to see that there is a network problem.

Assuming your on-prem offering is successful, be aware that organizational issues may arise as it matures. The most common is fragmentation into two separate teams, the "cloud team" and the "on-prem team".

We recommend avoiding this if at all possible. A better approach is to train your existing teams to fully migrate to Kubernetes as the only deployment platform supported in the company. Rotate ops and services engineers between cloud and on-prem support on a bi-weekly or monthly schedule. Set up a rule for the teams to use the exact same deployments for on-prem and cloud applications. This will eliminate fragmentation, as the same deployment will be used in the cloud and on-prem. In our experience, this is important for maintaining team morale.


CONCLUSION

In this guide, we have attempted to outline the myriad complexities that companies will encounter when going on-prem. Of course, there's only so much that can be covered in a handbook of this length, but we hope this provides a jumping-off point for discussion around the issues you're most likely to encounter. Should you have more questions, please drop us a line at [email protected].


Oakland, CA – USA | Toronto, Ontario – Canada | Munich, Germany

Tel: (855) 867 2538 | [email protected] | www.gravitational.com

About Gravitational

This guide was created by Gravitational. Gravitational builds open source software solutions to deliver, access and manage cloud-native applications and infrastructure. Teleport is our security gateway for managing privileged access to server infrastructure through SSH and Kubernetes. Gravity is our Kubernetes appliance to package, publish and deliver cloud-native applications across cloud and on-premises environments. You can reach us at [email protected]. If you’re looking to survive on-prem, you can also download the open source community edition of Gravity on our site or request a demo of Gravity Enterprise.