Containerizing Databases at New Relic (What We Learned)
Bryant Vinisky and Joshua Galbraith
Santa Clara, California | April 23rd – 25th, 2018

!2

Safe Harbor

This presentation and the information herein (including any information that may be incorporated by reference) is provided for informational purposes only and should not be construed as an offer, commitment, promise or obligation on behalf of New Relic, Inc. (“New Relic”) to sell securities or deliver any product, material, code, functionality, or other feature. Any information provided hereby is proprietary to New Relic and may not be replicated or disclosed without New Relic’s express written permission.

Such information may contain forward-looking statements within the meaning of federal securities laws. Any statement that is not a historical fact or refers to expectations, projections, future plans, objectives, estimates, goals, or other characterizations of future events is a forward-looking statement. These forward-looking statements can often be identified as such because the context of the statement will include words such as “believes,” “anticipates,” “expects” or words of similar import.

Actual results may differ materially from those expressed in these forward-looking statements, which speak only as of the date hereof, and are subject to change at any time without notice. Existing and prospective investors, customers and other third parties transacting business with New Relic are cautioned not to place undue reliance on this forward-looking information. The achievement or success of the matters covered by such forward-looking statements are based on New Relic’s current assumptions, expectations, and beliefs and are subject to substantial risks, uncertainties, assumptions, and changes in circumstances that may cause the actual results, performance, or achievements to differ materially from those expressed or implied in any forward-looking statement. Further information on factors that could affect such forward-looking statements is included in the filings New Relic makes with the SEC from time to time. Copies of these documents may be obtained by visiting New Relic’s Investor Relations website at ir.newrelic.com or the SEC’s website at www.sec.gov.

New Relic assumes no obligation and does not intend to update these forward-looking statements, except as required by law. New Relic makes no warranties, expressed or implied, in this presentation or otherwise, with respect to the information provided.

Introductions

!3

New Relic
• A cloud platform to make every aspect of modern software and infrastructure observable!

Bryant Vinisky
• Senior Site Reliability Engineer
• Database Engineering Team

Joshua Galbraith
• Senior Software Engineer
• Database Engineering Team

Agenda
• Where We Started
• Research
• Prerequisites
• Megabase
• Monitoring
• Lessons Learned
• Outcomes

!4

Where We Started

Databases at New Relic circa 2016

Database Management Issues

!6

📦 Using Puppet for configuration and deployment
🚚 Slow delivery time due to timing of hardware orders
💸 Inefficient hardware use

Why Containers?

!7

• Why not use virtual machines instead?
• Why not use AWS?
• Why not multi-tenant logical databases?

Preparing for the Future

!8

🤔 New Container Fabric for deploying our stateless apps
🚢 Future regions will be entirely container-based!
🛤 Incremental delivery of containerized databases?

Setting Goals

!9

📦 Packaging and Deployment • consistent and repeatable

🚚 Database Delivery Time • from months to minutes

💸 Cost Efficiency • reduce wasted resources

Compressed Timeline

!10

😅 Managing hundreds of existing, busy databases
Need to ship an MVP and gain traction quickly
Dev work is blocked on database delivery time

Research

A Survey of Open-Source Orchestration Tools

!12

Open-Source Container Orchestration

!13

Stateful Containers: Mesos and Marathon

Key Concepts
• Dynamic Provisioning
• Reservation Labels
• Local Persistent Volumes
• External Volumes

!14

Stateful Containers: Kubernetes

Key Concepts
• Stateful Sets
• Pods
• Headless Service
• Persistent Volumes
• Operators

!15

Stateful Containers: Nomad

Key Concepts
• Jobs
• Task Groups
• Allocations
• Sticky Volumes
• Volume Plugins
• FS Drivers

!16

Joyent Blog: Autopilot for Databases

!17

Stateful Containers: Emergent Patterns

Abstract Patterns

• Application-aware orchestration

• Autopilot pattern for lifecycle management

• Storage fabrics and networked storage 😞

!18

Stateful Containers: Current Status

Making Trade-Offs

!19

• Dynamic Scheduling vs. Manual Placement

• Custom Orchestration Framework vs. Client-Server

• Distributed Consensus?
• Local Object Storage?
• Lifecycle Management logic inside or outside Container?

Prerequisites

Dynamic Inventory and/or Service Discovery

!21

Problem: Database Inventory

What systems do we have?

What logical databases?

What team is the owner?

Naming standards?

API and CLI access?

!22

Megabase Prerequisite: Inventory System

Metadata on DB systems and services

Percona containers on Megabase

Golang service HTTP interface to DB

Update on event and scheduled jobs

Seed automation tasks

!23

Inventory System: Database
• Query read-only data from other systems and authorities
• Megabase container deployment information
• Provide basic service discovery

!24

Inventory System: HTTP Interface (JSON)

Query over HTTP

Return JSON
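A minimal sketch of the query-and-return-JSON shape described above. The deck says the real service is written in Go; this Python version is only illustrative, and the record fields, sample data and `lookup` helper are all hypothetical rather than New Relic's schema or API.

```python
import json

# Hypothetical inventory records; the field names are assumptions.
INVENTORY = [
    {"container": "demo-db-1", "host": "demo-megabase-1a",
     "engine": "percona", "team": "db-team"},
    {"container": "demo-cache-1", "host": "demo-megabase-1c",
     "engine": "redis", "team": "db-team"},
]


def lookup(filters):
    """Return the records matching every key/value in `filters`, as JSON text."""
    matches = [
        record for record in INVENTORY
        if all(record.get(key) == value for key, value in filters.items())
    ]
    return json.dumps(matches, sort_keys=True)
```

An HTTP handler would then only need to turn query-string parameters into `filters` and write the returned string as the response body.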

!25

Inventory System: HTTP Interface (text)

!26

Inventory System: Dynamic Lookups

Megabase

A platform for containerized databases at New Relic

!28

Megabase: What We Built

Megabase: Ingredients

!29

Bare Metal:
• Kernel 4.4 and 4.14
• CentOS 7 and CoreOS

Docker: 1.12 and 1.17

Image OS:
• Alpine (Postgres and Redis)
• Debian (Percona)

Data stores:
• Percona-server 5.6
• Percona-server 5.7
• Postgresql-server 9.3, 9.5 and 9.6
• Redis-server 3.2

Golang: 1.9

Megabase: Docker Image Building

!30

Dockerfile
• Upstream base image
• Custom labels
• Package dependencies
• Add binaries:
  - rclone
  - configuration sync
  - replication bootstrap
• Entrypoint script

Megabase: Docker Entrypoint

!31

Tasks
• Validate data mount
• Sync base configs from S3
• Apply dynamic configuration
• No data => replication bootstrap/initdb
• Start server process
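The entrypoint tasks can be sketched as a small planning function. This is an assumption-laden illustration, not the actual entrypoint script: the step names and the `plan_entrypoint` signature are hypothetical, and the function only decides which steps would run rather than running them.

```python
def plan_entrypoint(mount_ok, data_files, peer_available):
    """Return the ordered steps an entrypoint would run (step names are illustrative).

    mount_ok       -- whether the data directory is a valid mounted volume
    data_files     -- names of files already present in the data directory
    peer_available -- whether an existing peer can seed this instance
    """
    if not mount_ok:
        # Refuse to start rather than write data into the container layer.
        raise RuntimeError("data directory is not a mounted volume")

    steps = ["sync_base_configs", "apply_dynamic_config"]
    if not data_files:
        # Empty data directory: seed from a peer if possible, else initialise.
        steps.append("replication_bootstrap" if peer_available else "initdb")
    steps.append("start_server")
    return steps
```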

Megabase: Injecting Configuration

!32

Strategies
• Environment variables
  - via deploy config
• Configuration files
  - via object storage
  - version controlled
• Dynamic computation from cgroup limits
• Docker images
  - via image registry
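One way "dynamic computation from cgroup limits" could look is sketched here. The 70% fraction, the 128 MiB rounding unit and both function names are assumptions for illustration; the file path is the cgroup v1 location and differs on cgroup v2 hosts.

```python
def read_cgroup_memory_limit(path="/sys/fs/cgroup/memory/memory.limit_in_bytes"):
    """Read the container memory limit in bytes (cgroup v1 path)."""
    with open(path) as handle:
        return int(handle.read().strip())


def buffer_pool_bytes(limit_bytes, fraction=0.7, chunk=128 * 1024 * 1024):
    """Size a buffer pool as a fraction of the limit, rounded down to whole chunks.

    `fraction` and `chunk` are illustrative defaults, not values from the talk.
    """
    usable = int(limit_bytes * fraction)
    return (usable // chunk) * chunk
```

Inside a container, `buffer_pool_bytes(read_cgroup_memory_limit())` would give a value that the entrypoint could write into the server configuration before start-up.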

!33

Megabase API Server

• Pre/post deployment dependencies
• Manage container runtime dependencies
• Interface to Docker over HTTPS
• Support custom workflows for database services

!34

Megabase API: TLS Authentication

Authenticate all requests with TLS client certificates
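Requiring a client certificate on every request can be sketched with Python's standard `ssl` module. The Megabase server itself is a Go service, so this only illustrates the idea; `make_server_context` and its parameters are hypothetical names.

```python
import ssl


def make_server_context(cert_file=None, key_file=None, ca_file=None):
    """Build a server-side TLS context that rejects clients without a valid certificate."""
    context = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH)
    # The handshake fails unless the client presents a certificate we can verify.
    context.verify_mode = ssl.CERT_REQUIRED
    if cert_file:
        # The server's own certificate and private key.
        context.load_cert_chain(cert_file, key_file)
    if ca_file:
        # The CA that signs acceptable client certificates.
        context.load_verify_locations(cafile=ca_file)
    return context
```

With a context like this wrapped around the listening socket, an unauthenticated client is turned away during the TLS handshake, before any API handler runs.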

!35

Authentication Failure

!36

Authentication Success

!37

Megabase API: Endpoints

!38

Megabase API: Dependencies

Containers expect and validate a bind mount under the data directory.

Megabase deploy pre-step to docker run:
• Carve off extents for logical volume and make filesystem
• Create systemd unit file for persistence and mount volume
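The pre-step could be sketched as a function that plans the commands and the systemd mount unit without executing anything. The volume group name, the `/data` mount root, the XFS filesystem choice and the function names are all assumptions; the deck does not state them.

```python
def escape_path(path):
    """Derive a systemd unit name stem from a path.

    Only '-' is escaped here; `systemd-escape --path` covers the general case.
    """
    parts = path.strip("/").split("/")
    return "-".join(part.replace("-", "\\x2d") for part in parts)


def volume_plan(name, size_gb, vg="vg_data", mount_root="/data"):
    """Return (commands, unit_name, unit_text) for a new logical volume."""
    device = f"/dev/{vg}/{name}"
    mount_point = f"{mount_root}/{name}"
    commands = [
        ["lvcreate", "-L", f"{size_gb}G", "-n", name, vg],  # carve off extents
        ["mkfs.xfs", device],                               # make the filesystem
    ]
    # A mount unit's file name has to mirror the path it mounts.
    unit_name = escape_path(mount_point) + ".mount"
    unit_text = "\n".join([
        "[Unit]",
        f"Description=Data volume for {name}",
        "",
        "[Mount]",
        f"What={device}",
        f"Where={mount_point}",
        "Type=xfs",
        "",
        "[Install]",
        "WantedBy=multi-user.target",
        "",
    ])
    return commands, unit_name, unit_text
```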

!39

Megabase API: Create Logical Volume

!40

Megabase API: Container Runtime

Reduce client config burden

Offload predictable defaults:
• Resource limits
• Bind mount volume to data directory
• Host networking
• Inject environment secrets

Example docker run on cli =>
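The original slide showed a screenshot at this point. As a stand-in, here is a hedged sketch of how a server might assemble those defaults into a `docker run` argument list. The function name, parameters and example values are hypothetical; the flags themselves (`-d`, `--name`, `--net`, `--memory`, `-v`, `-e`) are standard Docker CLI options.

```python
def docker_run_args(name, image, memory, data_volume, data_dir, env=None):
    """Assemble a `docker run` argument list with the defaults filled in."""
    args = [
        "docker", "run", "-d",
        "--name", name,
        "--net=host",                       # host networking
        "--memory", memory,                 # resource limit
        "-v", f"{data_volume}:{data_dir}",  # bind mount volume to data directory
    ]
    for key, value in sorted((env or {}).items()):
        args += ["-e", f"{key}={value}"]    # injected environment values
    args.append(image)
    return args
```

Returning a list rather than a shell string keeps quoting problems out of the picture if the result is passed to `subprocess.run`.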

!41

Megabase API: Failover Tasks

!42

Megabase API: Xtrabackup Stream to Peer

!43

Megabase… Client?

“So we’re just supposed to type out those nasty, long curl commands to operate everything? Really? And TLS client auth too? Are you trying to be annoying?”

- anonymous New Relic db-team member

mb client

!44

Megabase Client: mb

!45

Megabase Client: mb

mb: Command Line Interface
• wrap up server API functionality into command line interface
• transparently handle TLS authentication
• query state, health and configuration across servers and deployments
• coordinate deployments across servers
• leverage inventory to resolve host and container names

!46

Megabase Client: mb

!47

Megabase Client: Deployment Manifest

Minimum info required:

• Container name

• Image path and tag

• CPU, memory and disk resources

• VIP for load balancer and DNS

• Owning team

Auto select values if absent

Define one or more containers
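A manifest check along these lines might look as follows. The key names, the `containers` wrapper and the optional `port` field are assumptions made for illustration; the deck does not show the real manifest format.

```python
# Keys mirroring the "minimum info required" list; the names are hypothetical.
REQUIRED = ("name", "image", "cpus", "memory_gb", "disk_gb", "vip", "team")


def parse_manifest(raw):
    """Validate a manifest dict that defines one or more containers."""
    specs = []
    for entry in raw["containers"]:
        missing = [key for key in REQUIRED if key not in entry]
        if missing:
            label = entry.get("name", "<unnamed>")
            raise ValueError(f"{label}: missing {', '.join(missing)}")
        spec = dict(entry)
        # None means "auto select at deploy time" in this sketch.
        spec.setdefault("port", None)
        specs.append(spec)
    return specs
```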

!48

Megabase Client: mb Deployment

Process
• Target environment and cluster
• Specify a deployment manifest
• Generate port and passwords
• Send requests to API servers
  - new logical volume
  - run container
• Check responses
• Validate deployment
  - health
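The "generate port and passwords" step could be sketched like this. Because the containers use host networking, each one on a host needs its own port. The port range, password length and function names are assumptions, not details from the talk.

```python
import secrets
import string


def pick_port(used_ports, low=10000, high=20000):
    """Return the lowest port in [low, high) that is not already taken on the host."""
    for port in range(low, high):
        if port not in used_ports:
            return port
    raise RuntimeError("no free port in range")


def generate_password(length=32):
    """Generate a random alphanumeric password from a cryptographic source."""
    alphabet = string.ascii_letters + string.digits
    return "".join(secrets.choice(alphabet) for _ in range(length))
```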

Monitoring

Using New Relic to Monitor New Relic

!50

Monitoring and Alerting

New Relic APM, Infrastructure and Insights
• Hardware/System, Percona, Postgres, Redis, Megabase/Golang
• Connection availability monitoring events to Insights
• Insights dashboards created for deploys

!51

Monitoring: Insights Dashboards MySQL

!52

Monitoring: Insights Dashboards Postgres

!53

Monitoring: Insights Dashboards Redis

!54

Monitoring: Insights NRQL

!55

Monitoring: Service Availability

!56

Monitoring: APM Transactions

!57

Monitoring: Go Runtime

Lessons Learned

Our First Year of Running Databases in Containers

!59

Concern: Host Failure Blast Radius

• Density of database services per host
• Maintenance (still) happens
• Hardware (still) fails
• Time to recover after host failure
• Time spent on failovers

!60

…and then it happened.
• Production host went down
• RAID controller failed
• Several active primary instances affected
• Time to recover was higher than we expected
• Started DRI sprint to address host failures

!61

Fix: Improve Tools for Mass Failover

Tooling updated to support failover:
• All unhealthy database pools: ‘failover pools unhealthy’
• By megabase hostname: ‘failover host demo-megabase-1c’
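The two selection modes can be sketched as a filter over pool records. The record fields, the sample pools and `select_pools` are hypothetical; only the two command forms and the example hostname come from the slide.

```python
# Hypothetical pool records; field names are assumptions.
POOLS = [
    {"name": "pool-a", "primary_host": "demo-megabase-1c", "healthy": True},
    {"name": "pool-b", "primary_host": "demo-megabase-1a", "healthy": False},
    {"name": "pool-c", "primary_host": "demo-megabase-1c", "healthy": False},
]


def select_pools(pools, target, host=None):
    """Pick pools to fail over: every unhealthy pool, or every pool on one host."""
    if target == "unhealthy":
        return [pool["name"] for pool in pools if not pool["healthy"]]
    if target == "host":
        return [pool["name"] for pool in pools if pool["primary_host"] == host]
    raise ValueError(f"unknown failover target: {target}")
```

Selecting by host is what makes a single host failure cheap to handle: one command covers every primary that lived on the failed machine.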

!62

Redis Tuning Memory Limits

Redis deploy for experiment with RDB bgsave enabled

Keyspace sizing:
• docker update used to walk memory limits up to 2GB, 4GB and 16GB
• redis maxmemory limit increased to 1.5GB

!63

Redis PRIMARY Erratic Memory

!64

Kernel OOM on redis-server

!65

Replica Swap Extends RDB Save Time

!66

Fix: Adjust Alert Policies

New Relic NRQL Alerts
• Adjusted swap alert
• Added alert on kernel version mismatch

Outcomes

What We Delivered at New Relic

!68

Keys to Our Success
• Internal team supporting internal customers
• Scoped to meet our specific goals
• Controlled, slow rollout and adoption
• Team autonomy and control of our hardware
• Existing APIs and tools from other teams
• Limiting technologies and versions involved
• Balancing trade-offs
• Minimal container resource limits

Outcomes

!69

📦 Packaging and Deployment • is consistent and repeatable

🚚 Database Delivery Time • takes minutes not months

💰 Cost Efficiency • fewer wasted resources

🌍 New Regions are Easy* • single command mass deploys

!70

Megabase Adoption

!71

References
• https://mesosphere.github.io/marathon/docs/persistent-volumes.html
• https://docs.mesosphere.com/1.11/tutorials/stateful-services/
• https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/
• https://youtu.be/J-Ke0TxGUSg (Kubernetes StatefulSet)
• https://coreos.com/blog/introducing-operators.html
• https://youtu.be/faUQcd5_MUc (Towards Running Stateful Applications on Nomad)
• https://github.com/hashicorp/nomad/issues/150
• https://twitter.com/kelseyhightower/status/963415653930553345
• https://twitter.com/kelseyhightower/status/963418681148502016
• https://www.joyent.com/blog/dbaas-simplicity-no-lock-in
• https://www.joyent.com/blog/persistent-storage-patterns
• https://thenewstack.io/methods-dealing-container-storage/
• https://techcrunch.com/2015/11/21/i-want-to-run-stateful-containers-too/

Thank You!

Your questions are now welcome.

!73
