TRANSCRIPT
Bryant Vinisky and Joshua Galbraith
Santa Clara, California | April 23rd – 25th, 2018
Containerizing Databases at New Relic (What We Learned)
Safe Harbor
This presentation and the information herein (including any information that may be incorporated by reference) is provided for informational purposes only and should not be construed as an offer, commitment, promise or obligation on behalf of New Relic, Inc. (“New Relic”) to sell securities or deliver any product, material, code, functionality, or other feature. Any information provided hereby is proprietary to New Relic and may not be replicated or disclosed without New Relic’s express written permission.
Such information may contain forward-looking statements within the meaning of federal securities laws. Any statement that is not a historical fact or refers to expectations, projections, future plans, objectives, estimates, goals, or other characterizations of future events is a forward-looking statement. These forward-looking statements can often be identified as such because the context of the statement will include words such as “believes,” “anticipates,” “expects” or words of similar import.
Actual results may differ materially from those expressed in these forward-looking statements, which speak only as of the date hereof, and are subject to change at any time without notice. Existing and prospective investors, customers and other third parties transacting business with New Relic are cautioned not to place undue reliance on this forward-looking information. The achievement or success of the matters covered by such forward-looking statements are based on New Relic’s current assumptions, expectations, and beliefs and are subject to substantial risks, uncertainties, assumptions, and changes in circumstances that may cause the actual results, performance, or achievements to differ materially from those expressed or implied in any forward-looking statement. Further information on factors that could affect such forward-looking statements is included in the filings New Relic makes with the SEC from time to time. Copies of these documents may be obtained by visiting New Relic’s Investor Relations website at ir.newrelic.com or the SEC’s website at www.sec.gov.
New Relic assumes no obligation and does not intend to update these forward-looking statements, except as required by law. New Relic makes no warranties, expressed or implied, in this presentation or otherwise, with respect to the information provided.
Introductions
New Relic • A cloud platform to make every aspect of modern software and infrastructure observable!
Bryant Vinisky • Senior Site Reliability Engineer • Database Engineering Team
Joshua Galbraith • Senior Software Engineer • Database Engineering Team
Agenda
• Where We Started
• Research
• Prerequisites
• Megabase
• Monitoring
• Lessons Learned
• Outcomes
Database Management Issues
📦 Using Puppet for configuration and deployment
🚚 Slow delivery time due to timing of hardware orders
💸 Inefficient hardware use
Why Containers?
• Why not use virtual machines instead?
• Why not use AWS?
• Why not multi-tenant logical databases?
Preparing for the Future
🤔 New Container Fabric for deploying our stateless apps
🚢 Future regions will be entirely container-based!
🛤 Incremental delivery of containerized databases?
Setting Goals
📦 Packaging and Deployment • consistent and repeatable
🚚 Database Delivery Time • from months to minutes
💸 Cost Efficiency • reduce wasted resources
Compressed Timeline
😅 Managing hundreds of existing, busy databases
(Need to ship an MVP and gain traction quickly)
Dev work is blocked on database delivery time
Stateful Containers: Mesos and Marathon
Key Concepts
• Dynamic Provisioning
• Reservation Labels
• Local Persistent Volumes
• External Volumes
Stateful Containers: Kubernetes
Key Concepts
• Stateful Sets
• Pods
• Headless Service
• Persistent Volumes
• Operators
Stateful Containers: Nomad
Key Concepts
• Jobs
• Task Groups
• Allocations
• Sticky Volumes
• Volume Plugins
• FS Drivers
Stateful Containers: Emergent Patterns
Abstract Patterns
• Application-aware orchestration
• Autopilot pattern for lifecycle management
• Storage fabrics and networked storage 😞
Making Trade-Offs
• Dynamic Scheduling vs. Manual Placement
• Custom Orchestration Framework vs. Client-Server
• Distributed Consensus?
• Local Object Storage?
• Lifecycle Management logic inside or outside Container?
Problem: Database Inventory
What systems do we have?
What logical databases?
What team is the owner?
Naming standards?
API and CLI access?
Megabase Prerequisite: Inventory System
Metadata on DB systems and services
Percona containers on Megabase
Golang service HTTP interface to DB
Update on event and scheduled jobs
Seed automation tasks
Inventory System: Database
• Query read only data from other systems and authorities
• Megabase container deployment information
• Provide basic service discovery
Megabase: Ingredients
Bare Metal:
• Kernel 4.4 and 4.14
• CentOS 7 and CoreOS
Docker: 1.12 and 1.17
Image OS:
• Alpine (Postgres and Redis)
• Debian (Percona)
Data stores:
• Percona-server 5.6
• Percona-server 5.7
• Postgresql-server 9.3, 9.5 and 9.6
• Redis-server 3.2
Golang: 1.9
Megabase: Docker Image Building
Dockerfile
• Upstream base image
• Custom labels
• Package dependencies
• Add binaries:
  - rclone
  - configuration sync
  - replication bootstrap
• Entrypoint script
Megabase: Docker Entrypoint
Tasks
• Validate data mount
• Sync base configs from S3
• Apply dynamic configuration
• No data => replication bootstrap/initdb
• Start server process
Megabase: Injecting Configuration
Strategies
• Environment variables
  - via deploy config
• Configuration files
  - via object storage
  - version controlled
• Dynamic computation from cgroup limits
• Docker images
  - via image registry
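As one illustration of "dynamic computation from cgroup limits", the Go sketch below reads the container's cgroup v1 memory limit and derives a buffer pool size from it. The 70% share, the function names, and the choice of `innodb_buffer_pool_size` as the tuned setting are assumptions for the example; the slides do not say which settings were computed or how. On a cgroup v2 host the file path differs, and a container without a limit reports a very large number here.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// cgroup v1 memory limit file as seen from inside a container.
const memLimitPath = "/sys/fs/cgroup/memory/memory.limit_in_bytes"

// parseLimit converts the file contents (a decimal byte count plus newline).
func parseLimit(raw string) (uint64, error) {
	return strconv.ParseUint(strings.TrimSpace(raw), 10, 64)
}

// bufferPoolBytes returns roughly percent% of the limit, rounded down to a whole MiB.
func bufferPoolBytes(limit, percent uint64) uint64 {
	const mib = 1 << 20
	return limit / 100 * percent / mib * mib
}

func main() {
	raw, err := os.ReadFile(memLimitPath)
	if err != nil {
		fmt.Fprintln(os.Stderr, "no cgroup v1 memory limit found:", err)
		return
	}
	limit, err := parseLimit(string(raw))
	if err != nil {
		fmt.Fprintln(os.Stderr, "unexpected limit format:", err)
		return
	}
	// 70 is an illustrative share, not a value taken from the talk.
	fmt.Printf("innodb_buffer_pool_size = %d\n", bufferPoolBytes(limit, 70))
}
```

Deriving the value at container start means the same image can run under different resource limits without a per-size configuration file.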
Megabase API Server
Pre/Post deployment dependencies
Manage container runtime dependencies
Interface to docker over https
Support custom workflows for database services
megabase server
Megabase API: Dependencies
Containers expect and validate bind mount under data directory
Megabase deploy pre-step to docker run:
• Carve off extents for logical volume and make filesystem
• Create systemd unit file for persistence and mount volume
Megabase API: Container Runtime
Reduce client config burden
Offload predictable defaults:
• Resource Limits
• Bind mount volume to data directory
• Host networking
• Inject environment secrets
Example docker run on cli =>
Megabase… Client?
“So we’re just supposed to type out those nasty, long curl commands to operate everything? Really? And TLS client auth too? Are you trying to be annoying?”
- anonymous New Relic db-team member
mb client
Megabase Client: mb
mb: Command Line Interface
• wrap up server API functionality into command line interface
• transparently handle TLS authentication
• query state, health and configuration across servers and deployments
• coordinate deployments across servers
• leverage inventory to resolve host and container names
Megabase Client: Deployment Manifest
Minimum info required:
• Container name
• Image path and tag
• CPU, memory and disk resources
• VIP for load balancer and DNS
• Owning team
Auto select values if absent
Define one or more containers
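A manifest along these lines might be parsed and validated as in the Go sketch below. The JSON format, every field name, and the default values are assumptions made for illustration; the talk does not show the real manifest syntax or say which values are auto-selected. Here name, image, and owning team are treated as mandatory and resource sizes are defaulted when absent.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// container describes one database container; all field names are hypothetical.
type container struct {
	Name     string `json:"name"`
	Image    string `json:"image"` // image path and tag
	CPUs     int    `json:"cpus"`
	MemoryGB int    `json:"memory_gb"`
	DiskGB   int    `json:"disk_gb"`
	VIP      string `json:"vip"`
	Team     string `json:"team"`
}

// manifest defines one or more containers to deploy together.
type manifest struct {
	Containers []container `json:"containers"`
}

// parseManifest decodes a manifest, rejects entries missing required
// fields, and fills in illustrative defaults for absent resource sizes.
func parseManifest(data []byte) (manifest, error) {
	var m manifest
	if err := json.Unmarshal(data, &m); err != nil {
		return m, err
	}
	if len(m.Containers) == 0 {
		return m, errors.New("manifest defines no containers")
	}
	for i := range m.Containers {
		c := &m.Containers[i]
		if c.Name == "" || c.Image == "" || c.Team == "" {
			return m, fmt.Errorf("container %d: name, image and team are required", i)
		}
		if c.CPUs == 0 {
			c.CPUs = 2 // assumed default
		}
		if c.MemoryGB == 0 {
			c.MemoryGB = 4 // assumed default
		}
		if c.DiskGB == 0 {
			c.DiskGB = 50 // assumed default
		}
	}
	return m, nil
}

func main() {
	raw := []byte(`{"containers":[{"name":"example-db","image":"percona:5.7","team":"db-team"}]}`)
	m, err := parseManifest(raw)
	if err != nil {
		fmt.Println("invalid manifest:", err)
		return
	}
	fmt.Printf("%+v\n", m.Containers[0])
}
```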
Megabase Client: mb Deployment
Process
• Target environment and cluster
• Specify a deployment manifest
• Generate port and passwords
• Send requests to API servers
  - new logical volume
  - run container
• Check responses
• Validate deployment
  - health
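The "generate port and passwords" step could look roughly like the Go sketch below. Because the containers use host networking, each database on a host needs its own port, so the sketch picks the first port in a range that is not already taken. Both helper functions, the port range, and the password length are hypothetical; how `mb` actually allocates ports and credentials is not described in the slides.

```go
package main

import (
	"crypto/rand"
	"encoding/base64"
	"errors"
	"fmt"
)

// newPassword returns a URL-safe random string built from nBytes of
// cryptographically secure randomness.
func newPassword(nBytes int) (string, error) {
	buf := make([]byte, nBytes)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	return base64.RawURLEncoding.EncodeToString(buf), nil
}

// pickPort returns the lowest port in [lo, hi] that is not marked as used.
func pickPort(lo, hi int, used map[int]bool) (int, error) {
	for p := lo; p <= hi; p++ {
		if !used[p] {
			return p, nil
		}
	}
	return 0, errors.New("no free port in range")
}

func main() {
	used := map[int]bool{3306: true, 3307: true} // ports already assigned on this host
	port, err := pickPort(3306, 3399, used)
	if err != nil {
		fmt.Println(err)
		return
	}
	pw, err := newPassword(24)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("port=%d password length=%d\n", port, len(pw))
}
```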
Monitoring and Alerting
New Relic APM, Infrastructure and Insights
• Hardware/System, Percona, Postgres, Redis, Megabase/Golang
• Connection availability monitoring events to Insights
• Insights dashboards created for deploys
Concern: Host Failure Blast Radius
Density of database services per host
Maintenance (still) happens
Hardware (still) fails
Time to recover after host failure
Time spent on failovers
…and then it happened.
Production host went down
RAID controller failed
Several active primary instances affected
Time to recover was higher than we expected
Started DRI sprint to address host failures
Fix: Improve Tools for Mass Failover
Tooling updated to support failover:
• All unhealthy database pools: ‘failover pools unhealthy’
• By megabase hostname: ‘failover host demo-megabase-1c’
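The two selection modes can be pictured as filters over an inventory of pools. The Go sketch below is purely illustrative: the `pool` type, its fields, and the sample data are hypothetical, and it only prints which pools would be chosen rather than performing a failover.

```go
package main

import "fmt"

// pool is a hypothetical inventory record for one database pool.
type pool struct {
	Name        string
	PrimaryHost string // megabase host running the active primary
	Healthy     bool
}

// selectPools returns the pools for which keep reports true.
func selectPools(pools []pool, keep func(pool) bool) []pool {
	var out []pool
	for _, p := range pools {
		if keep(p) {
			out = append(out, p)
		}
	}
	return out
}

// unhealthy matches every pool currently marked unhealthy.
func unhealthy(p pool) bool { return !p.Healthy }

// onHost matches pools whose primary runs on the given megabase host.
func onHost(host string) func(pool) bool {
	return func(p pool) bool { return p.PrimaryHost == host }
}

func main() {
	pools := []pool{
		{Name: "accounts", PrimaryHost: "demo-megabase-1c", Healthy: true},
		{Name: "billing", PrimaryHost: "demo-megabase-1a", Healthy: false},
		{Name: "events", PrimaryHost: "demo-megabase-1c", Healthy: false},
	}
	for _, p := range selectPools(pools, onHost("demo-megabase-1c")) {
		fmt.Println("would fail over (by host):", p.Name)
	}
	for _, p := range selectPools(pools, unhealthy) {
		fmt.Println("would fail over (unhealthy):", p.Name)
	}
}
```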
Redis Tuning Memory Limits
Redis deploy for experiment with RDB bgsave enabled
Keyspace sizing:
• docker update used to walk memory limits up to 2GB, 4GB and 16GB
• redis maxmemory limit increase to 1.5GB
Fix: Adjust Alert Policies
New Relic NRQL Alerts
• Adjusted swap alert
• Added alert on kernel version mismatch
Keys to Our Success
• Internal team supporting internal customers
• Scoped to meet our specific goals
• Controlled slow roll out and adoption
• Team autonomy and control of our hardware
• Existing APIs and tools from other teams
• Limiting technologies and versions involved
• Balancing trade-offs
• Minimal container resource limits
Outcomes
📦 Packaging and Deployment • is consistent and repeatable
🚚 Database Delivery Time • takes minutes not months
💰 Cost Efficiency • fewer wasted resources
🌍 New Regions are Easy* • single command mass deploys
References
• https://mesosphere.github.io/marathon/docs/persistent-volumes.html
• https://docs.mesosphere.com/1.11/tutorials/stateful-services/
• https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/
• https://youtu.be/J-Ke0TxGUSg (Kubernetes StatefulSet)
• https://coreos.com/blog/introducing-operators.html
• https://youtu.be/faUQcd5_MUc (Towards Running Stateful Applications on Nomad)
• https://github.com/hashicorp/nomad/issues/150
• https://twitter.com/kelseyhightower/status/963415653930553345
• https://twitter.com/kelseyhightower/status/963418681148502016
• https://www.joyent.com/blog/dbaas-simplicity-no-lock-in
• https://www.joyent.com/blog/persistent-storage-patterns
• https://thenewstack.io/methods-dealing-container-storage/
• https://techcrunch.com/2015/11/21/i-want-to-run-stateful-containers-too/