
Observing Enterprise Kubernetes Clusters At Scale

Joe Salisbury (@salisbury_joe)

Product Owner - Internal Platform Team

How do we empower Product teams?


Giant Swarm manages Kubernetes clusters for enterprises


Control plane for managing Kubernetes clusters

All Kubernetes clusters completely managed


Scale

- ~35 people
- 100s of clusters
- 1000s of nodes
- EU, USA, China

Providers

- AWS
- Azure
- On-Prem

Giant Swarm takes care of your infrastructure

You focus on your business value


Fully managed == Responsible for everything

What is Everything?

- Managed Apps
- Kubernetes
- Actual Infrastructure

Responsible for everything == Monitoring for everything

Observing Kubernetes


Monitoring Domains

- Metrics
- Logging
- Tracing

Logging

- EFK stack
- Mainly used for deep debugging after the fact
- Looking at Loki for the future
  - Lighter, Prometheus / Grafana integration

Tracing

- Looking at Jaeger
- Helpful for our API services (request-response)
  - Tip of the iceberg
  - Most likely will kill these in future
- Still researching tracing for operators
  - Async background processing
  - Lots of small traces

Metrics -> Prometheus


Our Prometheus Journey

- Present
- Pains
- Plans

Present

Monitoring is an evolutionary process

[Diagram: Control Plane (API, Operators, Monitoring) and Tenant Clusters (API Server, Kubelets, etc.)]

‘We need to monitor clusters’

- We have a Prometheus server running on the control plane - we can use it to monitor all the tenant clusters!
- This was maybe a good idea at the time

- Dependencies
  - Tenant clusters routable from the control plane
  - Peering / IPAM

[Diagram: the Control Plane VPC uses 10.0.0.0/16 (10.0.0.0 -> 10.0.255.255). Tenant Cluster VPCs each get a /24 mask from 10.1.0.0/16 (10.1.0.0 -> 10.1.255.255): 10.1.0.0/24, 10.1.1.0/24, 10.1.2.0/24.]
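The address plan on this slide can be sketched in a few lines. This is only an illustration, assuming each tenant cluster takes the next free /24 from 10.1.0.0/16 as the diagram suggests; the helper name `allocate_tenant_subnet` and the in-memory `used` set are hypothetical and not Giant Swarm's actual IPAM.

```python
import ipaddress

CONTROL_PLANE = ipaddress.ip_network("10.0.0.0/16")
TENANT_POOL = ipaddress.ip_network("10.1.0.0/16")


def allocate_tenant_subnet(used, prefix=24):
    """Return the first free subnet in the tenant pool (hypothetical helper)."""
    for candidate in TENANT_POOL.subnets(new_prefix=prefix):
        if candidate not in used:
            return candidate
    raise RuntimeError("tenant pool exhausted")


used = set()
for _ in range(3):
    used.add(allocate_tenant_subnet(used))

allocated = sorted(used)
# Peering only works if no tenant range overlaps the control plane range.
assert not any(net.overlaps(CONTROL_PLANE) for net in allocated)
```

Under this scheme a /16 pool holds 256 blocks of size /24, which caps the number of tenant clusters that can be peered with one control plane.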

- Configuration
  - Automatically adding tenant clusters to Prometheus

prometheus-config-controller

- Sidecar for Prometheus
  - Watches for Kubernetes Custom Resources
  - Updates Prometheus ConfigMap
  - Fetches certificates, shares via emptyDir
  - Reloads Prometheus on changes
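The sidecar's loop (render config from cluster resources, reload only on change) might look roughly like the sketch below. It is not the real prometheus-config-controller: the cluster keys `id` and `api_url`, the certificate file names, and the `/certs` mount path are all assumptions made for illustration. The scrape job fields themselves (`job_name`, `scheme`, `tls_config`, `kubernetes_sd_configs`) are standard Prometheus configuration.

```python
import hashlib
import json


def render_scrape_configs(clusters, cert_dir="/certs"):
    """Build one scrape job per tenant cluster.

    `clusters` is a list of dicts with hypothetical keys `id` and `api_url`;
    `cert_dir` stands in for the shared emptyDir mount.
    """
    jobs = []
    for cluster in sorted(clusters, key=lambda c: c["id"]):
        tls = {
            "ca_file": f"{cert_dir}/{cluster['id']}-ca.pem",
            "cert_file": f"{cert_dir}/{cluster['id']}-crt.pem",
            "key_file": f"{cert_dir}/{cluster['id']}-key.pem",
        }
        jobs.append({
            "job_name": f"tenant-{cluster['id']}",
            "scheme": "https",
            "tls_config": tls,
            "kubernetes_sd_configs": [
                {"api_server": cluster["api_url"], "role": "node", "tls_config": tls}
            ],
        })
    return {"scrape_configs": jobs}


def config_digest(config):
    """Stable digest used to decide whether a reload is needed."""
    return hashlib.sha256(json.dumps(config, sort_keys=True).encode()).hexdigest()


def reconcile(clusters, last_digest):
    """Return (config, digest, reload_needed) for the current set of clusters."""
    config = render_scrape_configs(clusters)
    digest = config_digest(config)
    return config, digest, digest != last_digest
```

A real sidecar would then write the rendered config to the ConfigMap and ask Prometheus to reload, for example with a POST to its `/-/reload` endpoint, which is available when Prometheus runs with `--web.enable-lifecycle`.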

[Diagram: prometheus-config-controller and prometheus. Nodes: Clusters (CRs), Certificates (CRs), Prometheus ConfigMap, Certificate Volume. Edges: watches, reads, syncs, reads, reloads.]

[Diagram: Prometheus -> API Server, Kubelets, etc.]

Also add node-exporter, ingress-controllers, CoreDNS, custom exporters, all the control plane services, the kitchen sink...

- Alertmanager & Opsgenie
  - Heartbeats for each installation
  - Always firing alert in Prometheus
  - Special routing to Opsgenie in Alertmanager
  - Heartbeat support in Opsgenie (page if no ping)
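This heartbeat pattern is often called a dead man's switch: the Prometheus side is an alert that always fires (an expression such as `vector(1)`), and the receiving side pages when the pings stop. The sketch below models only the receiving side. The `HeartbeatMonitor` class and the 300-second timeout are made up for illustration and are not Opsgenie's API.

```python
class HeartbeatMonitor:
    """Pages when an installation has not pinged within `timeout` seconds."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_ping = {}

    def ping(self, installation, now):
        """Record a heartbeat received at time `now` (seconds)."""
        self.last_ping[installation] = now

    def overdue(self, now):
        """Return the installations whose last ping is older than the timeout."""
        return sorted(
            name for name, seen in self.last_ping.items()
            if now - seen > self.timeout
        )


monitor = HeartbeatMonitor(timeout=300)
for name in ("installation-1", "installation-2", "installation-3"):
    monitor.ping(name, now=0)

# Installations 1 and 3 keep pinging; installation 2 goes silent.
monitor.ping("installation-1", now=240)
monitor.ping("installation-3", now=240)
down = monitor.overdue(now=400)
```

The useful property is that the check does not depend on the failing installation doing anything: silence itself is the signal.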

[Diagram: Installations 1, 2 and 3, each running prometheus and alertmanager. The heartbeat from Installation 2 stops: "Installation 2 is down, ding ding ding".]

And it works!

- In production for most of 2018, and a fair chunk of 2019 now
- Added more targets, some improvements, but no major architectural changes

Pains

Roll for Initiative

Prometheus Memory Usage

- Number of clusters correlates (ish) with number of series
- Number of series correlates with memory usage

- Currently forced to scale vertically
- Fine for now, but not where we want to be in the future
- We want to enable developers to add tons of metrics
- Trend will only continue

Prometheus v2.9.1 (from v2.6.0)

- Go 1.12!

- Outgrown / outgrowing our initial assumption that customers would run a handful of small tenant clusters

- We can drop metrics we don’t need (e.g. cAdvisor for customer workloads) as needed

- But, not a long-term solution
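Dropping series at scrape time is typically done with `metric_relabel_configs`. The sketch below shows a rule in the same shape as the Prometheus config entry, plus a small function that imitates how a `drop` rule filters series. The `container_.*` pattern is an assumption (cAdvisor metric names use the `container_` prefix); a real rule would also have to single out customer workloads, for instance by namespace.

```python
import re

# Same shape as an entry under `metric_relabel_configs` in a scrape job.
DROP_CADVISOR = {
    "source_labels": ["__name__"],
    "regex": "container_.*",
    "action": "drop",
}


def apply_drop_rule(rule, series):
    """Return the series that survive a single `drop` rule.

    Source label values are joined with ';' and the regex must match the
    whole string, which mirrors how Prometheus anchors relabel regexes.
    """
    pattern = re.compile(rule["regex"])
    kept = []
    for labels in series:
        value = ";".join(labels.get(name, "") for name in rule["source_labels"])
        if rule["action"] == "drop" and pattern.fullmatch(value):
            continue
        kept.append(labels)
    return kept


series = [
    {"__name__": "container_cpu_usage_seconds_total", "namespace": "shop"},
    {"__name__": "kube_pod_info", "namespace": "shop"},
    {"__name__": "up", "job": "kubelet"},
]
remaining = [s["__name__"] for s in apply_drop_rule(DROP_CADVISOR, series)]
```

Here only the cAdvisor series is dropped; `kube_pod_info` and `up` are kept.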

Reliability

- If the Prometheus server goes down, we lose monitoring for all tenant clusters
- We can have a better failure mode
  - e.g. lose monitoring for some percentage of tenant clusters

Querying

- Having separate installations is great most of the time
- Pain in the ass for querying
  - Digging into a global view
  - Have to look at multiple Grafanas
  - Percentage of data we see will decrease over time (human patience is a constant)

Plans

A collection of ideas for the future

Goal for 2019 is to improve the scalability of our metrics infrastructure


Addressing Prometheus Scaling

- If we can’t scale vertically, let’s scale horizontally!
- One Prometheus per tenant cluster (at least)

- prometheus-operator
  - Use building blocks!
- Build a new operator that watches our Cluster CRs, ensures CRs for prometheus-operator
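The "watch Cluster CRs, ensure Prometheus CRs" idea boils down to a reconcile step that compares desired and existing resources. The sketch below is an assumption-laden illustration, not the operator described in the talk: the naming scheme `prometheus-<cluster id>`, the `monitoring` namespace, and the `cluster` label are invented here. The `apiVersion` and `kind` are the ones prometheus-operator uses for its Prometheus custom resource.

```python
def desired_prometheus(cluster_id, namespace="monitoring"):
    """Prometheus custom resource for one tenant cluster (minimal fields)."""
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "Prometheus",
        "metadata": {
            "name": f"prometheus-{cluster_id}",
            "namespace": namespace,
            "labels": {"cluster": cluster_id},
        },
        "spec": {"replicas": 1},
    }


def plan(cluster_ids, existing_names):
    """Compare desired and existing CR names; return what to create and delete."""
    desired = {f"prometheus-{cid}": desired_prometheus(cid) for cid in cluster_ids}
    desired_names = set(desired)
    existing = set(existing_names)
    to_create = [desired[name] for name in sorted(desired_names - existing)]
    to_delete = sorted(existing - desired_names)
    return to_create, to_delete
```

For example, `plan(["a", "b"], {"prometheus-b", "prometheus-old"})` would propose creating `prometheus-a` and deleting `prometheus-old`, leaving the actual running of Prometheus servers to prometheus-operator.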

[Diagram: prometheus-config-operator watches Cluster CRs and ensures Prometheus CRs; prometheus-operator watches Prometheus CRs and ensures Prometheus servers.]

[Diagram: multiple Prometheus servers, one per tenant cluster (API Server, Kubelets, etc.)]

Codify our Prometheus topology in one service


- Provide one feature with one service
- Provide / use building blocks / abstraction layers
- Codify business logic in one operator

- We may need to support multiple Prometheus servers per Kubernetes cluster (for gargantuan clusters)
- We can transition into it
  - e.g. prometheus-config-operator can create multiple Prometheus CRs for one tenant cluster
- Benefit of having topology codified in one operator

- Sharding Prometheus allows us to scale horizontally
- Increases scalability and reliability
  - Can scale control plane horizontally
  - Failure modes are better
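The failure-mode argument can be made concrete with a small sketch. The talk's plan shards by tenant cluster (one Prometheus each); the code below instead uses hash-based assignment of clusters to a fixed number of shards, loosely in the spirit of Prometheus's `hashmod` relabel action, purely to show how losing one shard affects only part of the fleet. The cluster names and the shard count are made up.

```python
import hashlib


def shard_for(cluster_id, shards):
    """Map a cluster to a shard with a stable hash (md5 modulo shard count)."""
    digest = hashlib.md5(cluster_id.encode()).digest()
    return int.from_bytes(digest[-8:], "big") % shards


def assignments(cluster_ids, shards):
    """Group cluster IDs by the shard that would monitor them."""
    groups = {shard: [] for shard in range(shards)}
    for cluster_id in cluster_ids:
        groups[shard_for(cluster_id, shards)].append(cluster_id)
    return groups


clusters = [f"cluster-{i}" for i in range(100)]
groups = assignments(clusters, shards=4)
# If one shard fails, only the clusters in that shard lose monitoring.
blast_radius = {shard: len(members) / len(clusters) for shard, members in groups.items()}
```

With a single Prometheus the blast radius of a failure is 100% of tenant clusters; with shards it is whatever fraction the failed shard was responsible for.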

Global Observability

- Still early days
- Let’s try Cortex!
  - All Prometheus servers use remote write to write to a Cortex backend
  - Use Cortex for global querying (one Grafana to rule them all)
- Keep alerting at installation level
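On the Prometheus side, remote write is a small piece of configuration. The sketch below builds that fragment per installation. The `remote_write` and `external_labels` keys are standard Prometheus configuration, but the Cortex URL and the `installation` label name are placeholders chosen for illustration.

```python
def remote_write_config(installation, cortex_url):
    """Prometheus config fragment for shipping samples to a central backend."""
    return {
        # External labels are attached to every sample sent via remote write,
        # so the backend can tell installations apart.
        "global": {"external_labels": {"installation": installation}},
        "remote_write": [{"url": cortex_url}],
    }


configs = [
    remote_write_config(name, "https://cortex.example.com/api/prom/push")
    for name in ("installation-1", "installation-2")
]
```

A distinguishing external label per installation is what lets a single Grafana query across all of them while alerting stays local.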

Empowerment


What does this help us do in the future?


Giant Swarm builds and operates one product

No custom infrastructure


Feedback loop

- Monitoring to detect
- Postmortems to fix
- Pipeline to deploy

Detect, Fix, Deploy

Learnings from one installation rolled out to all customers


- Monitoring enables this feedback loop
- Improving monitoring improves this feedback loop
- Kind of the point of an internal platform team :D

Good observability is not just reactive

Aim to work proactively


What questions do you have?

Tobias is doing a workshop tomorrow!

Bam!


Thank you!

Joe Salisbury (@salisbury_joe)

- e.g. Adidas reports issue with 95th percentile DNS latency
  - Add alerting for high 95th percentile DNS latency
  - Improve DNS dashboard to better show distribution
  - Update default CoreDNS configuration to mitigate (autopath)
  - Fix lib-musl issue (don’t use the library)
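A 95th percentile latency alert is normally computed from histogram buckets. The sketch below follows the linear-interpolation approach that PromQL's `histogram_quantile` is based on, written in plain Python so the arithmetic is visible. The bucket bounds, counts, and the 0.5-second threshold are invented for the example and are not data from the incident.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative buckets [(upper_bound, count), ...].

    Buckets must be sorted by upper bound and end with (math.inf, total).
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank and count > prev_count:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the last finite bound.
                return prev_bound
            # Interpolate linearly inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical DNS request durations in seconds (cumulative counts).
dns_buckets = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
p95 = histogram_quantile(0.95, dns_buckets)
should_alert = p95 > 0.5  # illustrative threshold
```

With these made-up buckets the 95th percentile lands halfway through the 0.5 to 1.0 second bucket, so the example alert would fire even though half of all requests finish within 0.1 seconds, which is why the dashboard needs to show the distribution and not just an average.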
