Observing Enterprise Kubernetes Clusters At Scale Joe Salisbury @salisbury_joe

Post on 09-Jun-2020


Page 1: Observing Enterprise Kubernetes Clusters At Scalecontinuouslifecycle.london/wp-content/uploads/2019/01/Joe-Salisbur… · All Kubernetes clusters completely managed 4 - ~35 people

Observing Enterprise Kubernetes Clusters At Scale

Joe Salisbury @salisbury_joe

Product Owner - Internal Platform Team

How do we empower Product teams?

2

Giant Swarm manages Kubernetes clusters for enterprises

3

Control plane for managing Kubernetes clusters

All Kubernetes clusters completely managed

4

Scale

- ~35 people
- 100s of Clusters
- 1000s of Nodes
- EU, USA, China

5

Providers

- AWS
- Azure
- On-Prem

6

Giant Swarm takes care of your infrastructure

You focus on your business value

7

Fully managed == Responsible for everything

8

What is Everything?

- Managed Apps
- Kubernetes
- Actual Infrastructure

9

Responsible for everything == Monitoring for everything

10

Observing Kubernetes

11

Monitoring Domains

- Metrics
- Logging
- Tracing

12

Logging

- EFK stack
- Mainly used for deep debugging after the fact
- Looking at Loki for the future
  - Lighter, Prometheus / Grafana integration

13

Tracing

- Looking at Jaeger
- Helpful for our API services (request-response)
  - Tip of the iceberg
  - Most likely will kill these in future
- Still researching tracing for operators
  - Async background processing
  - Lots of small traces

14

Metrics -> Prometheus

15

Our Prometheus Journey

- Present
- Pains
- Plans

16

Present

Monitoring is an evolutionary process

17

18

[Diagram: the control plane (API, Operators, Monitoring) next to three tenant clusters, each running an API Server, Kubelets, etc.]

‘We need to monitor clusters’

- We have a Prometheus server running on the control plane - we can use it to monitor all the tenant clusters!
- This was maybe a good idea at the time

19

- Dependencies
  - Tenant clusters routable from the control plane
  - Peering / IPAM

20

21

[Diagram: VPC address layout]

- Control Plane VPC: 10.0.0.0/16 (10.0.0.0 -> 10.0.255.255)
- Tenant clusters: 10.1.0.0/16 (10.1.0.0 -> 10.1.255.255), /24 mask
- Tenant Cluster VPCs: 10.1.0.0/24, 10.1.1.0/24, 10.1.2.0/24
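The address layout on this slide can be reproduced with Python's standard `ipaddress` module. This is a minimal sketch, assuming that each tenant cluster VPC simply takes the next /24 out of 10.1.0.0/16; the helper name `tenant_subnets` is hypothetical and not taken from the talk.

```python
import ipaddress
from itertools import islice

# Ranges as shown on the slide.
CONTROL_PLANE = ipaddress.ip_network("10.0.0.0/16")
TENANT_RANGE = ipaddress.ip_network("10.1.0.0/16")


def tenant_subnets(count, prefix=24):
    """Return the first `count` tenant subnets of the given prefix length.

    Assumption: subnets are handed out in ascending order.
    """
    return list(islice(TENANT_RANGE.subnets(new_prefix=prefix), count))


if __name__ == "__main__":
    for net in tenant_subnets(3):
        # Peering only works if tenant ranges never overlap the control plane.
        print(net, "overlaps control plane:", net.overlaps(CONTROL_PLANE))
```

With a /24 per cluster, a /16 tenant range has room for 256 tenant clusters, which is the kind of limit the IPAM dependency introduces.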

- Configuration
  - Automatically adding tenant clusters to Prometheus

22

prometheus-config-controller

- Sidecar for Prometheus
  - Watches for Kubernetes Custom Resources
  - Updates Prometheus ConfigMap
  - Fetches certificates, shares via emptyDir
  - Reloads Prometheus on changes
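The sync-and-reload loop described above could look roughly like this. It is a sketch, not the actual prometheus-config-controller: the cluster records, the certificate paths and the function names are assumptions made for illustration. The dictionary keys (`job_name`, `scheme`, `tls_config`, `static_configs`) are standard Prometheus scrape configuration fields.

```python
def scrape_config(cluster_id, api_endpoint, cert_dir="/certs"):
    """Build one scrape job for a tenant cluster's API server.

    The certificate file layout under `cert_dir` is hypothetical; the slide
    only says certificates are shared via an emptyDir volume.
    """
    return {
        "job_name": f"tenant-{cluster_id}-apiserver",
        "scheme": "https",
        "tls_config": {
            "ca_file": f"{cert_dir}/{cluster_id}/ca.pem",
            "cert_file": f"{cert_dir}/{cluster_id}/crt.pem",
            "key_file": f"{cert_dir}/{cluster_id}/key.pem",
        },
        "static_configs": [
            {"targets": [api_endpoint], "labels": {"cluster_id": cluster_id}}
        ],
    }


def sync(clusters, current_config, reload):
    """Compute the desired config and call `reload` only if it changed.

    `reload` is a callback; in a real sidecar it might POST to Prometheus'
    /-/reload endpoint (available with --web.enable-lifecycle) or send SIGHUP.
    """
    ordered = sorted(clusters, key=lambda c: c["id"])
    desired = {
        "scrape_configs": [scrape_config(c["id"], c["endpoint"]) for c in ordered]
    }
    if desired != current_config:
        reload()
    return desired
```

Comparing the desired config with the current one before reloading keeps the sidecar idempotent: repeated events for an unchanged cluster do not trigger reloads.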

23

24

[Diagram: prometheus-config-controller, Clusters and Chartconfig CRs, Certificates, the Prometheus ConfigMap, a Certificate Volume, and prometheus, connected by edges labelled watches, reads, syncs, and reloads.]

25

[Diagram: Prometheus on the control plane monitoring three tenant clusters, each running an API Server, Kubelets, etc.]

26

also add node-exporter, ingress-controllers, coredns, custom exporters, all the control plane services, the kitchen sink...

- AlertManager & OpsGenie
  - Heartbeats for each installation
  - Always firing alert in Prometheus
  - Special routing to OpsGenie in AlertManager
  - Heartbeat support in OpsGenie (page if no ping)
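The heartbeat pattern is a dead man's switch: the always-firing alert keeps pinging, and silence is what triggers the page. The sketch below models only the receiving side, with a hypothetical `HeartbeatMonitor` class and made-up installation names; it does not use or imitate the OpsGenie API.

```python
class HeartbeatMonitor:
    """Tracks the last ping per installation and reports the ones gone quiet."""

    def __init__(self, timeout_seconds):
        self.timeout = timeout_seconds
        self.last_ping = {}

    def ping(self, installation, now):
        """Record a heartbeat received at time `now` (seconds)."""
        self.last_ping[installation] = now

    def expired(self, now):
        """Return installations whose last ping is older than the timeout."""
        return sorted(
            name
            for name, seen in self.last_ping.items()
            if now - seen > self.timeout
        )


if __name__ == "__main__":
    monitor = HeartbeatMonitor(timeout_seconds=300)
    for name in ("installation-1", "installation-2", "installation-3"):
        monitor.ping(name, now=0)
    # Installation 2 stops sending heartbeats; the others keep pinging.
    for t in (240, 480):
        monitor.ping("installation-1", now=t)
        monitor.ping("installation-3", now=t)
    print(monitor.expired(now=500))
```

The useful property is that a broken Prometheus, a broken Alertmanager, and a broken network path all look the same to the receiver: no ping, so someone gets paged.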

27

28

[Diagram: Installations 1, 2 and 3, each with prometheus and alertmanager. Installation 2's prometheus is missing: "Installation 2 is down, ding ding ding".]

And it works!

- In production for most of 2018, and a fair chunk of 2019 now
- Added more targets, some improvements, but no major architectural changes

29

Pains

Roll for Initiative

30

Prometheus Memory Usage

- Number of clusters correlates (ish) with number of series
- Number of series correlates with memory usage

31

- Currently forced to scale vertically
  - Fine for now, but not where we want to be in the future
- We want to enable developers to add tons of metrics
  - Trend will only continue

32

Prometheus v2.9.1 (from v2.6.0)

- Go 1.12!

33

- Outgrown / outgrowing our initial assumption that customers would run a handful of small tenant clusters

- We can drop metrics we don’t need (e.g. cAdvisor metrics for customer workloads)

- But, not a long term solution
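Dropping unneeded series is usually done in Prometheus with a `drop` rule under `metric_relabel_configs`. The sketch below builds such a rule as a Python dictionary and simulates what it would do to a few samples. The regex and the sample metric names are illustrative assumptions, and for simplicity the rule drops every `container_*` series rather than only those from customer workloads. Prometheus uses RE2 syntax with fully anchored matching, which `re.fullmatch` approximates for simple patterns like this one.

```python
import re


def drop_rule(name_regex):
    """A relabel rule that drops series whose metric name matches the regex."""
    return {"source_labels": ["__name__"], "regex": name_regex, "action": "drop"}


def apply_drop(rule, samples):
    """Simulate a `drop` relabel rule on a list of label dictionaries."""
    pattern = re.compile(rule["regex"])
    kept = []
    for labels in samples:
        # Prometheus joins the source label values with ";" by default.
        value = ";".join(labels.get(name, "") for name in rule["source_labels"])
        if rule["action"] == "drop" and pattern.fullmatch(value):
            continue
        kept.append(labels)
    return kept


if __name__ == "__main__":
    samples = [
        {"__name__": "container_cpu_usage_seconds_total", "pod": "shop-1"},
        {"__name__": "apiserver_request_total", "code": "200"},
    ]
    print(apply_drop(drop_rule("container_.*"), samples))
```

Because the rule runs at scrape time, dropped series never reach storage, which is why it relieves memory pressure but does nothing for the underlying growth trend.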

34

Reliability

- If the Prometheus server goes down, we lose monitoring for all tenant clusters
- We can have a better failure mode
  - e.g. lose monitoring for some percentage of tenant clusters

35

Querying

- Having separate installations is great most of the time
- Pain in the ass for querying
  - Digging into a global view
  - Have to look at multiple Grafanas
  - Percentage of data we see will decrease over time (human patience is a constant)

36

Plans

A collection of ideas for the future

37

Goal for 2019 is to improve the scalability of our metrics infrastructure

38

Addressing Prometheus Scaling

- If we can’t scale vertically, let’s scale horizontally!
- One Prometheus per tenant cluster (at least)

39

- prometheus-operator
  - Use building blocks!
- Build a new operator that watches our Cluster CRs, ensures CRs for prometheus-operator
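The "watches Cluster CRs, ensures Prometheus CRs" idea boils down to a reconcile step that compares desired and existing state. Below is a rough sketch under stated assumptions: the naming scheme `prometheus-<cluster id>`, the `monitoring` namespace and both function names are hypothetical. The `apiVersion`, `kind`, `replicas` and `externalLabels` fields come from the prometheus-operator `Prometheus` custom resource.

```python
def prometheus_cr(cluster_id, namespace="monitoring"):
    """Build a minimal Prometheus custom resource for one tenant cluster."""
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "Prometheus",
        "metadata": {
            "name": f"prometheus-{cluster_id}",
            "namespace": namespace,
            "labels": {"cluster": cluster_id},
        },
        "spec": {
            "replicas": 1,
            "externalLabels": {"cluster_id": cluster_id},
        },
    }


def reconcile(cluster_ids, existing_names):
    """Return (CRs to create, CR names to delete) for the given clusters."""
    desired = {}
    for cluster_id in cluster_ids:
        cr = prometheus_cr(cluster_id)
        desired[cr["metadata"]["name"]] = cr
    existing = set(existing_names)
    to_create = [desired[name] for name in sorted(set(desired) - existing)]
    to_delete = sorted(existing - set(desired))
    return to_create, to_delete
```

The new operator never talks to Prometheus directly; it only writes CRs and leaves the actual deployment to prometheus-operator, which is the "building blocks" point on the slide.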

40

41

[Diagram: prometheus-config-operator watches Cluster CRs and ensures Prometheus CRs; prometheus-operator watches Prometheus CRs and ensures the Prometheus servers.]

42

[Diagram: the control plane now runs several Prometheus servers, monitoring the three tenant clusters (API Server, Kubelets, etc.).]

Codify our Prometheus topology in one service

43

- Provide one feature with one service
- Provide / use building blocks / abstraction layers
- Codify business logic in one operator

44

- We may need to support multiple Prometheus servers per Kubernetes cluster (for gargantuan clusters)
  - We can transition into it
  - e.g. prometheus-config-operator can create multiple Prometheus CRs for one tenant cluster
- Benefit of having topology codified in one operator
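One way to split a very large cluster across several Prometheus servers is to assign each scrape target to a shard by hashing its address. The sketch below is similar in spirit to Prometheus' `hashmod` relabel action, but it is not claimed to reproduce its exact hashing; the function names and target addresses are hypothetical.

```python
import hashlib


def shard_for(target, shard_count):
    """Map a target address to a shard index with a stable hash."""
    digest = hashlib.md5(target.encode("utf-8")).digest()
    return int.from_bytes(digest[-8:], "big") % shard_count


def assign(targets, shard_count):
    """Group targets into `shard_count` lists, one per Prometheus server."""
    shards = [[] for _ in range(shard_count)]
    for target in sorted(targets):
        shards[shard_for(target, shard_count)].append(target)
    return shards


if __name__ == "__main__":
    nodes = [f"10.1.0.{i}:10250" for i in range(1, 9)]
    for index, members in enumerate(assign(nodes, 2)):
        print(index, members)
```

A stable hash means a target always lands on the same server as long as the shard count is unchanged; changing the count reshuffles targets, which is one reason to keep that decision inside a single operator.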

45

- Sharding Prometheus allows us to scale horizontally
  - Increases scalability and reliability
  - Can scale control plane horizontally
  - Failure modes are better

46

Global Observability

- Still early days
- Let’s try Cortex!
  - All Prometheus servers use remote write to write to a Cortex backend
  - Use Cortex for global querying (one Grafana to rule them all)
- Keep alerting at installation level
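For the remote write idea, each Prometheus server needs a `remote_write` entry plus an external label so series from different installations stay distinguishable in the shared backend. This is a sketch only: the label name `installation`, the helper name and the URL are placeholders, not values from the talk. `global.external_labels` and `remote_write[].url` are standard Prometheus configuration fields.

```python
def with_remote_write(config, installation, backend_url):
    """Return a copy of a Prometheus config that also ships to a remote backend.

    The input dictionary is left unmodified.
    """
    updated = dict(config)
    global_section = dict(updated.get("global", {}))
    labels = dict(global_section.get("external_labels", {}))
    # Assumed label name; it identifies the source installation globally.
    labels["installation"] = installation
    global_section["external_labels"] = labels
    updated["global"] = global_section
    updated["remote_write"] = list(updated.get("remote_write", [])) + [
        {"url": backend_url}
    ]
    return updated


if __name__ == "__main__":
    base = {"global": {"scrape_interval": "30s"}}
    # Placeholder URL; the real Cortex endpoint depends on the deployment.
    print(with_remote_write(base, "installation-1", "https://cortex.example.invalid/push"))
```

Local rule evaluation is untouched by this change, which fits the plan to keep alerting at the installation level while querying globally.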

47

Empowerment

48

What does this help us do in the future?

49

Giant Swarm builds and operates one product

No custom infrastructure

50

Feedback loop

- Monitoring to detect
- Postmortems to fix
- Pipeline to deploy

Detect, Fix, Deploy

51

Learnings from one installation rolled out to all customers

52

- Monitoring enables this feedback loop
- Improving monitoring improves this feedback loop
  - Kind of the point of an internal platform team :D

53

Good observability is not just reactive

Aim to work proactively

54

What questions do you have?

Tobias is doing a workshop tomorrow!

Bam!

55

Thank you!

Joe Salisbury @salisbury_joe

- e.g. Adidas reports issue with 95th percentile DNS latency
  - Add alerting for high 95th percentile DNS latency
  - Improve DNS dashboard to better show distribution
  - Update default CoreDNS configuration to mitigate (autopath)
  - Fix lib-musl issue (don’t use the library)

57
