Kubernetes the Very Hard Way - O'Reilly · 2019. 11. 12.
Kubernetes the Very Hard Way
Laurent Bernaille
Staff Engineer, Infrastructure
@lbernail
Datadog
● Over 350 integrations
● Over 1,200 employees
● Over 8,000 customers
● Runs on millions of hosts
● Trillions of data points per day
● 10,000s of hosts in our infra
● Tens of k8s clusters with 50-2500 nodes
● Multi-cloud
● Very fast growth
Why Kubernetes?
● Dogfooding: improve k8s integrations
● Immutable: move from Chef
● Multi Cloud: common API
● Community: large and dynamic
The very hard way?
It was much harder
This talk is about the fine print
“Of course, you will need a HA master setup”
“Oh, and yes, you will have to manage your certificates”
“By the way, networking is slightly more complicated, look into CNI / ingress controllers”
What happens after “Kube 101”
1. Resilient and Scalable Control Plane
2. Securing the Control Plane
   a. Kubernetes and Certificates
   b. Exceptions?
   c. Impact of Certificate Rotation
3. Efficient networking
   a. Giving pod IPs and routing them
   b. Ingresses: Getting data in the cluster
Resilient and Scalable Control Plane
Kube 101 Control Plane
[Diagram: a single Master runs etcd, apiserver, controllers and scheduler; kubelet, kubectl and in-cluster apps (through the Service) talk to the apiserver]
Making it resilient
[Diagram: three Masters, each running etcd, apiserver, controllers and scheduler, behind a LoadBalancer used by kubelet, kubectl and in-cluster apps (through the Service)]
Separate etcd nodes
[Diagram: three Masters (apiserver, controllers, scheduler) behind the LoadBalancer used by kubelet, kubectl and in-cluster apps (through the Service); etcd runs on its own nodes]
Single active Controller/scheduler
[Diagram: three Masters (apiserver, controllers, scheduler) behind the LoadBalancer, with separate etcd nodes; only one controllers/scheduler instance is active at a time]
Split scheduler/controllers
[Diagram: three apiservers behind the LoadBalancer used by kubelet, kubectl and in-cluster apps (through the Service); controllers and schedulers run on their own nodes; separate etcd]
Split etcd
[Diagram: three apiservers behind the LoadBalancer; dedicated controllers and schedulers; etcd split into two clusters, "etcd" and "etcd events"]
Sizing the control plane
● etcd and etcd events: 2x (3 or 5 nodes), key resources: disk + net I/O
● apiservers: X nodes, key resources: RAM + net I/O
● controllers: 2 nodes, key resource: CPU
● schedulers: 2 nodes, key resource: CPU
Kubernetes and Certificates
From “the hard way”
“Our cluster broke after ~1y”
Certificates in Kubernetes
● Kubernetes uses certificates everywhere
● Very common source of incidents
● Our Strategy: Rotate all certificates daily
Certificate management
[Diagram: Vault hosts the etcd PKI, which issues the etcd peer/server cert and the etcd client cert used by the apiserver]
Certificate management
[Diagram, adding the kube PKI in Vault: apiserver/kubelet client cert, controller client cert, scheduler client cert, kubelet client/server cert]
Certificate management
[Diagram, adding the kube kv store in Vault: SA public key and SA private key; in-cluster apps authenticate with an SA token]
Certificate management
[Diagram, adding the apiservice PKI in Vault: apiservice cert (proxy/webhooks) for apiservices and webhooks]
Certificate management
[Diagram: Vault provides the credentials used by each component]
● etcd PKI: peer/server cert (etcd), etcd client cert (apiserver)
● kube PKI: apiserver/kubelet client cert (apiserver), controller client cert (controllers), scheduler client cert (scheduler), kubelet client/server cert (kubelet)
● kube kv: SA public key (apiserver), SA private key (controllers)
● apiservice PKI: apiservice cert (proxy/webhooks) for apiservices and webhooks
● In-cluster app: SA token
● kubectl: OIDC auth through the OIDC provider
Exception? Incident...
Kubelet: TLS Bootstrap
[Diagram: admin, apiserver, controllers, Vault (kube PKI, kube kv)]
1- Create Bootstrap token
2- Add Bootstrap token to vault
3- Get signing key
Kubelet: TLS Bootstrap
[Diagram: kubelet, apiserver, controllers, Vault (kube PKI, kube kv)]
1- Get Bootstrap token
2- Authenticate with token
3- Verify Token and map groups
4- Create CSR
5- Verify RBAC for CSR creator
6- Sign certificate
7- Download certificate
8- Authenticate with cert
9- Register node
Kubelet certificate issue
1. One day, some Kubelets were failing to start or took tens of minutes
2. Nothing in logs
3. Everything looked good but they could not get a cert
4. Turns out we had a lot of CSRs in flight
5. Signing controller was having a hard time evaluating them all
[Graph: CSR resources in the cluster (lower is better)]
Why?
Kubelet Authentication
● Initial creation: bootstrap token, mapped to group “system:bootstrappers”
● Renewal: use current node certificate, mapped to group “system:nodes”
Required RBAC permissions
● CSR creation
● CSR auto-approval

Group                  CSR creation   CSR auto-approval
system:bootstrappers   OK             OK
system:nodes           OK             -
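The table above can be read as a small decision model. The sketch below is illustrative only: `PERMISSIONS`, `csr_outcome` and `pending_after` are hypothetical names, not Kubernetes APIs, and the permission sets simply mirror the table. It shows why renewal CSRs from “system:nodes” pile up: they can be created, but nothing auto-approves them.

```python
# Hypothetical model of the table above. Names and permission labels are
# illustrative assumptions, not Kubernetes identifiers.
PERMISSIONS = {
    "system:bootstrappers": {"create_csr", "auto_approve_csr"},
    "system:nodes": {"create_csr"},  # auto-approval was missing for renewals
}


def csr_outcome(group, permissions=PERMISSIONS):
    """Return what happens to a CSR submitted by a member of `group`."""
    granted = permissions.get(group, set())
    if "create_csr" not in granted:
        return "rejected"   # the CSR cannot even be created
    if "auto_approve_csr" not in granted:
        return "pending"    # created, but never approved automatically
    return "signed"


def pending_after(requesting_groups, permissions=PERMISSIONS):
    """Count CSRs left pending after a batch of requests."""
    return sum(
        1 for group in requesting_groups
        if csr_outcome(group, permissions) == "pending"
    )
```

With these assumed permissions, every renewal adds one more pending CSR; granting auto-approval to “system:nodes” in the model brings the count back to zero.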
Exception 2? Incident 2...
Temporary solution
[Diagram: admin, apiserver, webhook, Vault (kube kv)]
● admin: Create webhook with self-signed cert as CA
● admin: Add self-signed cert + key to Vault
● webhook: Get cert and key
One day, after ~1 year
● Creation of resources started failing (luckily only a Custom Resource)
● Cert had expired...
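An expiry like this one can be caught ahead of time with a simple check. The sketch below is a minimal example, not what the speaker ran: `days_until_expiry` and `needs_attention` are hypothetical helpers, and the 30-day threshold is an arbitrary assumption. It relies on the standard library's `ssl.cert_time_to_seconds`, which parses the `notAfter` string format found in decoded certificates.

```python
import ssl
import time
from typing import Optional

SECONDS_PER_DAY = 86_400


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days left before a certificate expires.

    `not_after` uses the certificate time format,
    for example "Jan  5 09:34:43 2018 GMT".
    """
    expires_at = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires_at - current) / SECONDS_PER_DAY


def needs_attention(not_after: str, warn_days: float = 30,
                    now: Optional[float] = None) -> bool:
    """True when the certificate expires in fewer than `warn_days` days."""
    return days_until_expiry(not_after, now) < warn_days
```

A negative result from `days_until_expiry` means the certificate has already expired.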
Take-away
● Rotate server/client certificates
● Not easy
But, “If it’s hard, do it often”
> no expiration issues anymore
Impact of Certificate rotation
Apiserver certificate rotation
Impact on etcd
[Graphs: apiserver restarts; etcd slow queries; etcd traffic]
● We have multiple apiservers
● We restart each daily
● Significant etcd network impact (caches are repopulated)
● Significant impact on etcd performance
Impact on Load-balancers
[Graphs: apiserver restarts; ELB surge queue]
● Significant impact on LB as connections are reestablished
● Mitigation: increase queues on apiservers
  net.ipv4.tcp_max_syn_backlog
  net.core.somaxconn
Impact on apiserver clients
[Graphs: apiserver restarts; coredns memory usage]
● Apiserver restarts
● Clients reconnect and refresh their cache
> Memory spike for impacted apps
No real mitigation today
Impact on traffic balance
● Number of connections / traffic very unbalanced, because connections are very long-lived
● More clients => Bigger impact clusterwide
[Graph annotations: 15MB/s vs 2.5MB/s of traffic; 2300 vs 300 connections]
Why? Simple simulation
Simulation for 48h
● 5 apiservers
● 10000 connections (4 x 2500 nodes)
● Every 4h, one apiserver restarts
● Reconnections evenly dispatched
Cause
● Cloud TCP load-balancers use round-robin
● Long-lived connections
● No rebalancing
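A simulation with these parameters can be sketched in a few lines. This is not the speaker's code. It assumes that restarts rotate through the apiservers in order and that a restarting apiserver's connections are split evenly across the other four, since it is unavailable while clients reconnect. `simulate` is a hypothetical name.

```python
def simulate(n_servers=5, n_conns=10_000, hours=48, restart_every=4):
    """Return [(hour, connections_per_server)] recorded after each restart."""
    # Start from a perfectly balanced state.
    conns = [n_conns // n_servers] * n_servers
    for i in range(n_conns % n_servers):
        conns[i] += 1

    history = []
    restart_hours = range(restart_every, hours + 1, restart_every)
    for step, hour in enumerate(restart_hours):
        victim = step % n_servers            # assumption: restarts rotate in order
        dropped, conns[victim] = conns[victim], 0
        others = [i for i in range(n_servers) if i != victim]
        share, extra = divmod(dropped, len(others))
        for rank, i in enumerate(others):    # reconnections evenly dispatched
            conns[i] += share + (1 if rank < extra else 0)
        history.append((hour, list(conns)))
    return history


if __name__ == "__main__":
    for hour, per_server in simulate():
        print(f"{hour:>2}h {per_server}")
```

Because long-lived connections are never rebalanced, a freshly restarted apiserver only receives connections when another one restarts, so the spread between the busiest and the idlest server stays wide for the whole run.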
Kubelet certificate rotation
Pod graceful termination
[Diagram: admin or controller, apiserver, kubelet, containerd, container]
● admin or controller: Delete pod
● kubelet: Stop Container with timeout “terminationGracePeriodSeconds”
● containerd: Send SIGTERM; after timeout, send SIGKILL
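The SIGTERM-then-SIGKILL sequence can be reproduced with an ordinary child process. The sketch below is only an analogy for what the runtime does to a container, assuming a POSIX system; `stop_gracefully` and `spawn` are hypothetical helpers, and `grace_seconds` plays the role of “terminationGracePeriodSeconds”.

```python
import subprocess
import sys


def stop_gracefully(proc: subprocess.Popen, grace_seconds: float) -> str:
    """Send SIGTERM, wait up to grace_seconds, then fall back to SIGKILL."""
    proc.terminate()                      # SIGTERM on POSIX
    try:
        proc.wait(timeout=grace_seconds)
        return "terminated"
    except subprocess.TimeoutExpired:
        proc.kill()                       # SIGKILL on POSIX
        proc.wait()
        return "killed"


def spawn(ignore_sigterm: bool) -> subprocess.Popen:
    """Start a sleeping child process, optionally ignoring SIGTERM."""
    setup = "signal.signal(signal.SIGTERM, signal.SIG_IGN);" if ignore_sigterm else ""
    code = f"import signal, time; {setup} print('ready', flush=True); time.sleep(30)"
    proc = subprocess.Popen([sys.executable, "-c", code],
                            stdout=subprocess.PIPE, text=True)
    proc.stdout.readline()                # wait until the child is set up
    return proc
```

A child that handles SIGTERM exits within the grace period; one that ignores it is killed once the timeout elapses.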
Restarts impact graceful termination
[Diagram: admin or controller, apiserver, kubelet, containerd, container]
● admin or controller: Delete pod
● containerd: Send SIGTERM; after timeout, or Context Cancelled, send SIGKILL
● Kubelet restarts end graceful termination
● Fixed upstream: “Do not SIGKILL container if container stop is cancelled”
https://github.com/containerd/cri/pull/1099
Impact on pod readiness
Issue upstream: “pod with readinessProbe will be not ready when kubelet restart”
https://github.com/kubernetes/kubernetes/issues/78733
[Graphs: kubelet restarts on “system” nodes (coredns + other services); coredns endpoints NotReady]
On kubelet restart
● Readiness probes marked as failed
● Pods removed from service endpoints
● Requires readiness to succeed again
Take-away
Restarting components is not transparent
It would be great if
○ Components could transparently reload certs (server & client)
○ Clients could wait 0-Xs to reconnect to avoid thundering herd
○ Reconnections did not trigger memory spikes
○ Cloud TCP load-balancers supported least-conn algorithm
○ Connections were rebalanced (kill them after a while?)
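The “wait 0-Xs to reconnect” idea is usually implemented as random jitter. The sketch below is a minimal example under that assumption; `reconnect_delays` and `peak_reconnects_per_second` are hypothetical names, and the 30-second window in the usage note is arbitrary.

```python
import random
from collections import Counter


def reconnect_delays(n_clients, max_delay_seconds, seed=None):
    """Pick an independent random delay in [0, max_delay_seconds] per client."""
    rng = random.Random(seed)
    return [rng.uniform(0.0, max_delay_seconds) for _ in range(n_clients)]


def peak_reconnects_per_second(delays):
    """Largest number of clients reconnecting within the same one-second slot."""
    per_second = Counter(int(delay) for delay in delays)
    return max(per_second.values())
```

Without jitter, 1000 clients all reconnect in the same second. With `reconnect_delays(1000, 30)`, the same clients are spread over roughly 30 one-second slots, which flattens the peak seen by the apiserver and the load-balancer.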
Efficient networking
Network challenges
● Throughput: trillions of data points daily
● Scale: 1000-2000 node clusters
● Latency: end-to-end pipeline
● Topology: multiple clusters, access from standard VMs
Giving pods IPs & Routing them
From “the Hard Way”
[Screenshot annotations: node IP; Pod CIDR for this node]
Small cluster? Static routes
● Node 1: IP 192.168.0.1, Pod CIDR 10.0.1.0/24
● Node 2: IP 192.168.0.2, Pod CIDR 10.0.2.0/24
Routes (local or cloud provider)
● 10.0.1.0/24 => 192.168.0.1
● 10.0.2.0/24 => 192.168.0.2
Limits
● local: nodes must be in the same subnet
● cloud provider: number of routes
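The route table above amounts to a lookup from pod IP to the node that hosts it. A minimal sketch using the two routes from the slide (`ROUTES` and `next_hop` are hypothetical names; real routing is done by the kernel or the cloud provider, not in Python):

```python
import ipaddress

# Pod CIDR => node IP, as listed on the slide.
ROUTES = {
    ipaddress.ip_network("10.0.1.0/24"): ipaddress.ip_address("192.168.0.1"),
    ipaddress.ip_network("10.0.2.0/24"): ipaddress.ip_address("192.168.0.2"),
}


def next_hop(pod_ip, routes=ROUTES):
    """Return the node IP that traffic for pod_ip is sent to, or None."""
    addr = ipaddress.ip_address(pod_ip)
    matches = [net for net in routes if addr in net]
    if not matches:
        return None
    best = max(matches, key=lambda net: net.prefixlen)  # longest-prefix match
    return routes[best]
```

For example, `next_hop("10.0.2.17")` resolves to Node 2 at 192.168.0.2. Every node added to the cluster adds one entry to this table, which is where the cloud provider's limit on the number of routes comes in.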
Mid-size cluster? Overlay
● Node 1: IP 192.168.0.1, Pod CIDR 10.0.1.0/24
● Node 2: IP 192.168.0.2, Pod CIDR 10.0.2.0/24
● VXLAN on each node: tunnel traffic between hosts
● Examples: Calico, Flannel
Limits
● Overhead of the overlay
● Scaling route distribution (control plane)
Large cluster with a lot of traffic? Native pod routing
Performance
● Datapath: no overhead
● Control plane: simpler
Addressing
● Pod IPs are accessible from other clusters and from VMs
In practice
On premise
● BGP: Calico, Kube-router
● Macvlan
AWS
● Additional IPs on ENIs: AWS EKS CNI plugin, Lyft CNI plugin, Cilium ENI IPAM
GCP
● IP aliases
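How it works on AWS
[Diagram: a node with eth0 and an additional ENI (eth1) holding ip 1, ip 2, ip 3; Pod 1 (eth0 with ip 1) and Pod 2 connected through veth]
● agent: Attach ENI, Allocate IPs
● kubelet calls containerd (CRI), which calls the cni plugin (CNI): Create veth, Routing
● Routing rule: “From IP1, use eth1”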
Address space planning
Pod CIDR: /24
[Diagram: 10. (8 bits) | 4 bits | node prefix: 12 bits | pod cidr: 8 bits]
● pod cidr, 8 bits: up to 255 pods per node, simple addressing
● node prefix, 12 bits: up to 4096 nodes
● 4 bits available: up to 16 clusters

● /24 leads to inefficient address usage
● sig-network: remove contiguous range requirement for CIDR allocation
● But also
○ Address space for node IPs (another /20 per cluster for 4096 nodes)
○ Service IP range (/20 would make sense for such a cluster)
● Total: 1 /15 for pods, 2 /20 for nodes and service!
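The bit layout on the slide (8 bits for “10.”, 4 cluster bits, 12 node bits, 8 pod bits) can be checked with the standard `ipaddress` module. This is a sketch of the arithmetic only; `cluster_pod_range` and `node_pod_cidr` are hypothetical helpers, and numbering clusters and nodes from 0 is an assumption. Under this layout each cluster's pod range works out to a /12 (4096 nodes, each with a /24).

```python
import ipaddress

BASE = ipaddress.IPv4Network("10.0.0.0/8")
CLUSTER_BITS, NODE_BITS, POD_BITS = 4, 12, 8
CLUSTER_PREFIX = BASE.prefixlen + CLUSTER_BITS   # 12: pod range of one cluster
NODE_PREFIX = CLUSTER_PREFIX + NODE_BITS          # 24: pod CIDR of one node


def cluster_pod_range(cluster: int) -> ipaddress.IPv4Network:
    """Pod range of cluster number `cluster` (0 to 15)."""
    if not 0 <= cluster < 2 ** CLUSTER_BITS:
        raise ValueError("cluster index out of range")
    offset = cluster << (NODE_BITS + POD_BITS)
    return ipaddress.IPv4Network(
        (int(BASE.network_address) + offset, CLUSTER_PREFIX)
    )


def node_pod_cidr(cluster: int, node: int) -> ipaddress.IPv4Network:
    """Pod CIDR of node number `node` (0 to 4095) in a given cluster."""
    if not 0 <= node < 2 ** NODE_BITS:
        raise ValueError("node index out of range")
    base = int(cluster_pod_range(cluster).network_address)
    return ipaddress.IPv4Network((base + (node << POD_BITS), NODE_PREFIX))
```

For example, cluster 1 gets 10.16.0.0/12 and its node 2 gets 10.16.2.0/24; the last node of the last cluster gets 10.255.255.0/24, which exhausts 10.0.0.0/8.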
Take-away
● Native pod routing has worked very well at scale
● A bit more complex to debug
● Much more efficient datapath
● Topic is still dynamic (Cilium introduced ENI recently)
● Great relationship with Lyft / Cilium
● Plan your address space early
Ingresses
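Ingress: cross-clusters, VM to clusters
[Diagram: services A, B, C and D spread across Cluster 1, Cluster 2 and Classic (VM) hosts; some clients are looking for services located elsewhere (“B?”, “C?”)]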
Kubernetes default: LB service
[Diagram: External Client, Load-Balancer, three nodes each with kube-proxy, a NodePort (NP) and a pod; Healthchecker; service-controller on the Master]
Legend: data path; health checks; configuration (from watching ingresses on apiservers)
Inefficient Datapath & cross-application impacts
[Diagram: Web traffic, Load-Balancer, nodes each with kube-proxy and a NodePort (NP), running web-1, web-2, web-3 and kafka; Healthchecker; service-controller on the Master]
Legend: data path; health checks; configuration (from watching ingresses on apiservers)
ExternalTrafficPolicy: Local?
[Diagram: Web traffic, Load-Balancer, nodes each with kube-proxy and a NodePort (NP), running web-1, web-2, web-3 and kafka; Healthchecker; service-controller on the Master]
Legend: data path; health checks; configuration (from watching ingresses on apiservers)
L7-proxy ingress controller
[Diagram: External Client, Load-Balancer, nodes each with kube-proxy and a NodePort (NP), l7proxy pods, application pods, Healthchecker, ingress-controller; service-controller on the Master]
● ingress-controller: Create l7proxy deployments, Update backends using service endpoints
Legend: data path; health checks; configuration
● from watching ingresses/endpoints on apiservers (ingress-controller)
● from watching LoadBalancer services (service-controller)
![Page 66: Kubernetes the Very Hard Way - O'Reilly · 2019. 11. 12. · containerd admin or controller Delete pod container Send SIGTERM After timeout, or Context Cancelled send SIGKILL Kubelet](https://reader033.vdocuments.us/reader033/viewer/2022060902/609edbfa8c2e2b72f45af060/html5/thumbnails/66.jpg)
Limits
All nodes as backends (1000+)
Inefficient datapath
Cross-application impacts
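The "inefficient datapath" limit can be put into rough numbers. The Python sketch below rests on two simplifying assumptions that are not from the slides: the load-balancer spreads requests evenly over every node, and kube-proxy on the receiving node then picks uniformly among all endpoints of the service. The helper name `extra_hop_fraction` is hypothetical.

```python
from fractions import Fraction


def extra_hop_fraction(pods_per_node):
    """Fraction of requests forwarded to a second node before reaching a pod.

    pods_per_node: number of pods of the service on each node (one entry per node).
    Assumes (simplification) an even spread over nodes by the load-balancer and
    a uniform choice among all endpoints by kube-proxy.
    """
    nodes = len(pods_per_node)
    total_pods = sum(pods_per_node)
    if nodes == 0 or total_pods == 0:
        raise ValueError("need at least one node and one pod")
    # A request stays local only if it lands on a node AND kube-proxy
    # happens to pick one of the pods running on that same node.
    local = sum(Fraction(1, nodes) * Fraction(p, total_pods) for p in pods_per_node)
    return 1 - local


# 3 pods spread over a 1000-node cluster where every node is a backend.
print(extra_hop_fraction([1, 1, 1] + [0] * 997))  # 999/1000
```

Under these assumptions almost every request crosses a second node, and that second node may be running an unrelated application.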
Alternatives?
ExternalTrafficPolicy: Local?
> Number of nodes remains the same
> Issues with some CNI plugins
K8s ingress
> Still load-balancer based
> Need to scale ingress pods
> Still inefficient datapath
Challenges
![Page 67: Kubernetes the Very Hard Way - O'Reilly · 2019. 11. 12. · containerd admin or controller Delete pod container Send SIGTERM After timeout, or Context Cancelled send SIGKILL Kubelet](https://reader033.vdocuments.us/reader033/viewer/2022060902/609edbfa8c2e2b72f45af060/html5/thumbnails/67.jpg)
Our target: native routing
[Diagram: External Client, ALB, 3x pod, Healthchecker, alb-ingress-controller. Legend: data path; health checks; configuration (from watching ingresses/endpoints on apiservers)]
![Page 68: Kubernetes the Very Hard Way - O'Reilly · 2019. 11. 12. · containerd admin or controller Delete pod container Send SIGTERM After timeout, or Context Cancelled send SIGKILL Kubelet](https://reader033.vdocuments.us/reader033/viewer/2022060902/609edbfa8c2e2b72f45af060/html5/thumbnails/68.jpg)
Limited to HTTP ingresses
No support for TCP/UDP
Ingress v2 should address this
Remaining challenges
Registration delay
Slow registration with LB
Pod rolling-updates much faster
Mitigations
- MinReadySeconds
- Pod ReadinessGates
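To see why a rolling update that outpaces load-balancer registration is a problem, and how MinReadySeconds helps, here is a small Python model. It is a sketch under stated assumptions, none of which come from the slides: pods are replaced one at a time (surge of 1, no unavailable pods allowed), registration starts when a new pod becomes ready and takes a fixed delay, and an old pod leaves the load-balancer the moment it is deleted. The function name and parameters are hypothetical.

```python
def min_in_service(replicas, startup_s, registration_s, min_ready_s):
    """Lowest number of backends the load-balancer can reach during a rollout.

    replicas:        number of pods in the deployment
    startup_s:       seconds for a new pod to become ready
    registration_s:  seconds for the load-balancer to register a ready pod
    min_ready_s:     seconds a pod must stay ready before it counts as available
    """
    events = []  # (time, change in the number of in-service backends)
    t = 0
    for _ in range(replicas):
        ready_at = t + startup_s
        events.append((ready_at + registration_s, +1))  # new pod in service
        available_at = ready_at + min_ready_s
        events.append((available_at, -1))  # old pod deleted
        t = available_at  # the next replacement starts now

    in_service = lowest = replicas
    # Tuples sort by time, then by change, so on a tie the removal is
    # applied before the addition (the pessimistic ordering).
    for _, change in sorted(events):
        in_service += change
        lowest = min(lowest, in_service)
    return lowest


# Hypothetical numbers: 3 replicas, 5 s startup, 60 s registration delay.
print(min_in_service(3, 5, 60, min_ready_s=0))
print(min_in_service(3, 5, 60, min_ready_s=90))
```

In this model, a rollout without MinReadySeconds removes every old pod before any new pod is registered, while a value longer than the registration delay keeps the full set of backends in service.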
![Page 69: Kubernetes the Very Hard Way - O'Reilly · 2019. 11. 12. · containerd admin or controller Delete pod container Send SIGTERM After timeout, or Context Cancelled send SIGKILL Kubelet](https://reader033.vdocuments.us/reader033/viewer/2022060902/609edbfa8c2e2b72f45af060/html5/thumbnails/69.jpg)
Workaround
[Diagram: External Client, Load-Balancer, 2x l7proxy, Healthchecker, pods]
Not managed by k8s
Dedicated nodes
Pods in host network
TCP / Registration delay not manageable
> Dedicated gateways
![Page 70: Kubernetes the Very Hard Way - O'Reilly · 2019. 11. 12. · containerd admin or controller Delete pod container Send SIGTERM After timeout, or Context Cancelled send SIGKILL Kubelet](https://reader033.vdocuments.us/reader033/viewer/2022060902/609edbfa8c2e2b72f45af060/html5/thumbnails/70.jpg)
Take-away
● Ingress solutions are not great at scale yet
● May require workarounds
● Definitely a very important topic for us
● The community is working on v2 Ingresses
![Page 71: Kubernetes the Very Hard Way - O'Reilly · 2019. 11. 12. · containerd admin or controller Delete pod container Send SIGTERM After timeout, or Context Cancelled send SIGKILL Kubelet](https://reader033.vdocuments.us/reader033/viewer/2022060902/609edbfa8c2e2b72f45af060/html5/thumbnails/71.jpg)
Conclusion
![Page 72: Kubernetes the Very Hard Way - O'Reilly · 2019. 11. 12. · containerd admin or controller Delete pod container Send SIGTERM After timeout, or Context Cancelled send SIGKILL Kubelet](https://reader033.vdocuments.us/reader033/viewer/2022060902/609edbfa8c2e2b72f45af060/html5/thumbnails/72.jpg)
A lot of other topics
● Accessing services (kube-proxy)
● DNS (it’s always DNS!)
● Challenges with stateful applications
● How to DDoS <insert ~anything> with DaemonSets
● Node Lifecycle / Cluster Lifecycle
● Deploying applications
● ...
![Page 73: Kubernetes the Very Hard Way - O'Reilly · 2019. 11. 12. · containerd admin or controller Delete pod container Send SIGTERM After timeout, or Context Cancelled send SIGKILL Kubelet](https://reader033.vdocuments.us/reader033/viewer/2022060902/609edbfa8c2e2b72f45af060/html5/thumbnails/73.jpg)
Getting started?
“Deep Dive into Kubernetes Internals for Builders and Operators”
Jérôme Petazzoni, LISA 2019
https://lisa-2019-10.container.training/talk.yml.html
Minimal cluster, showing interactions between main components
“Kubernetes the Hard Way”
Kelsey Hightower
https://github.com/kelseyhightower/kubernetes-the-hard-way
HA control plane with encryption
![Page 74: Kubernetes the Very Hard Way - O'Reilly · 2019. 11. 12. · containerd admin or controller Delete pod container Send SIGTERM After timeout, or Context Cancelled send SIGKILL Kubelet](https://reader033.vdocuments.us/reader033/viewer/2022060902/609edbfa8c2e2b72f45af060/html5/thumbnails/74.jpg)
You like horror stories?
“Kubernetes the very hard way at Datadog”
https://www.youtube.com/watch?v=2dsCwp_j0yQ
“10 ways to shoot yourself in the foot with Kubernetes”
https://www.youtube.com/watch?v=QKI-JRs2RIE
“Kubernetes Failure Stories”
https://k8s.af
![Page 75: Kubernetes the Very Hard Way - O'Reilly · 2019. 11. 12. · containerd admin or controller Delete pod container Send SIGTERM After timeout, or Context Cancelled send SIGKILL Kubelet](https://reader033.vdocuments.us/reader033/viewer/2022060902/609edbfa8c2e2b72f45af060/html5/thumbnails/75.jpg)
Key lessons
Self-managed Kubernetes is hard
> If you can, use a managed service
Networking is not easy (especially at scale)
The main challenge is not technical
> Build a team
> Transforming practices and training users is very important