operating consul as an early adopter

a talk Nelson Elhage, @nelhage Operating Consul As an Early Adopter

Upload: nelson-elhage

Post on 12-Apr-2017


TRANSCRIPT

Page 1: Operating Consul as an Early Adopter

a talk

Nelson Elhage, @nelhage

Operating Consul As an Early Adopter

Page 2: Operating Consul as an Early Adopter

This Talk

• consul @ Stripe
• War Stories
• Lessons Learned

Page 3: Operating Consul as an Early Adopter

Consul at Stripe: The Good, The Bad, The Outages

Page 4: Operating Consul as an Early Adopter

Why Consul?

• Early 2014
• Stripe Infra gaining complexity
• Nightmarish in-house service registry
• Host lists distributed via puppet

Page 5: Operating Consul as an Early Adopter

Why Consul?

• Wanted a better service/host store
• consul had everything baked in
• Decided to do some test deployments

Page 6: Operating Consul as an Early Adopter

Initial Rollout

• Rolled out across all servers
• (started with bake-in in QA)
• No clients at all

Page 7: Operating Consul as an Early Adopter

What Could Go Wrong?

• We worried about memory leaks

Page 8: Operating Consul as an Early Adopter

Our First Production Issue

• Noticed one node taking >100M RAM
• (others all <50M)
• Reached out to armon for advice
• bug in the stats framework:
• https://github.com/armon/go-metrics/commit/02567bbc4f518a43853d262b651a3c8257c3f141

Page 9: Operating Consul as an Early Adopter

Started Adding Clients

• Hooked into our deploy tool
• kept a manual emergency fallback
• Generated LB config from consul
• Noticed a surprising rate of errors

Page 10: Operating Consul as an Early Adopter

Raft Instability

• Seeing >1 failover/minute
• Reached out to Armon
• “Try 0.3”
• “consul is not optimized for spinning disk”

Page 11: Operating Consul as an Early Adopter

Rolling out 0.3

• Roll to QA first
• Nothing works!
• Check logs: TLS verification errors

Page 12: Operating Consul as an Early Adopter

Rolling out 0.3

• 0.3 changed TLS verification to check the cert name

• Change our SSL issuing to add SANs

• 2014/06/16 16:52:57 [ERR] raft: Failed to make RequestVote RPC to 10.100.29.175:8300: x509: certificate is valid for [remote host], not [local host]

Page 13: Operating Consul as an Early Adopter

0.3 TLS Woes

• Whoops! consul was checking the remote cert against the local node name

• armon> we just use "demo.consul.io" as the CN for all of them

• 0.3 essentially completely broke TLS
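
The failure mode on this slide can be modelled in a few lines. This is a toy sketch, not Consul's actual verification code: the `verify` helper and the node names are hypothetical, and real X.509 name matching is more involved. It only illustrates why checking the remote certificate against the local node name rejects per-node certificates, while a single shared name (such as the `demo.consul.io` CN mentioned above) would pass.

```python
def verify(cert_names, expected_name):
    """Return True if the presented certificate lists the expected name."""
    return expected_name in cert_names


# Hypothetical nodes, each holding a certificate that names only itself.
certs = {
    "node-a.example": {"node-a.example"},
    "node-b.example": {"node-b.example"},
}
local, remote = "node-a.example", "node-b.example"

# Intended check: the remote certificate against the remote node's name.
intended = verify(certs[remote], remote)

# Check as described on the slide: remote certificate against the LOCAL name.
as_described = verify(certs[remote], local)

# One shared name on every certificate passes regardless of which node asks.
shared_cert = {"demo.consul.io"}
shared = verify(shared_cert, "demo.consul.io")
```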

Page 14: Operating Consul as an Early Adopter

0.3.1

• I wrote and got merged a patch to restore 0.2 behavior

• Rolled forward to 0.3.1
• Upgraded to SSD-backed servers

Page 15: Operating Consul as an Early Adopter

Increasing Rollout

• Switched various operational tools from flatfile to consul

• Main app started using consul at startup

Page 16: Operating Consul as an Early Adopter

Consensus is Hard

Page 17: Operating Consul as an Early Adopter

consul-template

• Generating haproxy config using consul-template
• https://github.com/hashicorp/consul-template/issues/168 – `consul-template` takes O(N²) time with N services

Page 18: Operating Consul as an Early Adopter

consul-template

• Got that fixed, turned it on
• consul immediately fell over
• multiple elections/minute
• 2M allocations/minute

Page 19: Operating Consul as an Early Adopter

consul-template

• Service Watches churn when any service changes health state

• Watching services on a large cluster → self-DDOS
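
A rough back-of-envelope model of why this turns into a self-DDOS, assuming (as the slide describes) that every health-state change wakes every service watcher and each woken watcher re-queries the servers. The function name and all numbers are hypothetical, chosen only to show how the load multiplies.

```python
def requests_per_minute(watchers, health_changes_per_minute):
    """Each health change wakes every watcher, and each one re-queries."""
    return watchers * health_changes_per_minute


# Hypothetical small cluster: few watchers, few health flaps.
small = requests_per_minute(watchers=10, health_changes_per_minute=5)

# Hypothetical large cluster: both factors grow with cluster size,
# so the query load grows roughly with their product.
large = requests_per_minute(watchers=1000, health_changes_per_minute=500)
```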

Page 20: Operating Consul as an Early Adopter

consul-template

• We use `consul-template -once` in cron now

• Worse latency, but it works reliably
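
A minimal sketch of what such a cron job might invoke, written as a small Python wrapper. The `-once` and `-template` flags are consul-template's own; the helper names, file paths, and reload command are hypothetical and not taken from the talk.

```python
import subprocess


def build_command(template_path, output_path, reload_command=None):
    """Build a one-shot consul-template invocation (render, then exit)."""
    spec = f"{template_path}:{output_path}"
    if reload_command:
        # Optional third field: command to run after the file changes.
        spec += f":{reload_command}"
    return ["consul-template", "-once", "-template", spec]


def render_once(template_path, output_path, reload_command=None, timeout=60):
    """Run the one-shot render; raises if consul-template fails or hangs."""
    cmd = build_command(template_path, output_path, reload_command)
    return subprocess.run(cmd, check=True, timeout=timeout)
```

Scheduling `render_once` from cron trades update latency for a bounded, predictable query rate, which matches the trade-off stated above.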

Page 21: Operating Consul as an Early Adopter

consul for leader election

• Our data team wanted a leader-election primitive

• Built on top of consul, cribbing example code

Page 22: Operating Consul as an Early Adopter

Sometime Later…

Page 23: Operating Consul as an Early Adopter

goroutine leak

• consul would rapidly eat all memory
• larger heap -> large GC pauses -> raft instability
• manually restarted cluster 1/day

Page 24: Operating Consul as an Early Adopter

goroutine leak

• Reached out to Armon
• Very helpful in debugging
• Found several unrelated memory leaks

Page 25: Operating Consul as an Early Adopter

goroutine leak

• Tried to figure out what changed
• Eventually correlated to a session leak in our leader election code

Page 26: Operating Consul as an Early Adopter

goroutine leak

• Fixed our leader-election code
• New policy: No non-discovery uses of consul
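
For readers unfamiliar with session-based leader election, here is a hedged sketch of the pattern using Consul's HTTP session and KV endpoints. It is not the code from the talk: the `leadership` helper, the key name, and the agent address are assumptions. The point it illustrates is that the session is destroyed in a `finally` block, so a crash or error in the leader's work does not leave a session behind.

```python
import json
from contextlib import contextmanager
from urllib import request


def http_put(url, body=b""):
    """Send a PUT to the Consul agent and decode the JSON reply."""
    req = request.Request(url, data=body, method="PUT")
    with request.urlopen(req, timeout=5) as resp:
        return json.loads(resp.read().decode())


@contextmanager
def leadership(key, base="http://127.0.0.1:8500", put=http_put):
    """Yield True if this process acquired the lock on `key`."""
    session = put(f"{base}/v1/session/create", b"{}")["ID"]
    try:
        # The KV acquire call returns true only for the winning session.
        yield put(f"{base}/v1/kv/{key}?acquire={session}", b"") is True
    finally:
        # Always destroy the session, even on error, so none leak.
        put(f"{base}/v1/session/destroy/{session}")
```

Typical use would be `with leadership("service/etl/leader") as is_leader:` followed by doing the leader-only work when `is_leader` is true; the key name here is purely illustrative.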

Page 27: Operating Consul as an Early Adopter

consul DNS

• Increasingly reliant on consul for internal discovery

• Unhappy at exposure to periodic instability
• Still have fallbacks, but outages remain painful

Page 28: Operating Consul as an Early Adopter

consul DNS

• Solution: Use consul-template to compile consul DNS to a zone file

• Serve that out of a normal DNS server
• Refresh every 15s
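
The talk does this with a consul-template template; the sketch below shows the same transformation in Python purely for illustration. `render_zone` and its input shape are hypothetical, the 15-second TTL is an assumption chosen to mirror the refresh interval, and a real zone file would also need SOA and NS records, which are omitted here. The `<name>.service.consul` naming follows Consul's DNS interface.

```python
def render_zone(services, ttl=15, domain="consul"):
    """Render A records for a mapping of service name -> list of addresses."""
    lines = [f"$TTL {ttl}"]
    for name in sorted(services):
        for address in sorted(services[name]):
            lines.append(f"{name}.service.{domain}. IN A {address}")
    return "\n".join(lines) + "\n"


# Hypothetical catalog snapshot, e.g. gathered from healthy instances.
zone = render_zone({"web": ["10.0.0.2", "10.0.0.1"]})
```

Because the output is sorted, an unchanged catalog renders an identical file, which makes it cheap to skip a DNS server reload when nothing changed.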

Page 29: Operating Consul as an Early Adopter

Current Status

• Run consul everywhere
• Register all services
• Request-path lookups hit cached DNS
• Operational tools use HTTP interface
• Also generate config from consul-template
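
As an illustration of the "operational tools use HTTP interface" point, here is a small sketch that extracts endpoints from the JSON returned by Consul's `/v1/health/service/<name>` endpoint. The `endpoints` helper and the sample payload are hypothetical; only the fields used here (`Node.Address`, `Service.Address`, `Service.Port`) are relied on.

```python
import json


def endpoints(health_response):
    """Return (address, port) pairs from a health-service JSON response."""
    result = []
    for entry in json.loads(health_response):
        service = entry["Service"]
        # Fall back to the node address when the service sets none.
        address = service.get("Address") or entry["Node"]["Address"]
        result.append((address, service["Port"]))
    return result


# Hypothetical response body for a service with two instances.
sample = json.dumps([
    {"Node": {"Node": "box1", "Address": "10.0.0.1"},
     "Service": {"Service": "web", "Address": "", "Port": 8080}},
    {"Node": {"Node": "box2", "Address": "10.0.0.2"},
     "Service": {"Service": "web", "Address": "10.1.0.2", "Port": 8081}},
])
found = endpoints(sample)
```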

Page 30: Operating Consul as an Early Adopter

Final Stability Note

• consul 0.5.2 fixed our memory leaks
• consul has been quite stable for us of late
• consul-template watches still don’t scale
• 0.6 should help

Page 31: Operating Consul as an Early Adopter

Lessons Learned

being an early adopter without bringing down the site (too many times)

Page 32: Operating Consul as an Early Adopter

Expect It To Be Rough

Page 33: Operating Consul as an Early Adopter

Monitoring, Monitoring, Monitoring

Page 34: Operating Consul as an Early Adopter

(graph all the things)

Page 35: Operating Consul as an Early Adopter

Incremental Rollout

Page 36: Operating Consul as an Early Adopter

Limit Scope

Page 37: Operating Consul as an Early Adopter

Isolation

Page 38: Operating Consul as an Early Adopter

Upgrade Aggressively

Page 39: Operating Consul as an Early Adopter

Get To Know Upstream

Page 40: Operating Consul as an Early Adopter

Be Willing to Dive In

Page 41: Operating Consul as an Early Adopter

Questions?