Building Robust Systems With Consul
I’m Mitchell Hashimoto
Also known as @mitchellh
HashiCorp
Towards a Software Managed Datacenter
Vagrant: http://www.vagrantup.com
Packer: http://www.packer.io
Serf: http://www.serfdom.io
Consul: http://www.consul.io
Consul
Take a Step Back
Taking a look at the big picture.
[Diagrams: a node running several services; a hypervisor hosting several nodes, each running services (S); nodes additionally running containers that hold services.]
Modern Ops
More everything, more problems.
• Where is service foo?
• Is service foo healthy/available?
• What is service foo’s configuration?
• Where is the service foo leader?
Meta:
What happens when the thing that answers these questions is unavailable?
Robust Systems
Stem from the ability to answer these questions.
Practical Goals
• Start services in any order
• Destroy services with confidence
• Restart servers safely
• Reconfigure services easily
• Where is service foo?
• Is service foo healthy/available?
• What is service foo’s configuration?
• Where is the service foo leader?
Where is service foo?
Maybe here: 127.0.0.1
Maybe close: 10.0.1.35
Maybe there: foo.foohost.com
Is service foo healthy/available?
Yes: Great!
No: Avoid or handle gracefully.
What is service foo’s configuration?
Access information, supported features, enabled/disabled.
What is my configuration?
Expect it to be modifiable.
Where is the service foo leader or best choice?
Locality, master/slave, versions.
Meta: Is the thing answering these questions stable/available?
Critical infrastructure component, you want “yes” as often as possible.
Robust! Can find services, can avoid and handle unhealthy services, can be configured externally, and can trust that it can retrieve all of this information.
Practical Goals
• Start services in any order
• Destroy services with confidence
• Restart servers safely
• Reconfigure services easily
Consul
Solution Attempts
In a world… before Consul…
Manual/Hardcoded
• Doesn’t scale with services/nodes
• Not resilient to failures
• Localized visibility/auditability
• Manual locality of services
Config Mgmt Problem
• Slow to react to changes
• Not resilient to failures
• Not really configurable by developers
• Locality, monitoring, etc. manual
LB Fronted Services
• Introduces different SPOF
• How does LB find service addresses/configure?
• Solves some problems, though.
ZooKeeper
• Complicated
• Heavy clients
• Building block, very manual
Consul
Service Discovery
Where is service foo?
Service Discovery
$ dig web-frontend.service.consul. +short
10.0.3.89
10.0.1.46
$ curl http://localhost:8500/v1/catalog/service/web-frontend
[{ "Node": "node-e818f1", "Address": "10.0.3.89", "ServiceID": "web-frontend", … }]
Service Discovery
• DNS is legacy-friendly. No application changes required.
• HTTP returns rich metadata.
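To illustrate the "rich metadata" point, here is a minimal sketch that parses a catalog response shaped like the one on the slide and extracts the node addresses. The function name `service_addresses` and the sample payload are illustrative assumptions. Only the `Node`, `Address`, and `ServiceID` fields are taken from the slide, and a real response carries more fields.

```python
import json

# Sample response body, trimmed to the fields shown on the slide (illustrative).
SAMPLE_CATALOG_RESPONSE = """
[
  {"Node": "node-e818f1", "Address": "10.0.3.89", "ServiceID": "web-frontend"}
]
"""


def service_addresses(body: str) -> list[str]:
    """Return node addresses from a /v1/catalog/service/<name> response body."""
    entries = json.loads(body)
    return [entry["Address"] for entry in entries]


print(service_addresses(SAMPLE_CATALOG_RESPONSE))
```

An application that cannot rely on DNS alone could use a lookup like this to pick an address and still read the remaining fields of each entry.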
Failure Detection
Is service foo healthy/available?
Failure Detection
• DNS won’t return non-healthy services or nodes.
• HTTP has endpoints to list health state of catalog.
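The filtering that DNS does automatically can also be done by a client that reads health state over HTTP. The sketch below assumes each entry has a `Node` object with an `Address` and a `Checks` list whose items carry a `Status` string, which is how Consul's health endpoints are generally shaped. The sample data and the node name `node-a1b2c3` are hypothetical.

```python
def healthy_addresses(entries: list[dict]) -> list[str]:
    """Return addresses of entries whose checks all report "passing".

    An entry with no checks at all is treated as healthy in this sketch.
    """
    healthy = []
    for entry in entries:
        checks = entry.get("Checks", [])
        if all(check.get("Status") == "passing" for check in checks):
            healthy.append(entry["Node"]["Address"])
    return healthy


# Hypothetical health data: one passing instance, one failing instance.
SAMPLE_HEALTH_ENTRIES = [
    {
        "Node": {"Node": "node-e818f1", "Address": "10.0.3.89"},
        "Checks": [{"Name": "http", "Status": "passing"}],
    },
    {
        "Node": {"Node": "node-a1b2c3", "Address": "10.0.1.46"},
        "Checks": [{"Name": "http", "Status": "critical"}],
    },
]

print(healthy_addresses(SAMPLE_HEALTH_ENTRIES))
```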
Key/Value Storage
What is the config of service foo?
Key/Value Storage
$ curl -X PUT -d 'bar' http://localhost:8500/v1/kv/foo
true
$ curl http://localhost:8500/v1/kv/foo?raw
bar
Key/Value Storage
• Highly available storage of configuration.
• Turn knobs without big configuration management process.
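The same two calls can be made from application code. This is a minimal sketch using only the Python standard library. The helper names `put_request`, `raw_get_request`, and `send` are hypothetical, and the agent address is the one used on the slides.

```python
import urllib.request

# Local agent address as used on the slides.
CONSUL_KV_BASE = "http://localhost:8500/v1/kv/"


def put_request(key: str, value: str) -> urllib.request.Request:
    """Build (but do not send) a PUT that stores `value` under `key`."""
    return urllib.request.Request(
        CONSUL_KV_BASE + key, data=value.encode("utf-8"), method="PUT"
    )


def raw_get_request(key: str) -> urllib.request.Request:
    """Build (but do not send) a GET that asks for the raw stored value."""
    return urllib.request.Request(CONSUL_KV_BASE + key + "?raw", method="GET")


def send(request: urllib.request.Request) -> str:
    """Send a prepared request to the agent and return the body as text."""
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.read().decode("utf-8")
```

Against a running agent, `send(put_request("foo", "bar"))` followed by `send(raw_get_request("foo"))` mirrors the two curl commands. Building the request separately from sending it keeps the sketch checkable without a Consul agent.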
Multi-Datacenter
Multi-Datacenter
$ dig web-frontend.service.singapore.consul. +short
10.3.3.33
10.3.1.18
$ dig web-frontend.service.germany.consul. +short
10.7.3.41
10.7.1.76
Multi-Datacenter
$ curl 'http://localhost:8500/v1/kv/foo?raw&dc=asia'
true
$ curl 'http://localhost:8500/v1/kv/foo?raw&dc=eu'
false
Multi-Datacenter
• Local by default
• Can query other datacenters whenever you need to
Web UI
• Node, service, health check, and K/V management and visibility for every datacenter in a single UI.
Operations
Consul Availability / Scalability
The Meta Question
Architecture
Server Cluster
• 3, 5, 7 servers
• (n/2) + 1 for availability
• Replicated writes
• Automatic leader election, leader forwarding.
Lightweight Clients
• Ephemeral state
• Health checks
• Optional (but recommended). Legacy machines don’t need them.
• Automatic request forwarding to servers.
Cheap Gossip
• Health check and membership info.
• Very cheap
• No guaranteed reliability, but only used for data that can be lost
• (See Serf)
Multi-DC
• Independent server clusters
• Request forwarding
• WAN gossip for membership
General Points: Servers
• (n/2) + 1 servers (a majority) must be up for writes to be available
• More servers means higher write latency because of replication. Throughput marginally affected.
• Can leave/add at will, keeping in mind min. node requirement.
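The majority rule is easy to work through for the recommended cluster sizes. The sketch below uses integer division for n/2. The function names are illustrative.

```python
def quorum(servers: int) -> int:
    """Smallest number of servers that forms a majority."""
    return servers // 2 + 1


def failures_tolerated(servers: int) -> int:
    """How many servers can be lost while a majority remains."""
    return servers - quorum(servers)


for n in (3, 5, 7):
    print(f"{n} servers: quorum {quorum(n)}, tolerates {failures_tolerated(n)} failure(s)")
```

This prints a quorum of 2, 3, and 4 with 1, 2, and 3 tolerated failures respectively. A cluster of 4 needs a quorum of 3 and still tolerates only 1 failure, which is why odd sizes are the usual choice.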
General Points: Clients
• Clients can be removed/added at will without issue.
• Clients don’t currently affect read/write throughput in a meaningful way.
• Although technically optional, they’re highly recommended for delegated health checks.
Throughput
• On virtualized cloud systems with spinning disks: thousands of reads and writes per second
• Practically won’t hit read/write limit
Scalable and available. Consul’s architecture makes it incredibly scalable and highly unlikely to become unavailable.
Robust Systems
Consul configured, monitored, discovered
• Consul KV for configuration.
• Consul DNS for service coupling/discovery.
• Consul Health Checks for monitoring.
Consul KV: Configuration
Consul KV: Configuration
$ envconsul -reload myapp/config bin/myapp
…
Consul KV: Configuration
• envconsul turns K/V into environment variables and restarts the application on change.
• No application changes!
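The idea behind envconsul can be sketched in a few lines: read the keys under a prefix, turn them into environment variables, and launch the program with them. This is not envconsul's implementation. The naming rule (strip the prefix, replace `/` with `_`, upper-case) is an assumption made for illustration, and the restart-on-change behavior is left out.

```python
import os
import subprocess


def kv_to_env(prefix: str, pairs: dict[str, str]) -> dict[str, str]:
    """Map K/V pairs under `prefix` to environment variable names.

    Assumed naming rule (illustrative only): strip the prefix,
    replace "/" with "_", and upper-case the remainder.
    """
    prefix = prefix.rstrip("/") + "/"
    env = {}
    for key, value in pairs.items():
        if not key.startswith(prefix):
            continue  # key belongs to some other application
        name = key[len(prefix):]
        if name:
            env[name.replace("/", "_").upper()] = value
    return env


def run_with_config(command: list[str], config_env: dict[str, str]) -> int:
    """Run `command` with the config merged over the current environment."""
    merged = {**os.environ, **config_env}
    return subprocess.run(command, env=merged).returncode
```

Because the application only ever reads its environment, it needs no Consul client library, which is the point of the "No application changes!" bullet.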
Consul DNS: Service Discovery
$ envconsul myapp/config env
ELASTICSEARCH_HOST=elasticsearch.service.consul.
POSTGRESQL_HOST=master.postgresql.service.consul.
REDIS_HOST=redis.service.consul.
Consul DNS: Service Discovery
• Configuration to point to other services uses DNS.
• No application changes!
Consul Health Checks: Monitoring
$ cat /etc/consul.d/web.json
{
  "check": {
    "name": "http",
    "script": "curl localhost:80",
    "interval": "5s"
  }
}
Consul Health Checks: Monitoring
• Simple shell scripts (UNIXy)
• Logged output
• Failing services won’t show up in service discovery query results.
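A script check boils down to running a command and interpreting its exit code. The sketch below follows the convention Consul uses for script checks (0 is passing, 1 is warning, anything else is critical). The function names and the timeout handling are assumptions for illustration, not the agent's actual code.

```python
import subprocess


def status_from_exit_code(code: int) -> str:
    """Translate a check script's exit code into a health status."""
    if code == 0:
        return "passing"
    if code == 1:
        return "warning"
    return "critical"


def run_check(script: str, timeout: float = 5.0) -> tuple[str, str]:
    """Run a shell check and return (status, captured output).

    Treating a timeout as critical is an assumption of this sketch.
    """
    try:
        result = subprocess.run(
            script, shell=True, capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return "critical", "check timed out"
    return status_from_exit_code(result.returncode), result.stdout + result.stderr
```

For the check on the slide, `run_check("curl localhost:80")` would report passing only while something answers on port 80. The captured output corresponds to the "Logged output" bullet.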
Robust! Add/remove services, reconfigure services, see global state of services without complicated logic. And without modifying application code.
Thank You
http://www.consul.io