How Yelp Does Service Discovery
TRANSCRIPT
How Yelp Does Service Discovery
SmartStack, Docker and Yocalhost
[Demo]
Very Important Things to Note
● This works from (almost) any host at Yelp.
● This works from Python, Java, the command line, etc.
● If a service speaks HTTP or TCP, it can be made discoverable.
○ This includes third-party services such as MySQL and Scribe.
● It's dynamic: if new instances of a service are added, they automatically become available.
Credits
● SmartStack (nerve and synapse) was written by Airbnb.
● We've added some features of our own.
● The work here has been carried out by many people across Yelp.
Registration
Architecture
[Diagram: on each service host, configure_nerve.py generates the nerve configuration; nerve healthchecks service_1, service_2 and service_3 via hacheck and registers them in ZooKeeper (ZK).]
ZooKeeper data
Nerve registers each service instance in ZooKeeper:
/nerve/region:myregion
├── service_1
│   └── server_1_0000013614
├── service_2
│   └── server_1_0000000959
├── service_3
│   ├── server_1_0000002468
│   └── server_2_0000002467
[...]
ZooKeeper data
The data in a znode is all that is required to connect to the corresponding service instance. We'll see shortly how this is used for discovery.
{
  "host": "10.0.0.123",
  "port": 31337,
  "name": "server_1",
  "weight": 10
}
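For illustration, here's a minimal Python sketch of what a discovery client does with a znode payload (the helper name is ours; the field names are the ones shown above):

```python
import json

def endpoint_from_znode(payload):
    """Decode a nerve registration blob into a (host, port) pair."""
    info = json.loads(payload)
    return info["host"], info["port"]

# Example payload, matching the znode data on the slide:
znode = b'{"host": "10.0.0.123", "port": 31337, "name": "server_1", "weight": 10}'
host, port = endpoint_from_znode(znode)
# host == "10.0.0.123", port == 31337
```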
hacheck
Normally hacheck just acts as a transparent proxy for our healthchecks:
$ curl -s yocalhost:6666/http/service_1/1234/status | jq .
{
  "uptime": 5693819.315988064,
  "pid": 2595160,
  "host": "server_1",
  "version": "b6309e09d71da8f1e28213d251f7c3515878caca"
}
hacheck
We can also use it to fail healthchecks before we shut down a service, which lets us shut the service down gracefully. (It also provides a 1s cache to limit the healthcheck rate.)
$ hadown service_1
$ curl -v yocalhost:6666/http/service_1/1234/status
Service service_1 in down state since 1443217910: billings
configure_nerve.py
How do we know which services to advertise? Every service host periodically runs a script to regenerate the nerve configuration, reading from the following sources:
● yelpsoa-configs:
  runs_on:
  - server_1
  - server_2
● puppet:
  nerve_simple::puppet_service { 'foo' }
● the Mesos slave API
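Whatever the source, the script ends up emitting one nerve entry per service instance, with healthchecks routed through hacheck. An illustrative sketch (the stanza shape is ours, not nerve's exact schema; the hacheck URL format matches the earlier slide):

```python
def nerve_services(discovered):
    """Turn discovered (service, port) pairs into per-service nerve stanzas.

    `discovered` would come from the three sources above. Routing the
    healthcheck through hacheck on port 6666 follows the hacheck slides.
    """
    return {
        name: {
            "port": port,
            "check_url": f"http://yocalhost:6666/http/{name}/{port}/status",
        }
        for name, port in discovered
    }
```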
Discovery
Architecture
[Diagram: on each client host, configure_synapse.py generates the synapse configuration; synapse watches ZooKeeper (ZK, populated by nerve) and configures a local haproxy, through which clients connect to services.]
HAProxy
● By default we bind to 0.0.0.0; on public servers we bind only to yocalhost.
● HAProxy gives us a lot of goodies for all clients:
○ Redispatch on connection failures
○ Zero-downtime restarts (once you know how :)
○ Easy to insert connection logging
● Each host also exposes an HAProxy status page for easy introspection.
configure_synapse.py
Every client host periodically runs a script to regenerate the synapse configuration, reading service definitions from yelpsoa-configs.
For each service it reads a smartstack.yaml file, and it restarts synapse if the configuration has changed.
smartstack.yaml
main:
  proxy_port: 20973
  mode: http
  healthcheck_uri: /status
  timeout_server_ms: 1000
Namespaces
main:
  proxy_port: 20001
  mode: http
  healthcheck_uri: /status
  timeout_server_ms: 1000
long_timeout:
  proxy_port: 20002
  mode: http
  healthcheck_uri: /status
  timeout_server_ms: 3000
Same service, different ports.
Escape hatch
Some client libraries like to do their own load balancing, e.g. Cassandra and memcached clients. For those we use synapse to dump the registration information to disk:
$ cat /var/run/synapse/services/devops.demo.json | jq .
[
  {
    "host": "10.0.0.123",
    "port": 31337,
    "name": "server_1",
    "weight": 10
  }
]
Docker + Yocalhost
Architecture
[Diagram: each Docker container has its own lo (127.0.0.1) and an eth0 (169.254.14.17, 169.254.14.18) attached to the host's docker0 bridge (169.254.1.1). The host itself has eth0 10.0.1.2, lo 127.0.0.1 and lo:0 169.254.255.254, the yocalhost address where haproxy listens.]
yocalhost
● We'd like to run only one nerve / synapse / haproxy per host.
● What address should we bind haproxy to?
● 127.0.0.1 won't work from within a container.
● Instead we pick a link-local address, 169.254.255.254 (yocalhost).
● This also works on servers without Docker.
Locality-aware discovery
Overview
We run services both in our own datacenters and in AWS.
We logically group these environments according to latency.
Service authors get to decide how 'widely' their service instances are advertised.
Everything is controlled via smartstack.yaml files.
Latency hierarchies
● habitat: a datacenter, or an AZ in AWS
● region: habitats within a 1ms round-trip, e.g. 'us-west-1' (the ZooKeepers live here)
● superregion: regions within a 5ms round-trip, e.g. 'pacific north-west'
advertise / discover
main:
  proxy_port: 20973
  advertise: [habitat]
  discover: habitat
advertise: [habitat] tells nerve to register this service in the habitat directory of its local ZooKeeper; discover: habitat tells synapse to look in the habitat directory of its local ZooKeeper.
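Given the znode layout shown on the ZooKeeper data slides, the discover setting boils down to choosing which path synapse watches; an illustrative helper (not SmartStack's actual code):

```python
def discovery_path(service, discover, location):
    """Build the ZooKeeper path synapse watches for a service.

    `location` describes where this host sits in the latency hierarchy,
    e.g. {"habitat": "uswest1a", "region": "us-west-1"}.
    """
    return f"/nerve/{discover}:{location[discover]}/{service}"

# With discover: region on a host in us-west-1:
path = discovery_path("service_1", "region", {"region": "us-west-1"})
# path == "/nerve/region:us-west-1/service_1"
```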
ZooKeeper data, revisited
/nerve
├── region:us-west-1
│   └── service_1
│       └── server_1_0000013614
├── region:us-west-2
│   └── service_2
│       └── server_2_0000000959
[...]
Extra advertisements
"Wouldn't it be useful if we could make a service running in datacenter A available in an (arbitrary) datacenter B?"
Why?
● Makes it easier to bring up a new datacenter.
● Makes it easier to add more capacity to a datacenter in an emergency.
● Makes it easier to keep a datacenter going in an emergency if a service fails.
extra_advertise
main:
  advertise: [region]
  discover: region
  extra_advertise:
    region:us-west-1: [region:us-west-2]
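The effect of advertise plus extra_advertise is to expand the set of locations nerve registers under; a sketch of that expansion (the helper and argument shapes are ours, not nerve's):

```python
def registration_paths(service, local, advertise, extra_advertise=None):
    """Compute the ZooKeeper paths nerve should register this instance under.

    `local` maps hierarchy levels to this host's location, e.g.
    {"region": "us-west-1"}; `extra_advertise` maps a source location
    ("region:us-west-1") to extra destinations (["region:us-west-2"]).
    """
    locations = [f"{level}:{local[level]}" for level in advertise]
    for source, destinations in (extra_advertise or {}).items():
        if source in locations:
            locations.extend(d for d in destinations if d not in locations)
    return [f"/nerve/{loc}/{service}" for loc in locations]

# The smartstack.yaml above, evaluated for a host in us-west-1:
paths = registration_paths(
    "service_1", {"region": "us-west-1"}, ["region"],
    {"region:us-west-1": ["region:us-west-2"]},
)
# paths == ["/nerve/region:us-west-1/service_1", "/nerve/region:us-west-2/service_1"]
```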
Design choices
Unix 4eva
● Lots of little components, each doing one thing well.
● Very simple interface for clients and services:
○ If it speaks TCP or HTTP we can register it.
● Easy to independently replace components:
○ HAProxy -> NGINX?
● Easy to observe the behavior of components.
It's OK if ZooKeeper fails
● Nerve and synapse keep retrying.
● HAProxy keeps running, but with no updates.
● HAProxy performs its own healthchecks against service instances:
○ If a service instance becomes unavailable, it stops receiving traffic after a short period.
● The website stays up :)
Does it ~~blend~~ scale?
● We used to have scaling issues with internal load balancers; this is not a problem with SmartStack :)
● We hit some scaling issues at tens of thousands of ZooKeeper connections:
○ We addressed this by using just a single ZooKeeper connection from each nerve and each synapse.
● We used to have lots of HAProxy healthchecks hitting services:
○ hacheck insulates services from this.
○ We limit the HAProxy restart rate.
What about etcd / consul / …?
● We try to use boring components :)
● We're already using ZooKeeper for Kafka and Elasticsearch, so it's natural to use it for our service discovery system too.
● etcd would probably also work, and it's supported by SmartStack.
● Conceptually, this is similar to consul / consul-template.
What about DNS?
● What TTL are you going to use?
● Are your clients even going to honor the TTL?
● Does DNS resolution happen inline with requests?
Conclusions
● We’ve used SmartStack to create a robust service discovery system
● It’s UNIXy: lots of separate components, each doing one thing well
● It’s flexible: locality-aware discovery
● It’s reliable: new devs at Yelp view discovery as a solved problem
● It’s useful: SmartStack is the glue that holds our SOA together