Apache Mesos Ecosystem at Allegro - First Year of Production Use
Wojciech Lesicki - Product Manager
Tomasz Ziarko - Software Engineer
Allegro
Agenda
● What do we do at Allegro?
● Our Mesos Ecosystem
● How do we deploy apps?
● Problems we’ve had
● Q&A
What is Allegro?
Allegro
● 16 years on the market
● Started as an auction site; now the biggest e-commerce company in Poland and one of the biggest in Central and Eastern Europe
● 50% of the e-commerce market and 80% of the m-commerce market in Poland
● 623 items sold every minute
● 14 million users (37% of the population of Poland)
● 201 million visits and 3 billion page views per month
Our infrastructure and IT
● Two data centers
● OpenStack: 510 hosts, 20128 CPUs, 5537 VMs + BaaS with OpenStack Ironic
● Monolith (PHP) and microservices
● Around 500 people in IT, most of them software engineers
OK, so why do we need Mesos?
Our deployment before Mesos
● No standards, no procedures
● Every team deployed in its own way
● Inefficient
Architecture
[Architecture diagram. Components shown: OpenStack, Mesos Master, Mesos Slave (Mesos Executor), Mesos Slave (Docker Executor), Discovery Agent, Zookeeper, Marathon, Discovery (Consul)]
- 100% OpenStack (VMs + bare metal),
- Marathon as the scheduler,
- Zookeeper for sync, state, and leader election,
- Consul for service discovery,
- Separate Mesos and Docker containerizers.
Implementation
- Multiple clusters,
- Each spanning two data centers,
- Separate ecosystem for each,
- Fair-share distribution between data centers.
- Prod (105 slaves, 1000 CPU)
- Test (96 slaves, 368 CPU)
- Dev (30 slaves, 120 CPU)
[Diagram: dc1 and dc2, with three networks and one cluster per network: Prod Network (Prod Mesos Cluster), Test Network (Test Mesos Cluster), Dev Network (Dev Mesos Cluster)]
Implementation
$ terraform apply -var "buildnr=setup234" \
    -var "branch=mesoscon2016" \
    -var "marathon_version=0.15.3-1.ubuntu1404" \
    -var "mesos_version=0.28.0-1boost+glog+protobuf" \
    -var 'masters.dc1=1' \
    -var 'slaves.dc1=2' \
    -var 'slaves.dc2=1'
openstack_compute_instance_v2.mesos-master-dc1: Refreshing state... (ID: ce86ab7a-3660-4702-bba0-5825ae2350b1)
openstack_compute_instance_v2.mesos-slave-dc1.1: Refreshing state... (ID: 39bfd9c1-f6b0-4056-a3ac-28b0136cb220)
openstack_compute_instance_v2.mesos-slave-dc1.0: Refreshing state... (ID: acfb2e86-b4d1-44bd-b9e0-2eb4685a76ff)
openstack_compute_instance_v2.mesos-slave-dc2.0: Creating…
…
Apply complete! Resources: 1 added, 0 changed, 0 destroyed.
Implementation
[Diagram: Mesos together with the supporting services: Discovery, Config Service, SSL Service, MaaS, LBaaS, and AppEngine Console (e.g. Bamboo, Stash, Artifactory)]
Service Discovery
- Registering inside the cluster,
- Automatic or manual registration,
- Failure detection, change detection,
- DC-aware services.
Service Discovery: marathon-consul
[Diagram labels: Marathon Leader, Marathon, Marathon Consul, event bus subscription, Consul Agent]
- Event-based registration of Marathon apps in Consul,
- Forwards data to the appropriate Consul agents,
- Leader aware,
- Cyclic resyncs of all information.
https://github.com/allegro/marathon-consul
Service Discovery
[Diagram: the Mesos Master schedules tasks on Slave 1 … Slave n, each running a Slave Process and a Consul Agent; Marathon Consul, next to the Marathon Leader, registers the running tasks with the Consul Agents]
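The event-based registration described above can be sketched as a small translation step from a Marathon event to a Consul action. This is only an illustration, not the marathon-consul implementation: the function names and the flattening of app ids into service names are assumptions, and only Marathon's `status_update_event` is handled.

```python
from typing import Optional

# Task states after which an instance should no longer be discoverable.
TERMINAL_STATES = {"TASK_FINISHED", "TASK_FAILED", "TASK_KILLED", "TASK_LOST"}


def service_name(app_id: str) -> str:
    """Turn a Marathon app id such as "/group/app" into a flat service name.

    The naming scheme is an assumption made for this sketch.
    """
    return app_id.strip("/").replace("/", "-")


def event_to_action(event: dict) -> Optional[dict]:
    """Map one Marathon status_update_event to a register/deregister action."""
    if event.get("eventType") != "status_update_event":
        return None  # other event types are ignored in this sketch
    status = event["taskStatus"]
    if status == "TASK_RUNNING":
        if not event.get("ports"):
            return None  # nothing to register without a port
        return {
            "action": "register",
            # Forward to the Consul agent on the host that runs the task.
            "agent": event["host"],
            "service": {
                "ID": event["taskId"],
                "Name": service_name(event["appId"]),
                "Address": event["host"],
                "Port": event["ports"][0],
            },
        }
    if status in TERMINAL_STATES:
        return {
            "action": "deregister",
            "agent": event["host"],
            "service_id": event["taskId"],
        }
    return None  # e.g. TASK_STAGING: not running yet, nothing to do
```

Because events can be lost, a translation like this would still need the cyclic resync listed above to reconcile Consul with the full list of Marathon tasks.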
Service Discovery
[Diagram, two flows. Production of discovery events: Conwatch polls the Consul Server (Consul Master, aka the discovery service) and publishes events to Hermes (Kafka). Discovery lookup: a running app, scheduled by the Marathon Leader, looks up services through its Consul agent via DNS or REST.]
Discovery Service
{
  "ID": "mesoscon2016",
  "Name": "mesoscon2016",
  "Tags": ["std-srv", "v1"],
  "Address": "127.0.0.1",
  "Port": 8000
}
$ curl -X POST -d @register_service_on_agent.json 127.0.0.1:8500/v1/agent/service/register
$ curl 127.0.0.1:8500/v1/agent/services | python -m json.tool
"mesoscon2016": {
    "Address": "127.0.0.1",
    "EnableTagOverride": false,
    "ID": "mesoscon2016",
    "ModifyIndex": 0,
    "Port": 8000,
    "Service": "mesoscon2016",
    "Tags": [
        "std-srv",
…..
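The same manual registration can be sketched in Python with only the standard library. The helper name `build_register_request` is hypothetical. The sketch uses `PUT`, which current Consul documentation specifies for this endpoint, whereas the slide used `POST`; the request is built but not sent, so it can be inspected without a running agent.

```python
import json
import urllib.request


def build_register_request(name, port, tags, address="127.0.0.1",
                           agent="http://127.0.0.1:8500"):
    """Build (but do not send) a service registration for a local Consul agent."""
    payload = {
        "ID": name,
        "Name": name,
        "Tags": list(tags),
        "Address": address,
        "Port": port,
    }
    return urllib.request.Request(
        url=f"{agent}/v1/agent/service/register",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


request = build_register_request("mesoscon2016", 8000, ["std-srv", "v1"])
# To actually register against a running agent:
# urllib.request.urlopen(request)
```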
SSL Service
- Custom Mesos hook,
- Part of the microservice contract,
- Vault as the CA solution,
- Short-term certificates/keys,
- Generated for each instance.
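SSL Service
[Diagram, two parts. Application environment setup: on each slave, the Slave Process runs certhook, which uses Vault to extend the Executor's environment. Application usage: services app_x and app_y, found through Consul, talk to each other in SSL mutual mode. Storage is also shown.]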
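To illustrate what requesting a short-term, per-instance certificate could look like, here is a minimal sketch that assumes Vault's PKI secrets engine is used. The slides do not say how certhook talks to Vault, so the mount path `pki`, the role name, the TTL, and the helper name are all hypothetical. The request is only built, not sent.

```python
import json
import urllib.request


def build_issue_request(vault_addr, token, role, common_name,
                        ttl="24h", mount="pki"):
    """Build a certificate request for Vault's PKI issue endpoint.

    mount, role and ttl are illustrative values, not taken from the talk.
    """
    body = {"common_name": common_name, "ttl": ttl}
    return urllib.request.Request(
        url=f"{vault_addr}/v1/{mount}/issue/{role}",
        data=json.dumps(body).encode("utf-8"),
        headers={"X-Vault-Token": token, "Content-Type": "application/json"},
        method="POST",
    )


cert_request = build_issue_request(
    "https://vault.example:8200",      # hypothetical Vault address
    "example-token",                   # hypothetical token
    "microservice",                    # hypothetical PKI role
    "app-x.service.example",           # hypothetical instance name
)
```

A hook along these lines would then place the returned certificate and key into the task's environment, which is the "extend env" step in the diagram.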
Config Service
- Secure storage,
- Fetched over mutual SSL,
- Version-controlled config,
- Authenticated apps only,
- Easy to use,
- Peer review of changes.
Config Service
[Diagram: configuration is pushed (git push) to a configured Git repository that holds revisions X, Y, Z with the valuable data encrypted; a starting app fetches its config data from the Config Service over mutual SSL, and the Config Service gets the revision and environment config from the repository]
MAAS
- Metrics collected,
- Dashboards set up,
- Service owners get notified,
- Triggers, not mandatory,
- Multiple monitoring solutions.
MAAS
[Diagram: Diamond Collectors on the Mesos Master and Mesos Slave send metrics to Graphite, visualized in Grafana; check definitions are kept in a Git repo and feed Cabot triggers; Mesos cluster events go through Kafka (Hermes); developers subscribe and are notified by email and PagerDuty]
LBAAS
- Based on discovery,
- Available through discovery tags,
- HAProxy at the core.
LBAAS
[Diagram: instances of Service X and Service Y are registered and unregistered in the Consul service catalog; discovery changes reach LBaaS through Kafka/Hermes (pub/sub); LBaaS drives HAProxy and, through VaaS, Varnish; a REST config interface is also shown]
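As a rough illustration of a load balancer that is driven by discovery tags, the sketch below renders an HAProxy backend section from a list of service instances shaped like Consul agent service entries. The tag name `lbaas`, the function name, and the `roundrobin` choice are assumptions, not details from the talk.

```python
def render_backend(service: str, instances: list, required_tag: str) -> str:
    """Render an HAProxy backend containing only instances with required_tag."""
    lines = [f"backend {service}", "    balance roundrobin"]
    for inst in instances:
        if required_tag not in inst.get("Tags", []):
            continue  # instance did not opt in to load balancing
        lines.append(
            f"    server {inst['ID']} {inst['Address']}:{inst['Port']} check"
        )
    return "\n".join(lines) + "\n"


example_instances = [
    {"ID": "app-1", "Address": "10.0.0.1", "Port": 8000, "Tags": ["lbaas"]},
    {"ID": "app-2", "Address": "10.0.0.2", "Port": 8000, "Tags": []},
]
print(render_backend("mesoscon2016", example_instances, "lbaas"))
```

Only `app-1` ends up in the backend here, because `app-2` does not carry the tag.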
Implementation
[Diagram of the complete ecosystem. Components shown: Mesos Master, Mesos Agents, Consul Agent, marathon-consul, Consul Server, Conwatch, Kafka, Vault, Graphite, MAAS, VAAS]
Demo
Figures
What our Mesos Ecosystem gives our devs:
1. Fast and easy deployment of new applications
2. Standardization (e.g. out-of-the-box monitoring tools)
3. Automation
4. Self-healing
The bumpy road // solved
Netisolation killing slaves
- Enabled network isolation,
- Many cyclic deploys on the test environment,
- Consulted our fellow Mesos developers,
- Decided to disable it,
- Problem solved.
Marathon registers multiple times
- On an error while getting znode data,
- Marathon registers with a different framework ID,
- Exhausting resources in the cluster,
- After version 0.14 the behaviour changed,
- Now Marathon just waits,
- Maybe a problem with Zookeeper, maybe not; solved anyway.
Deploy constraints
- We want instances spread across DCs/zones,
- Worked unpredictably,
- Took into account applications that were about to be shut down,
- Multi-constraint definitions were prone to unpredictable behaviour,
- Solved in the newest version, so far.
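For context, spreading instances across data centers in Marathon is typically expressed with a `GROUP_BY` constraint on an agent attribute. The sketch below assumes the agents expose an attribute named `dc`; that attribute name, the resource values, and both helper names are hypothetical. The second helper shows one way to check whether a placement actually came out even.

```python
import json


def app_definition(app_id, instances, dc_attribute="dc", dc_count=2):
    """Build a Marathon app definition that asks for an even spread over DCs."""
    return {
        "id": app_id,
        "instances": instances,
        "cpus": 0.5,   # illustrative resource values
        "mem": 256,
        # GROUP_BY distributes tasks evenly across the attribute's values.
        "constraints": [[dc_attribute, "GROUP_BY", str(dc_count)]],
    }


def is_evenly_spread(task_dcs, all_dcs):
    """Return True if per-DC task counts differ by at most one."""
    counts = [task_dcs.count(dc) for dc in all_dcs]
    return max(counts) - min(counts) <= 1


print(json.dumps(app_definition("/mesoscon2016", 4), indent=2))
```

A check like `is_evenly_spread` is useful after a deployment, since the point above is precisely that the observed placement did not always match the constraint.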
Readiness checks ...
- Applications are deployed and upgraded on the blue-green principle,
- Recently started instances are not ready to handle load,
- No standard mechanism for checking that applications are really running,
- Check passed? No? Doesn't matter,
- We developed a custom service wrapper.
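The idea behind such a wrapper can be sketched as a loop that holds an instance back until its readiness check has passed several times in a row. This is not Allegro's wrapper: the function name, the defaults, and the consecutive-success rule are assumptions for illustration.

```python
import time
from typing import Callable


def wait_until_ready(check: Callable[[], bool],
                     timeout_s: float = 30.0,
                     interval_s: float = 1.0,
                     required_successes: int = 2,
                     sleep: Callable[[float], None] = time.sleep,
                     clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll check() until it passes required_successes times in a row.

    Returns True when the instance is considered ready, False on timeout.
    sleep and clock are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout_s
    streak = 0
    while clock() < deadline:
        try:
            ok = bool(check())
        except Exception:
            ok = False  # a failing check counts as "not ready yet"
        streak = streak + 1 if ok else 0
        if streak >= required_successes:
            return True
        sleep(interval_s)
    return False
```

A wrapper would call this after starting the application and only register the instance in discovery (or report it healthy) once it returns True, so a blue-green upgrade does not shift load onto instances that are still warming up.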
The bumpy road # occurring
- DC failure; AWS standby master for quorum,
- Application scaling: usage vs allocation (we are trying to build our own autoscaling),
- User authorization, per-user quotas,
- Graceful shutdown,
- Various endpoints open without authorization.
In a nutshell - you have seen
● Our Mesos Ecosystem
● Our deployment
● Our bumpy road
Mesos - it takes some time and effort, but it's worth it.