Apache Mesos Ecosystem at Allegro - First Year of Production Use Wojciech Lesicki - Product Manager Tomasz Ziarko - Software Engineer Allegro

Upload: wojciech-lesicki

Post on 15-Apr-2017

TRANSCRIPT

Page 1: Apache Mesos Ecosystem at Allegro First Year of Production Use

Apache Mesos Ecosystem at Allegro - First Year of Production Use

Wojciech Lesicki - Product Manager
Tomasz Ziarko - Software Engineer

Allegro

Page 2: Apache Mesos Ecosystem at Allegro First Year of Production Use

Agenda

● What do we do at Allegro?
● Our Mesos ecosystem
● How do we deploy apps?
● Problems we've had
● Q&A

Page 3: Apache Mesos Ecosystem at Allegro First Year of Production Use

What is Allegro?

Page 4: Apache Mesos Ecosystem at Allegro First Year of Production Use

Allegro

● 16 years on the market
● Started as an auction site; now the biggest e-commerce company in Poland and one of the biggest in Central and Eastern Europe
● 50% of the e-commerce market and 80% of the m-commerce market in Poland

● 623 items sold every minute

Page 5: Apache Mesos Ecosystem at Allegro First Year of Production Use

● 14 million users (37% of the population of Poland)
● 201 million visits and 3 billion page views per month

Page 6: Apache Mesos Ecosystem at Allegro First Year of Production Use

Our infrastructure and IT

● Two data centers
● OpenStack: 510 hosts, 20,128 CPUs, 5,537 VMs, plus BaaS with OpenStack Ironic
● Monolith (PHP) and microservices
● Around 500 people in IT, most of them software engineers

Page 7: Apache Mesos Ecosystem at Allegro First Year of Production Use

OK, so why do we need Mesos?

Page 8: Apache Mesos Ecosystem at Allegro First Year of Production Use
Page 9: Apache Mesos Ecosystem at Allegro First Year of Production Use

Our deployment before Mesos

● No standards, no procedures
● Every team deployed in its own way
● Inefficient

Page 10: Apache Mesos Ecosystem at Allegro First Year of Production Use

Architecture

Page 11: Apache Mesos Ecosystem at Allegro First Year of Production Use

Implementation

[Diagram labels: OpenStack, Mesos Slave with Mesos Executor, Mesos Slave with Docker Executor, Mesos Master, Discovery Agent (on each slave), ZooKeeper, Marathon, Discovery (Consul)]

- 100% OpenStack (VMs + bare metal),
- Marathon as the scheduler,
- ZooKeeper for sync, state, and leader election,
- Consul for service discovery,
- Separate Mesos and Docker containerizers.

Page 12: Apache Mesos Ecosystem at Allegro First Year of Production Use

Implementation

- Multiple clusters,
- Each spanning two data centers,
- Separate ecosystems,
- Fair-share distribution between data centers.

Clusters:

- Prod (105 slaves, 1000 CPUs)
- Test (96 slaves, 368 CPUs)
- Dev (30 slaves, 120 CPUs)

[Diagram: Prod, Test, and Dev Mesos clusters, each on its own network (Prod Network, Test Network, Dev Network), spanning dc1 and dc2]

Page 13: Apache Mesos Ecosystem at Allegro First Year of Production Use

Implementation

    $ terraform apply -var "buildnr=setup234" \
        -var "branch=mesoscon2016" \
        -var "marathon_version=0.15.3-1.ubuntu1404" \
        -var "mesos_version=0.28.0-1boost+glog+protobuf" \
        -var 'masters.dc1=1' \
        -var 'slaves.dc1=2' \
        -var 'slaves.dc2=1'

    openstack_compute_instance_v2.mesos-master-dc1: Refreshing state... (ID: ce86ab7a-3660-4702-bba0-5825ae2350b1)
    openstack_compute_instance_v2.mesos-slave-dc1.1: Refreshing state... (ID: 39bfd9c1-f6b0-4056-a3ac-28b0136cb220)
    openstack_compute_instance_v2.mesos-slave-dc1.0: Refreshing state... (ID: acfb2e86-b4d1-44bd-b9e0-2eb4685a76ff)
    openstack_compute_instance_v2.mesos-slave-dc2.0: Creating...
    ...
    Apply complete! Resources: 1 added, 0 changed, 0 destroyed.

Page 14: Apache Mesos Ecosystem at Allegro First Year of Production Use

Implementation

[Diagram labels: MESOS together with Discovery, Config Service, SSL Service, MaaS, LBaaS, and AppEngine Console (e.g. Bamboo, Stash, Artifactory)]

Page 15: Apache Mesos Ecosystem at Allegro First Year of Production Use

Service Discovery

- Registration inside the cluster,
- Automatic or manual registration,
- Failure detection and change detection,
- DC-aware services.

Page 16: Apache Mesos Ecosystem at Allegro First Year of Production Use

Service Discovery

marathon-consul

- Event-based registration of Marathon apps in Consul,
- Forwards data to the appropriate Consul agents,
- Leader aware,
- Cyclic resyncs of all information.

https://github.com/allegro/marathon-consul

[Diagram labels: Marathon Leader (Marathon, Marathon Consul), event bus subscription, Consul Agent]
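The event-based registration described on this slide can be sketched roughly as follows. This is a minimal illustration, not the marathon-consul implementation: it assumes a Marathon `status_update_event` as input and produces a payload in the shape accepted by Consul's `/v1/agent/service/register` endpoint. The function name, the service naming rule, and the sample event values are hypothetical.

```python
import json


def registration_from_event(event, tags=()):
    """Map a Marathon status_update_event to a Consul registration payload.

    Returns None for events that should not result in a registration.
    """
    if event.get("eventType") != "status_update_event":
        return None
    if event.get("taskStatus") != "TASK_RUNNING":
        return None
    ports = event.get("ports") or []
    if not ports:
        return None
    return {
        "ID": event["taskId"],
        # Hypothetical naming rule: "/shop/cart" becomes "shop-cart".
        "Name": event["appId"].strip("/").replace("/", "-"),
        "Tags": list(tags),
        "Address": event["host"],
        "Port": ports[0],
    }


# Sample event with made-up values, trimmed to the fields used above.
example_event = {
    "eventType": "status_update_event",
    "taskStatus": "TASK_RUNNING",
    "appId": "/shop/cart",
    "taskId": "shop_cart.0001",
    "host": "slave-1.example",
    "ports": [31001],
}

print(json.dumps(registration_from_event(example_event, tags=["marathon"]), indent=2))
```

A real integration would also deregister services on terminal task states and resync periodically, as the bullets above mention.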

Page 17: Apache Mesos Ecosystem at Allegro First Year of Production Use

Service Discovery

[Diagram: Slave 1 through Slave n, each running a Slave Process and a Consul Agent; Marathon Leader (Marathon, Marathon Consul); Mesos Master. Arrow labels: Schedule, Register running tasks]

Page 18: Apache Mesos Ecosystem at Allegro First Year of Production Use

Service Discovery

[Diagram, two flows. Production of discovery events: Conwatch polls the Consul Server on the Consul Master (aka discovery service) and publishes events to Hermes - Kafka. Discovery lookup: a running app does a service lookup over DNS or REST through its Consul agent. Also shown: Marathon Leader]
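The REST variant of the discovery lookup can be illustrated with a small sketch. It assumes the response shape of Consul's `GET /v1/health/service/<name>?passing` endpoint (a list of entries with `Node` and `Service` objects); the function name and the sample data are hypothetical, and the HTTP call itself is left out so the example runs offline.

```python
def healthy_endpoints(health_response, required_tag=None):
    """Extract (address, port) pairs from a Consul health-service response.

    If required_tag is given, only instances carrying that tag are returned.
    """
    endpoints = []
    for entry in health_response:
        service = entry["Service"]
        tags = service.get("Tags") or []
        if required_tag is not None and required_tag not in tags:
            continue
        # Consul leaves Service.Address empty when the node address applies.
        address = service.get("Address") or entry["Node"]["Address"]
        endpoints.append((address, service["Port"]))
    return endpoints


# Trimmed sample response with made-up addresses.
sample_response = [
    {
        "Node": {"Node": "slave-1", "Address": "10.0.0.1"},
        "Service": {"ID": "a", "Service": "mesoscon2016",
                    "Tags": ["std-srv", "v1"], "Address": "", "Port": 8000},
    },
    {
        "Node": {"Node": "slave-2", "Address": "10.0.0.2"},
        "Service": {"ID": "b", "Service": "mesoscon2016",
                    "Tags": ["std-srv"], "Address": "10.0.1.2", "Port": 8001},
    },
]

print(healthy_endpoints(sample_response, required_tag="v1"))
```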

Page 19: Apache Mesos Ecosystem at Allegro First Year of Production Use

Discovery Service

    {
      "ID": "mesoscon2016",
      "Name": "mesoscon2016",
      "Tags": ["std-srv", "v1"],
      "Address": "127.0.0.1",
      "Port": 8000
    }

    $ curl -X POST -d @register_service_on_agent.json 127.0.0.1:8500/v1/agent/service/register
    $ curl 127.0.0.1:8500/v1/agent/services | python -m json.tool

    "mesoscon2016": {
        "Address": "127.0.0.1",
        "EnableTagOverride": false,
        "ID": "mesoscon2016",
        "ModifyIndex": 0,
        "Port": 8000,
        "Service": "mesoscon2016",
        "Tags": [
            "std-srv",
    …..

Page 20: Apache Mesos Ecosystem at Allegro First Year of Production Use

SSL Service

- Custom Mesos hook,
- Part of the microservice contract,
- Vault as the CA solution,
- Short-term certificates/keys,
- Generated for each instance.
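One way to picture the per-instance, short-lived certificates is a request to Vault's PKI secrets engine. The slides do not say which Vault backend the hook uses, so this is only a sketch under that assumption: the mount name `pki`, the role name, the host names, and the TTL are hypothetical. The request is built but not sent, so the example runs without a Vault server.

```python
import json
import urllib.request


def build_issue_request(vault_addr, token, role, common_name, ttl="24h", mount="pki"):
    """Build (but do not send) a certificate request for Vault's PKI engine."""
    url = f"{vault_addr}/v1/{mount}/issue/{role}"
    body = json.dumps({"common_name": common_name, "ttl": ttl}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"X-Vault-Token": token},
    )


# Hypothetical values for illustration only.
request = build_issue_request(
    vault_addr="https://vault.example:8200",
    token="s.example-token",
    role="mesos-task",
    common_name="app-x.service.example",
    ttl="1h",
)
print(request.get_method(), request.full_url)
# urllib.request.urlopen(request) would send it; with the PKI engine, the JSON
# response carries the certificate and private key under the "data" key.
```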

Page 21: Apache Mesos Ecosystem at Allegro First Year of Production Use

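SSL Service

[Diagram, two parts. Application environment setup: on Slave 1, the Slave Process runs certhook, which uses Vault (with its Storage) and extends the environment of the Executor. Application usage: app_x and app_y communicate in SSL mutual mode; the Consul service is also shown]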

Page 22: Apache Mesos Ecosystem at Allegro First Year of Production Use

Config Service

- Secure storage,
- Fetched over mutual SSL,
- Version-controlled config,
- Authorized apps only,
- Easy to use,
- Peer review of changes.

Page 23: Apache Mesos Ecosystem at Allegro First Year of Production Use

Config Service

[Diagram, two flows. Fetch config data: a starting app connects to the config service over mutual SSL and gets the revision and environment config from a Git repository (Revision X, Revision Y, Revision Z; encrypted data). Push configuration: encrypted valuable data is pushed with git push from a configured git repo]

Page 24: Apache Mesos Ecosystem at Allegro First Year of Production Use

MAAS

- Metrics collected,
- Dashboards set up,
- Service owners get notified,
- Triggers, not mandatory,
- Multiple monitoring solutions.
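As an illustration of the metrics-collection step, the sketch below formats metrics in Graphite's plaintext protocol (`<path> <value> <timestamp>`, one metric per line), which is the format a collector ultimately delivers to Graphite. The metric paths and values are made up, and the helper names are hypothetical.

```python
import time


def graphite_line(path, value, timestamp=None):
    """Format one metric in Graphite's plaintext protocol."""
    ts = int(time.time() if timestamp is None else timestamp)
    return f"{path} {value} {ts}\n"


def graphite_payload(metrics, timestamp):
    """Join several metrics into one payload, sorted by path for stable output."""
    lines = (graphite_line(path, value, timestamp)
             for path, value in sorted(metrics.items()))
    return "".join(lines).encode("ascii")


# Hypothetical metric names and values.
payload = graphite_payload(
    {"mesos.slave1.cpus_used": 3.5, "mesos.slave1.mem_used": 2048},
    timestamp=1460000000,
)
print(payload.decode("ascii"), end="")
# The payload could then be written to a TCP socket connected to the Graphite
# (carbon) plaintext listener, which conventionally listens on port 2003.
```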

Page 25: Apache Mesos Ecosystem at Allegro First Year of Production Use

MAAS

[Diagram labels: Diamond Collector on the Mesos Master and on the Mesos Slave, Metric, Graphite, Grafana, Cabot, Git repo with check definitions, Triggers, Notifications, Mesos Cluster Events, Kafka - Hermes, Subscription, Events, Notify, Email, Pagerduty, Developer]

Page 26: Apache Mesos Ecosystem at Allegro First Year of Production Use

LBAAS

- Based on discovery,
- Available through discovery tags,
- HAProxy at the core.
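How discovery data and tags might drive HAProxy can be sketched as a small config renderer. This is an assumption about the general approach, not the LBAAS implementation: the tag name `lbaas`, the function name, and the instance data are hypothetical. Only the `backend` / `server ... check` syntax is standard HAProxy configuration.

```python
def render_backend(service_name, instances, lb_tag="lbaas"):
    """Render an HAProxy backend section from discovered service instances.

    Only instances carrying lb_tag are exposed through the load balancer.
    """
    lines = [f"backend {service_name}"]
    for inst in sorted(instances, key=lambda i: i["ID"]):
        if lb_tag not in inst.get("Tags", []):
            continue
        lines.append(f"    server {inst['ID']} {inst['Address']}:{inst['Port']} check")
    return "\n".join(lines) + "\n"


# Made-up instances in the shape of Consul service entries.
instances = [
    {"ID": "shop-2", "Address": "10.0.0.2", "Port": 31002, "Tags": ["lbaas"]},
    {"ID": "shop-1", "Address": "10.0.0.1", "Port": 31001, "Tags": ["lbaas", "v1"]},
    {"ID": "shop-3", "Address": "10.0.0.3", "Port": 31003, "Tags": ["v1"]},
]

print(render_backend("shop", instances), end="")
```

Sorting by instance ID keeps the rendered config stable between runs, so a reload is only needed when the set of instances actually changes.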

Page 27: Apache Mesos Ecosystem at Allegro First Year of Production Use

LBAAS

[Diagram labels: Consul service catalog (Service X information and Service Y information, each with Instance x and Instance y), Register instance, Unregister instance, Disco, KAFKA/HERMES (Pub/Sub), REST config, LBAAS (HAProxy), VAAS (Varnish)]

Page 28: Apache Mesos Ecosystem at Allegro First Year of Production Use

Implementation

[Diagram labels: Mesos Master, Mesos Agent (two), Consul Agent, Consul Server, marathon-consul, Conwatch, Kafka, Vault, Graphite, MAAS, VAAS]

Page 29: Apache Mesos Ecosystem at Allegro First Year of Production Use

Demo

Page 30: Apache Mesos Ecosystem at Allegro First Year of Production Use

Figures

Page 35: Apache Mesos Ecosystem at Allegro First Year of Production Use

What our Mesos Ecosystem gives our devs:

1. Fast and easy deployment of new applications

2. Standardization (e.g. out-of-the-box monitoring tools)

3. Automation

4. Self-healing

Page 36: Apache Mesos Ecosystem at Allegro First Year of Production Use

The bumpy road // solved

Page 37: Apache Mesos Ecosystem at Allegro First Year of Production Use

Netisolation killing slaves

Page 38: Apache Mesos Ecosystem at Allegro First Year of Production Use

Netisolation killing slaves

- Enabled isolation,
- Many cyclic deploys on the test environment,
- Consulted our fellow Mesos developers,
- Decided to disable it,
- Problem solved.

Page 39: Apache Mesos Ecosystem at Allegro First Year of Production Use

Marathon registers multiple times

- On an error while getting znode data,
- Marathon registers with a different framework ID,
- Exhausting resources in the cluster,
- Behaviour changed after version 0.14,
- Now Marathon just waits,
- Maybe a ZooKeeper problem, maybe not; solved anyway.

Page 40: Apache Mesos Ecosystem at Allegro First Year of Production Use

Deploy constraints

Page 41: Apache Mesos Ecosystem at Allegro First Year of Production Use

Deploy constraints

- We want instances spread across DCs/zones,
- Worked unpredictably,
- Took into account applications that were about to be shut down,
- Multi-constraint definitions were prone to unpredictable behaviour,
- Solved in the newest version, so far.
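For context, a cross-DC spread is typically expressed in a Marathon app definition with a `GROUP_BY` constraint on an agent attribute. The sketch below builds such a definition; the attribute name `dc`, the app id, and the resource values are hypothetical, and the slides do not show the constraints actually used.

```python
import json


def app_definition(app_id, cmd, instances, dc_attribute="dc", dc_count=2):
    """Build a Marathon app definition that spreads tasks across data centers.

    Marathon constraints are [field, operator, value] triples; GROUP_BY asks
    for an even distribution over the values of the given agent attribute.
    """
    return {
        "id": app_id,
        "cmd": cmd,
        "instances": instances,
        "cpus": 0.1,
        "mem": 64,
        "constraints": [[dc_attribute, "GROUP_BY", str(dc_count)]],
    }


app = app_definition("/mesoscon2016", "python3 -m http.server 8000", instances=4)
print(json.dumps(app, indent=2))
# The JSON document could then be POSTed to Marathon's /v2/apps endpoint.
```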

Page 42: Apache Mesos Ecosystem at Allegro First Year of Production Use

Readiness checks ...

Page 43: Apache Mesos Ecosystem at Allegro First Year of Production Use

Readiness checks ...

- Applications are deployed and upgraded on the blue-green principle,
- Recently started instances are not ready to handle load,
- No standard mechanism for checking that applications are really running,
- Check passed? Not passed? Doesn't matter,
- We developed a custom service wrapper.
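The slides do not describe the custom service wrapper, so the following is only a sketch of the general idea behind a readiness gate: poll a probe until the application reports ready, or give up after a timeout. The function name and parameters are hypothetical; the probe is injected so that any check (an HTTP request, a TCP connect) can be plugged in.

```python
import time


def wait_until_ready(probe, timeout=30.0, interval=1.0,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll probe() until it returns a truthy value or the timeout expires.

    A probe that raises is treated as "not ready yet". Returns True when the
    application became ready in time, False otherwise.
    """
    deadline = clock() + timeout
    while True:
        try:
            if probe():
                return True
        except Exception:
            pass  # not ready yet, keep polling until the deadline
        if clock() >= deadline:
            return False
        sleep(interval)


# Demo with a fake probe that becomes ready on the third attempt.
attempts = {"count": 0}


def fake_probe():
    attempts["count"] += 1
    return attempts["count"] >= 3


print(wait_until_ready(fake_probe, timeout=5, interval=0))
```

A wrapper built on this would register the instance in discovery, or report it healthy, only after the function returns True, so new instances get traffic only once they can handle it.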

Page 44: Apache Mesos Ecosystem at Allegro First Year of Production Use

The bumpy road # occurring

- DC failure: AWS standby master for quorum,
- Application scaling: usage vs. allocation (we are trying to build our own autoscaling),
- User authorization, per-user quotas,
- Graceful shutdown,
- Various endpoints open without authorization.

Page 45: Apache Mesos Ecosystem at Allegro First Year of Production Use

In a nutshell - you have seen

● Our Mesos ecosystem
● Our deployment
● Our bumpy road

Page 46: Apache Mesos Ecosystem at Allegro First Year of Production Use

Mesos - it takes some time and effort

Page 47: Apache Mesos Ecosystem at Allegro First Year of Production Use

Mesos - it takes some time and effort, but it's worth it.