The Good Parts / The Hard Parts

Noah Zoschke
[email protected]
@nzoschke
03/01/2016

CONVOX
Open Source PaaS
https://github.com/convox/rack

Upload: noah-zoschke

Post on 12-Apr-2017


TRANSCRIPT

Page 1: The Good Parts / The Hard Parts

╔══════════════════════════════════════════╗
║     The Good Parts / The Hard Parts      ║
║                                          ║
║               Noah Zoschke               ║
║            [email protected]             ║
║                @nzoschke                 ║
║                                          ║
║                03/01/2016                ║
╚══════════════════════════════════════════╝

CONVOX

Open Source PaaS https://github.com/convox/rack

Page 2: The Good Parts / The Hard Parts

• Provision new infrastructure

• Update base operating system

• Add capacity with horizontal and vertical scaling

• Monitor health

• Handle failures automatically

• Create new apps

• Deploy new code

• Add capacity with horizontal and vertical scaling

• Configure secrets and services

• Debug problems and tune performance

• Monitor health

• Handle failures automatically

MAKE DEVOPS BORING

Page 3: The Good Parts / The Hard Parts

CONVOX OPEN SOURCE TOOLKIT ⟷ IAAS

Racks ⟷ ASG, CF, Dynamo, EC2, ECS, IAM, VPC

Apps ⟷ CF, ECS, ELB

Scale ⟷ ASG, CF, ECS

Environments ⟷ KMS, S3

Builds ⟷ ECR, S3

Logs ⟷ CloudWatch, Kinesis, Lambda

Metrics ⟷ CloudWatch Metrics

Notifications ⟷ SNS

Page 4: The Good Parts / The Hard Parts

$ convox install

     ___    ___     ___   __  __    ___   __  _
    / ___\ / __ \ /  _  \/\ \/\ \  / __ \/\ \/ \
   /\ \__//\ \_\ \/\ \/\ \ \ \_/ |/\ \_\ \/>  </
   \ \____\ \____/\ \_\ \_\ \___/ \ \____//\_/\_\
    \/____/\/___/  \/_/\/_/\/__/   \/___/ \//\/_/

Installing Convox (20160301181624-ps-docker)...
Created CloudWatch Log Group: convox-629-LogGroup-15GUSB6EN2K2X
Created ECS Cluster: convox-629-Cluster-MEMQU17FHAI
Created VPC Internet Gateway: igw-f976db9d
Created VPC: vpc-b97c50dd
Created DynamoDB Table: convox-629-builds
Created Kinesis Stream: convox-629-Kinesis-1W4W11098ATSZ
Created DynamoDB Table: convox-629-releases
Created Security Group: sg-a48528dc
Created Security Group: sg-a58528dd
Created Routing Table: rtb-d7fb0db0
Created Lambda Function: convox-629-CustomTopic-V5MWTXYOE3WK
Created KMS Key: EncryptionKey
Created VPC Subnet: subnet-5c2f8004
Created Elastic Load Balancer: convox-629
Created ECS TaskDefinition: ApiWebTasks
Created ECS TaskDefinition: ApiMonitorTasks
Created ECS Service: ApiMonitor
Created ECS Service: ApiWeb
Created AutoScalingGroup: convox-629-Instances-90LARL67DSMD
Created CloudFormation Stack: convox-629
Waiting for load balancer...
Logging in...
Success, try `convox apps`

CLI PROVISION NEW INFRASTRUCTURE

Page 5: The Good Parts / The Hard Parts

CLI CREATE + DEPLOY APPS

$ convox apps create httpd
Creating app httpd... CREATING

$ convox deploy
Deploying httpd
Creating tarball... OK
Uploading... OK
RUNNING: docker pull httpd
...
RUNNING: docker tag -f httpd httpd/web
RUNNING: docker tag -f httpd/web 568149725493.dkr.ecr.us-east-1.amazonaws.com/httpd-lokxbjnlam:web.BDDAIVOGDRV
RUNNING: docker push 568149725493.dkr.ecr.us-east-1.amazonaws.com/httpd-lokxbjnlam:web.BDDAIVOGDRV
...
Promoting RLDKBXUUMLV... UPDATING

Page 6: The Good Parts / The Hard Parts

$ convox apps
APP    STATUS
myapp  running

$ convox apps info
Name       myapp
Status     running
Release    REXIQURVKXE
Processes  admin web
Hostname   myapp-1749418666.us-east-1.elb.amazonaws.com
Ports      web:80 web:443 admin:9322

$ convox ps
ID           NAME   RELEASE      CPU    MEM     STARTED       COMMAND
13254981d20  admin  REXIQURVKXE  0.47%  2.21%   17 hours ago  bin/admin
92d4a822c13  web    REXIQURVKXE  3.29%  20.68%  17 hours ago  bin/web

$ convox env
PASSWORD=xyzzy

$ convox logs
web: [01/Jan/2015:00:00:00] "GET / HTTP/1.1" 200 554 0.0027
web: [01/Jan/2015:00:00:00] "POST /users HTTP/1.1" 303 - 0.0049

$ convox rack update
Updating to 20160220003627

CLI MANAGE EVERYTHING

Page 7: The Good Parts / The Hard Parts

$ convox api get /apps/myapp/processes
[
  {
    "app": "myapp",
    "command": "bin/web",
    "cpu": 0.0329,
    "host": "10.0.3.135",
    "id": "13254981d20",
    "image": "registry.internal:5000/myapp-web:BHLRYHSMXNM",
    "memory": 0.2068,
    "name": "web",
    "ports": [
      "80:3000",
      "443:3001"
    ],
    "release": "REXIQURVKXE",
    "started": "2015-01-01T00:00:00Z"
  }
]

API WE DESERVE A REST FROM AWS APIS
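The API reports `cpu` and `memory` as fractions, while `convox ps` prints percentages. A minimal Python sketch of that client-side formatting, assuming the response shape shown on this slide. The helper name `format_processes` is hypothetical and not part of the Convox CLI.

```python
import json

# Sample payload copied from the slide; a real client would fetch it over HTTP.
RESPONSE = """
[{"app": "myapp", "command": "bin/web", "cpu": 0.0329, "host": "10.0.3.135",
  "id": "13254981d20", "image": "registry.internal:5000/myapp-web:BHLRYHSMXNM",
  "memory": 0.2068, "name": "web", "ports": ["80:3000", "443:3001"],
  "release": "REXIQURVKXE", "started": "2015-01-01T00:00:00Z"}]
"""

def format_processes(body):
    """Render a process list as rows of (id, name, release, cpu%, mem%, command)."""
    rows = []
    for p in json.loads(body):
        rows.append((
            p["id"], p["name"], p["release"],
            "{:.2%}".format(p["cpu"]),      # fraction -> percentage string
            "{:.2%}".format(p["memory"]),
            p["command"],
        ))
    return rows

for row in format_processes(RESPONSE):
    print("  ".join(row))
```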

Page 8: The Good Parts / The Hard Parts

CONVOX OPEN SOURCE TOOLKIT ⟷ IAAS

Manage ⟷ CloudFormation

Schedule ⟷ EC2 Container Service

Glue ⟷ Lambda

Page 9: The Good Parts / The Hard Parts

INFRASTRUCTURE AUTOMATION with CloudFormation

Page 10: The Good Parts / The Hard Parts

PARAMETERIZED INFRASTRUCTURE

→ Ami           ami-c5fa5aae
→ InstanceCount 3
→ InstanceType  t2.small
→ Password      PuDpyqGTmxBN8ziGJ9UiMfrfGZfHDG
→ Tenancy       default
→ Version       20151204013151
→ VolumeSize    30
→ VPCCIDR       10.0.0.0/16

↑ Balancer          convox                                AWS::ElasticLoadBalancing::LoadBalancer
↑ Cluster           convox-Cluster-1JI343QBLSMYJ          AWS::ECS::Cluster
↑ DynamoBuilds      convox-builds                         AWS::DynamoDB::Table
↑ DynamoReleases    convox-releases                       AWS::DynamoDB::Table
↑ EncryptionKey     arn:aws:kms:...:key/d40c0153...       Custom::KMSKey
↑ IamRole           convox-IamRole-M1YZSNXNS1F7           AWS::IAM::Role
↑ Instances         convox-Instances-PCWRQ6OWDWTT         AWS::AutoScaling::AutoScalingGroup
↑ Kinesis           convox-Kinesis-C09RDWFR8NOE           AWS::Kinesis::Stream
↑ NotificationTopic arn:aws:sns:...:convox-notifications  AWS::SNS::Topic
↑ Settings          convox-settings-13c91daqrj90z         AWS::S3::Bucket
↑ Vpc               vpc-b27ff8d6                          AWS::EC2::VPC

← Dashboard convox-820546104.us-east-1.elb.amazonaws.com
← Kinesis   convox-Kinesis-C09RDWFR8NOE
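With infrastructure expressed as stack parameters, a change such as scaling amounts to submitting new parameter values and letting CloudFormation work out the difference. A rough Python sketch of building the `Parameters` list in the shape the CloudFormation UpdateStack API accepts (`ParameterKey` with either `ParameterValue` or `UsePreviousValue`). The helper `stack_parameters` is hypothetical; the slides do not show how Convox assembles its requests.

```python
def stack_parameters(current, changes):
    """Build a CloudFormation-style Parameters list.

    Changed keys carry a new ParameterValue; everything else is marked
    UsePreviousValue so the stack keeps what it already has.
    """
    unknown = set(changes) - set(current)
    if unknown:
        raise KeyError("not a stack parameter: " + ", ".join(sorted(unknown)))
    params = []
    for key in sorted(current):
        if key in changes and str(changes[key]) != str(current[key]):
            params.append({"ParameterKey": key, "ParameterValue": str(changes[key])})
        else:
            params.append({"ParameterKey": key, "UsePreviousValue": True})
    return params

# Parameter values copied from the slide (Password omitted).
RACK = {"Ami": "ami-c5fa5aae", "InstanceCount": "3", "InstanceType": "t2.small",
        "Tenancy": "default", "Version": "20151204013151",
        "VolumeSize": "30", "VPCCIDR": "10.0.0.0/16"}

# Assumed equivalent of changing the instance type and count of a rack.
for p in stack_parameters(RACK, {"InstanceType": "c3.xlarge", "InstanceCount": 10}):
    print(p)
```

Marking untouched parameters with `UsePreviousValue` keeps the request small and avoids resending values such as the password.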

Page 11: The Good Parts / The Hard Parts

PARAMETERIZED CONTAINERS

→ Cluster              convox-Cluster-1JI343QBLSMYJ
→ Cpu                  200
→ Environment          https://httpd-settings-1e3ej4u01z4bv.s3.amazonaws.com/releases/RSAQCOYHGPV/env
→ Key                  arn:aws:kms:us-east-1:901416387788:key/d40c0153-4a57-4d50-9ca0-99a974daca11
→ Release              RSAQCOYHGPV
→ VPC                  vpc-b27ff8d6
→ WebCommand
→ WebDesiredCount      1
→ WebImage             convox-820546104.us-east-1.elb.amazonaws.com:5000/httpd-web:BQIWNCMIYZG
→ WebMemory            256
→ WebPort80Balancer    80
→ WebPort80Certificate
→ WebPort80Host        42563
→ WebPort80Secure      No

↑ Balancer             httpd                                        AWS::ElasticLoadBalancing::LoadBalancer
↑ Kinesis              httpd-Kinesis-FO32SUUFLX24                   AWS::Kinesis::Stream
↑ LogsAccess           AKIAIFI65IDSEURPK62Q                         AWS::IAM::AccessKey
↑ LogsUser             httpd-LogsUser-96BAE2EL9TNL                  AWS::IAM::User
↑ ServiceRole          httpd-ServiceRole-19LN8R18BIVRW              AWS::IAM::Role
↑ Settings             httpd-settings-1e3ej4u01z4bv                 AWS::S3::Bucket
↑ WebECSService        arn:aws:ecs:...:service/httpd-web-SATOEEBOQNF  Custom::ECSService
↑ WebECSTaskDefinition arn:aws:ecs:...:task-definition/httpd-web:6  Custom::ECSTaskDefinition

← BalancerWebHost   httpd-908645489.us-east-1.elb.amazonaws.com
← Kinesis           httpd-Kinesis-FO32SUUFLX24
← Settings          httpd-settings-1e3ej4u01z4bv
← WebPort80Balancer 80

Page 12: The Good Parts / The Hard Parts

APP MANIFEST ⟷ IAAS

web:                        Task Definition httpd-web:6
  command: bin/web          Service httpd-web-SATOEEBOQNF
  build: .                  Docker Image httpd-web:BQIWNCMIYZG
  ports:
  - 80:80                   ELB 80 : 52452 : 80
  - 443:80                  ELB (SSL) 443 : 52452 : 80

worker:                     Task Definition httpd-worker:6
  command: bin/worker       Service httpd-worker-SHAOPEQONEF
  build: .                  Docker Image httpd-worker:BQIWNCMIYZG (same image, new tag)
  links:
  - redis                   REDIS_URL=rer45wxl0uj8jn6.1qae5u.ng.0001.usw2.cache.amazonaws.com:6379
  - rabbit                  RABBIT_URL=httpd-1222973998.us-west-2.elb.amazonaws.com:5672

rabbit:                     Task Definition httpd-rabbit:6
  command: rabbitmq-server  Service httpd-rabbit-SPNFHGMWNUU
  image: rabbitmq           Docker Image httpd-rabbit:BQUWNCMIYZG
  ports:
  - 5672                    ELB (Internal) 5672 : 24324 : 5672

redis:
  image: convox/redis       AWS::ElastiCache::CacheCluster
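Each `ports` entry in the manifest ends up as a load balancer listener, written on the slide as balancer port : host port : container port. A small Python sketch of the first step, parsing manifest entries into listener descriptions. It assumes, based only on this slide, that a bare port such as `5672` means an internal listener; host port allocation is left out because the slides do not say how it is chosen. `parse_ports` is a hypothetical helper.

```python
def parse_ports(entries):
    """Turn manifest port entries into listener descriptions.

    "443:80" -> external listener, balancer port 443, container port 80
    "5672"   -> internal listener, same port on both sides
    """
    listeners = []
    for entry in entries:
        parts = str(entry).split(":")
        if len(parts) == 2:
            balancer, container = int(parts[0]), int(parts[1])
            internal = False
        elif len(parts) == 1:
            balancer = container = int(parts[0])
            internal = True
        else:
            raise ValueError("unsupported port entry: %r" % (entry,))
        listeners.append({"balancer": balancer, "container": container,
                          "internal": internal})
    return listeners

# Port lists copied from the manifest on the slide.
MANIFEST_PORTS = {"web": ["80:80", "443:80"], "rabbit": [5672]}

for process, entries in MANIFEST_PORTS.items():
    for listener in parse_ports(entries):
        kind = "ELB (Internal)" if listener["internal"] else "ELB"
        print(process, kind, listener["balancer"], "->", listener["container"])
```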

Page 13: The Good Parts / The Hard Parts

GLUE Lambda

Page 14: The Good Parts / The Hard Parts

CLOUDFORMATION LAMBDA CUSTOM RESOURCES

┌─────────────────────────────────────┐
│POST arn:aws:lambda:...              │
│{                                    │
│  ResourceProperties: {              │
│    Description: "Master Encryption",│  ┌─────────────────────────────────────┐  ┌───────────────────────────┐
│    KeyUsage: "ENCRYPT_DECRYPT"      │  │aws kms create-key                  \│  │200: OK                    │
│  }                                  │  │  --description "Master Encryption" \│  │400: LimitExceededException│
│}                                    │  │  --key-usage ENCRYPT_DECRYPT        │  │500: KMSInternalException  │
└─────────────────────────────────────┘  └─────────────────────────────────────┘  └───────────────────────────┘

┌────────────────┐                       ┌──────────────┐──────────────────────▶┌───────────┐
│ CloudFormation │──────────────────────▶│    Lambda    │                       │AWS KMS API│
└────────────────┘  CREATE_IN_PROGRESS   └──────────────┘◀──────────────────────└───────────┘
        ▲                                       │
        │                                       │
        │          CREATE_COMPLETE              │
        │                OR                     ▼
        │          CREATE_FAILED         ┌─────────────┐
        └────────────────────────────────│     S3      │
                                         └─────────────┘
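The flow in the diagram follows the CloudFormation custom resource protocol: CloudFormation invokes the Lambda function with `ResourceProperties`, the function calls the AWS API, and it reports the outcome by uploading a JSON document to the pre-signed S3 URL supplied as `ResponseURL`. A minimal Python sketch of that contract, using the documented request and response field names. `create_key` and `send` are hypothetical injected callables so the example runs without AWS; a real handler would call KMS and issue an HTTP PUT.

```python
import json

def build_response(event, status, physical_id, reason="", data=None):
    """Assemble the JSON body CloudFormation expects from a custom resource."""
    return {
        "Status": status,                      # "SUCCESS" or "FAILED"
        "Reason": reason,
        "PhysicalResourceId": physical_id,
        "StackId": event["StackId"],
        "RequestId": event["RequestId"],
        "LogicalResourceId": event["LogicalResourceId"],
        "Data": data or {},
    }

def handle(event, create_key, send):
    """Handle a Create request; create_key and send are injected stand-ins."""
    props = event["ResourceProperties"]
    try:
        key_id = create_key(props["Description"], props["KeyUsage"])
        body = build_response(event, "SUCCESS", key_id, data={"KeyId": key_id})
    except Exception as err:
        # Always answer, otherwise the stack waits until it times out.
        body = build_response(event, "FAILED", "failed-create", reason=str(err))
    send(event["ResponseURL"], json.dumps(body))
    return body

# Local dry run with fake collaborators instead of KMS and S3.
EVENT = {
    "RequestType": "Create",
    "ResponseURL": "https://example.invalid/presigned",  # placeholder URL
    "StackId": "stack-id", "RequestId": "request-id",
    "LogicalResourceId": "EncryptionKey",
    "ResourceProperties": {"Description": "Master Encryption",
                           "KeyUsage": "ENCRYPT_DECRYPT"},
}
sent = []
result = handle(EVENT, lambda desc, usage: "key-1234",
                lambda url, body: sent.append((url, body)))
print(result["Status"], sent[0][0])
```

Reporting `FAILED` explicitly is what lets the stack move to `CREATE_FAILED` promptly instead of hanging in `CREATE_IN_PROGRESS`.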

Page 15: The Good Parts / The Hard Parts

• Writing templates

• DependsOn

• Transient internal errors

• UPDATE_ROLLBACK_FAILED and DELETE_FAILED

• Migrating custom resources to native resources

• Debugging Lambda

• Sitting helpless during a Lambda outage

• Waiting for things to provision

THE HARD PARTS CLOUDFORMATION + LAMBDA
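For transient internal errors, a common mitigation is to retry with exponential backoff before reporting failure. The Python sketch below is a generic illustration, not something the slides attribute to Convox; `TransientError` and `with_retries` are hypothetical names, and which errors count as transient is an assumption.

```python
import time

class TransientError(Exception):
    """Stand-in for a retryable failure such as a 500 from an AWS API."""

def with_retries(call, attempts=5, base_delay=0.5, sleep=time.sleep):
    """Run call(), retrying TransientError with exponential backoff."""
    for attempt in range(attempts):
        try:
            return call()
        except TransientError:
            if attempt == attempts - 1:
                raise                      # out of attempts: surface the error
            sleep(base_delay * (2 ** attempt))

# Example: fails twice, then succeeds. sleep is stubbed so it runs instantly.
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("KMSInternalException")
    return "created"

delays = []
print(with_retries(flaky, sleep=delays.append), delays)
```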

Page 16: The Good Parts / The Hard Parts

THE HARD PARTS 100% CORRECTNESS

2800+ test clusters across 3 regions...

Page 17: The Good Parts / The Hard Parts

THE GREAT PARTS

$ convox rack update

$ convox rack scale --type c3.xlarge --count 10

$ convox rack update <previous release>

• Update convox API quickly

• Update cluster AMIs one at a time and with zero downtime

• Resize instances one at a time and with zero downtime

• Roll out new subsystems like ECR, CloudWatch Logs and NAT Gateways

• Fail towards not modifying working infrastructure

• Roll back to previous good state if something truly unexpected happens
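The "one at a time" and "fail towards not modifying working infrastructure" points describe an ordering: bring up a replacement, confirm it is healthy, and only then remove an old instance. Since this slide sits in the CloudFormation section, the real work is presumably done by CloudFormation and the Auto Scaling group; the Python loop below only illustrates the ordering, with `launch`, `healthy` and `terminate` as hypothetical callables.

```python
def rolling_replace(old_instances, launch, healthy, terminate):
    """Replace instances one at a time; stop at the first unhealthy replacement.

    Returns (replaced, untouched) so the caller can see how far it got.
    """
    replaced = []
    remaining = list(old_instances)
    while remaining:
        new = launch()
        if not healthy(new):
            terminate(new)          # discard the bad replacement only
            break                   # leave the remaining old instances running
        old = remaining.pop(0)
        terminate(old)              # capacity never drops below the original
        replaced.append((old, new))
    return replaced, remaining

# Dry run: the third replacement fails its health check.
counter = iter(range(1, 100))
terminated = []
replaced, untouched = rolling_replace(
    ["i-old1", "i-old2", "i-old3"],
    launch=lambda: "i-new%d" % next(counter),
    healthy=lambda inst: inst != "i-new3",
    terminate=terminated.append,
)
print(replaced, untouched, terminated)
```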

Page 18: The Good Parts / The Hard Parts

CONTAINER AUTOMATION ECS

Page 19: The Good Parts / The Hard Parts

BATTERIES NOT INCLUDED

API

• Clusters

• TaskDefinitions

• Tasks

• Services

Bring Your Own

• Instances

• ecs-agent

• Load Balancers

• Logging

• Builds / Images

• Tools...

Page 20: The Good Parts / The Hard Parts

SCALING ONE APP ⟶ MANY SERVICES

Service Name                 Task Definition      Desired  Running
═══════════════════════════════════════════════════════════════════════════
myapp-clock-SVQQEUPGZPS      myapp-clock:106      1        1
myapp-scheduler-SSMOCJRAGOM  myapp-scheduler:183  1        1
myapp-web-SLHARAVBAWZ        myapp-web:119        2        2
myapp-runner-SEGBMHLWREH     myapp-runner:163     4        4

Page 21: The Good Parts / The Hard Parts

DEBUGGING RUN, EXEC, SSH OVER WEB SOCKETS

$ convox run web bash
root@3e4160f0c4d0:/app#

$ convox ps
ID            NAME    RELEASE      CPU    MEM     STARTED      COMMAND
551967b75abd  web     RHQZEJZFCSD  0.39%  21.04%  2 hours ago  rails server -b 0.0.0.0
f5ec95c38f58  worker  RHQZEJZFCSD  0.00%  30.35%  2 hours ago  sidekiq

$ convox exec 551967b75abd bash
root@281d0a9c33a:/app#

$ convox exec 551967b75abd ps ax
PID  USER  TIME  COMMAND
  1  root  0:00  sh -c bin/web
  6  root  0:00  {web} /bin/sh bin/web
  9  root  0:00  unicorn master -c unicorn.rb
 11  root  0:00  unicorn worker[0] -c

Page 22: The Good Parts / The Hard Parts

GLUE Lambda

Page 23: The Good Parts / The Hard Parts

APP LOGS AGENT, DOCKER APIS, KINESIS, LAMBDA

[Diagram: log pipeline, left to right]

EC2 Instance in ECS Cluster
  containers: app1 web.1, app2 worker.1
  daemons:    dockerd, convox/agent, ecs-agent
  convox/agent:
    GET docker /events (create)
    GET ENV "Kinesis", "Process"
    GET Docker /logs?follow=1
    PUT Kinesis /records

      ▼

app1 Kinesis: shard 1
app2 Kinesis: shard 1, shard 2, ... shard N

      ▼

Lambda w/ EventSourceMapping
  function(event, context) {
    event.records.forEach(function(r) {
      winston.info(r.kinesis.data)
    })
    context.done()
  }
  function(event, context) { ... }
  function(event, context) { ... }

      ▼

Syslog Server
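The Lambda stage of this pipeline receives batches of Kinesis records and forwards each log line onward. In the event that Lambda delivers for a Kinesis event source, records arrive under `Records` and each payload is base64-encoded in `kinesis.data`. A Python sketch of the same idea as the function on the slide; `forward` is a hypothetical stand-in for the logger (winston on the slide).

```python
import base64

def handler(event, context, forward=print):
    """Decode each Kinesis record and pass the log line to forward()."""
    count = 0
    for record in event.get("Records", []):
        line = base64.b64decode(record["kinesis"]["data"]).decode("utf-8")
        forward(line)
        count += 1
    return count

# Local test event shaped like a Kinesis batch; the log line comes from the slides.
LINE = 'web: [01/Jan/2015:00:00:00] "GET / HTTP/1.1" 200 554 0.0027'
EVENT = {"Records": [{"kinesis": {"data": base64.b64encode(LINE.encode()).decode()}}]}

handler(EVENT, None)
```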

Page 24: The Good Parts / The Hard Parts

• Setting it all up: VPC, ASG, ELBs, health checks

• Managing instances

• Understanding its distributed state machine

• Rolling deploys

• Container scheduling and re-scheduling

• Capacity problems

• Collecting and making sense of logs and events

THE HARD PARTS ECS

Page 25: The Good Parts / The Hard Parts

• CloudFormation updates

• ECS Task Definition and Service updates

• On-instance observations

    • ecs-agent

    • dockerd

    • convox/agent

• App failures

    • crashes

    • port unresponsive

• Instance failures

    • filesystem lockups

    • kernel panics

• General EC2 / ASG health

THE HARD PARTS COMPLEX INTERACTIONS AND FEEDBACK LOOPS

[Diagram: ECS schedules containers onto three ASG instances, each running ecs-agent and dockerd, behind an api ELB and a rails ELB]

api             128 MB
registry        256 MB
rails web.2    1024 MB
data worker.1   512 MB
rails web.3    1024 MB
data worker.2   512 MB
rails worker.2  256 MB
rails worker.3  256 MB
rails web.1    1024 MB
rails worker.1  256 MB
rails worker.4  256 MB
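Capacity problems often show up at placement time: the cluster can have enough free memory in total while no single instance has room for the next container. A first-fit placement sketch in Python makes this concrete. The instance sizes are invented for illustration and the slides do not describe how the ECS scheduler actually places tasks; `place` is a hypothetical helper.

```python
def place(containers, free):
    """First-fit placement by memory.

    containers: list of (name, memory_mb); free: list of free MB per instance.
    Returns ({name: instance_index}, [names that did not fit]).
    """
    free = list(free)               # do not mutate the caller's list
    placed, unplaced = {}, []
    for name, memory in containers:
        for i, available in enumerate(free):
            if memory <= available:
                free[i] -= memory
                placed[name] = i
                break
        else:
            unplaced.append(name)   # no single instance has enough room
    return placed, unplaced

# Two hypothetical instances with 600 MB free each: 1200 MB in total,
# yet a 1024 MB container cannot be placed on either of them.
placed, unplaced = place([("rails web.4", 1024), ("rails worker.5", 256)], [600, 600])
print(placed, unplaced)
```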

Page 26: The Good Parts / The Hard Parts

THE HARD PARTS CONTAINERS EXERCISE NEW KERNEL, NETWORK, FILESYSTEM PATHS

Page 27: The Good Parts / The Hard Parts

THE GREAT PARTS

$ convox deploy

• Configure desired container formation with one API call

• Watch extremely sophisticated automation execute it

• Assure new containers start and are healthy

• Drain old containers

• Trust automation will try its hardest to keep it running

• Re-schedule on observed failures
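These points describe a desired-state model: the caller declares how many containers of each kind should run, and the scheduler keeps converging towards that. A simplified Python sketch of the comparison at the centre of such a loop, using service names and counts from the services table earlier in the deck. `reconcile` is a hypothetical helper and not ECS's implementation.

```python
def reconcile(desired, running):
    """Compare desired and running task counts per service.

    Returns {service: delta}; positive means start tasks, negative means stop.
    Services already at their desired count are omitted.
    """
    actions = {}
    for service in sorted(set(desired) | set(running)):
        delta = desired.get(service, 0) - running.get(service, 0)
        if delta != 0:
            actions[service] = delta
    return actions

# Desired counts from the services table; one web task and one runner task
# are assumed to have crashed.
DESIRED = {"myapp-clock": 1, "myapp-scheduler": 1, "myapp-web": 2, "myapp-runner": 4}
RUNNING = {"myapp-clock": 1, "myapp-scheduler": 1, "myapp-web": 1, "myapp-runner": 3}
print(reconcile(DESIRED, RUNNING))
```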

Page 28: The Good Parts / The Hard Parts

• Provision new infrastructure

• Update base operating system

• Add capacity with horizontal and vertical scaling

• Monitor health

• Handle failures automatically

• Create new apps

• Deploy new code

• Add capacity with horizontal and vertical scaling

• Configure secrets and services

• Debug problems and tune performance

• Monitor health

• Handle failures automatically

CONVOX MAKE DEVOPS BORING

Page 29: The Good Parts / The Hard Parts

[email protected] @nzoschke

Discuss these techniques and get involved

GitHub https://github.com/convox
Slack http://invite.convox.com/

 _   _                 _        _
| |_| |__   __ _ _ __ | | _____| |
| __| '_ \ / _` | '_ \| |/ / __| |
| |_| | | | (_| | | | |   <\__ \_|
 \__|_| |_|\__,_|_| |_|_|\_\___(_)

(we are hiring)