A Journey Through Wonderland - Jimdo
A Journey Through Wonderland
Paul Seiffert, Mathias Lafeldt
The Purpose of Wonderland
● Took Jimdo 5 years to migrate core infrastructure from bare metal to AWS
● Teams started to love the cloud
● Many experiments in different AWS accounts
● “Reinvented” production stacks
How we got here
● Founded to solve common infrastructure problems of Jimdo teams
● Provides standard platform that is reliable and simple to use: Wonderland
● Allows Jimdo developers to focus on product development
Werkzeugschmiede Team
Wonderland’s History
Wonderland 101
PaaS allowing Jimdo developers to run their dockerized applications
● Long-running stateless services
  ○ DNS, load balancing, health checks, auto scaling, …
● One-off tasks and cron jobs
● Centralized logging and metrics collection via external providers
Features
● APIs
● CLI tool wl
● Chatbot Alice
● Docker registry
● Vault
● No SSH access
Interfaces
● SLA
● Status page
● Documentation
● Workshops
● Use-case-driven development
Internal service provider
Wonderland Internals
We run...
● AWS infrastructure
● Services providing our APIs
AWS Infrastructure
● Networking
● Cluster of EC2 instances
● Jenkins
● Route 53, DynamoDB, S3, SQS, SNS, ...
“Crims” Cluster
● Runs user applications + system services
● EC2 auto-scaling group
● Providing resources to ECS
● CoreOS
AWS ECS
AWS EC2 Auto-Scaling Group
Two-Dimensional Auto-Scaling
● Services (based on resource consumption)
● Cluster (based on available slots)
[Chart: metrics AWS/AutoScaling GroupDesiredCapacity and Wonderland/ECS DesiredClusterSizeDelta, plotted over 1 week]
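The cluster dimension of auto-scaling can be pictured as a slot calculation. The following is a minimal sketch, not Wonderland's actual algorithm: the slot size, the target number of free slots, and all function and field names are assumptions made for illustration. Only the idea of scaling on available slots and the metric name DesiredClusterSizeDelta come from the slides.

```python
import math
from dataclasses import dataclass


@dataclass
class Instance:
    free_cpu: int     # unreserved CPU units on the instance (assumed unit)
    free_memory: int  # unreserved memory in MiB (assumed unit)


def available_slots(instances, slot_cpu, slot_memory):
    """Count how many more slot-sized tasks the cluster could still place."""
    return sum(
        min(i.free_cpu // slot_cpu, i.free_memory // slot_memory)
        for i in instances
    )


def desired_cluster_size_delta(instances, slot_cpu, slot_memory,
                               target_free_slots, slots_per_instance):
    """Return how many instances to add (positive) or remove (negative)."""
    free = available_slots(instances, slot_cpu, slot_memory)
    missing = target_free_slots - free
    if missing > 0:
        # Round up: a partially needed instance is still a whole instance.
        return math.ceil(missing / slots_per_instance)
    # Only shrink when at least one whole instance worth of slots is spare.
    return -((-missing) // slots_per_instance)


if __name__ == "__main__":
    cluster = [Instance(free_cpu=1024, free_memory=2048),
               Instance(free_cpu=512, free_memory=4096)]
    print(desired_cluster_size_delta(cluster, 512, 1024,
                                     target_free_slots=5,
                                     slots_per_instance=4))
```

A delta computed this way could be published as a custom metric and used to adjust the desired capacity of the EC2 auto-scaling group. Rounding up when growing and rounding down when shrinking keeps the cluster from oscillating around the target.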
[Diagram: a cluster instance running the ECS Agent, a Log Forwarder, and the Datadog Agent next to AWS ECS services A, B, and C. Two ELBs are shown with the port labels HTTP :80, HTTPS :443, HTTP :11411, TCP :1234, TCP :11412.]
A Crims Cluster Instance
● Infrastructure as code
● CloudFormation and Ansible
● Applied by a Central State Enforcer
● Workflow based on GitHub pull requests
● Automated rollout to production
Infrastructure Development
● We test everything
● Unit, integration, and system tests
● Tests in staging environment
● Staging is set up from scratch every week
● Periodic GameDays
QA
Our Services
● provide APIs
● deploy other services
● are Wonderland services
[Diagram: service architecture showing WL (CLI Tool), Alice (Chatbot), Deployer API, SQS Queue, Deployer Worker, Service AutoScaler, StatusCheck, (Dash-)Boards, Oraculum (Logs), Notifications, AWS Route 53, AWS Application Auto Scaling, AWS SNS, and AWS S3]
Service Configuration
$ cat wonderland-autoscaler/wonderland.yaml
---
scale: 2
components:
  - name: autoscaler
    image: registry.example.com/wonderland-autoscaler:v1.0.3
    env:
      DYNAMODB_TABLE_NAME: wonderland-autoscaling-configs
endpoint:
  domain: autoscaler.example.com
  load-balancer:
    healthcheck:
      path: /v1/health
    ports:
      - port: 443
        protocol: HTTPS
        component: autoscaler
        port: 80
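A configuration like the one above lends itself to simple checks before a deploy is attempted. The sketch below is hypothetical and not part of the wl tool: it works on a Python dict that mirrors the YAML, the nesting of `ports` under `load-balancer` is an assumption about the flattened slide text, and `validate_config` is an invented name.

```python
def validate_config(config):
    """Return a list of problems found in a service configuration dict."""
    problems = []

    scale = config.get("scale")
    if not isinstance(scale, int) or scale < 1:
        problems.append("scale must be a positive integer")

    # Collect the names of all well-formed components.
    names = set()
    for component in config.get("components", []):
        if "name" not in component or "image" not in component:
            problems.append("every component needs a name and an image")
            continue
        names.add(component["name"])

    # Every load-balancer port entry must point at a declared component.
    load_balancer = config.get("endpoint", {}).get("load-balancer", {})
    for entry in load_balancer.get("ports", []):
        if entry.get("component") not in names:
            problems.append(
                f"port {entry.get('port')} references unknown component "
                f"{entry.get('component')!r}"
            )
    return problems


# Dict form of the configuration shown on the slide (assumed nesting).
EXAMPLE_CONFIG = {
    "scale": 2,
    "components": [{
        "name": "autoscaler",
        "image": "registry.example.com/wonderland-autoscaler:v1.0.3",
        "env": {"DYNAMODB_TABLE_NAME": "wonderland-autoscaling-configs"},
    }],
    "endpoint": {
        "domain": "autoscaler.example.com",
        "load-balancer": {
            "healthcheck": {"path": "/v1/health"},
            "ports": [{"port": 443, "protocol": "HTTPS",
                       "component": "autoscaler"}],
        },
    },
}

if __name__ == "__main__":
    print(validate_config(EXAMPLE_CONFIG))
```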
Deploy it!
$ wl deploy autoscaler -f wonderland-autoscaler/wonderland.yaml
autoscaler/1466583476 This is try 1
autoscaler/1466583476 Updating ELB autoscaler-1466437217
autoscaler/1466583476 Configuring health check HTTP:11011/v1/health
autoscaler/1466583476 Enabling cross-zone load balancing
autoscaler/1466583476 Configuring connection draining with a timeout of 180s
autoscaler/1466583476 Not enabling access log
autoscaler/1466583476 Letting autoscaler.example.com point to autoscaler-1363526915.eu-west-1.elb.amazonaws.com
autoscaler/1466583476 Registered new ECS TaskDefinition (autoscaler:58) for service autoscaler
autoscaler/1466583476 Updating ECS service autoscaler-1466437217
autoscaler/1466583476 Waiting for service autoscaler-1466437217 to complete rolling update (timeout in 180s)
autoscaler/1466583476 Waiting for service autoscaler-1466437217 to complete rolling update (timeout in 170s)
autoscaler/1466583476 Waiting for service autoscaler-1466437217 to complete rolling update (timeout in 160s)
autoscaler/1466583476 Waiting for service autoscaler-1466437217 to complete rolling update (timeout in 150s)
autoscaler/1466583476 Waiting for service autoscaler-1466437217 to complete rolling update (timeout in 140s)
autoscaler/1466583476 Waiting for service autoscaler-1466437217 to complete rolling update (timeout in 130s)
autoscaler/1466583476 Waiting for service autoscaler-1466437217 to complete rolling update (timeout in 120s)
autoscaler/1466583476 Waiting for service autoscaler-1466437217 to complete rolling update (timeout in 110s)
autoscaler/1466583476 Waiting for service autoscaler-1466437217 to complete rolling update (timeout in 100s)
autoscaler/1466583476 Waiting for service autoscaler-1466437217 to complete rolling update (timeout in 90s)
autoscaler/1466583476 Waiting for service autoscaler-1466437217 to complete rolling update (timeout in 80s)
autoscaler/1466583476 Rolling update completed successfully.
autoscaler/1466583476 Waiting for ELB to have at least one healthy instance
autoscaler/1466583476 Deleting old ECS Task Definition service-autoscaler:57
autoscaler/1466583476 Marking deployment autoscaler/1466583476 active
autoscaler/1466583476 [Boards] Creating Board for Service [werkzeugschmiede] autoscaler
autoscaler/1466583476 [Datadog] Creating Deployment Event
autoscaler/1466583476 [Notifications] Notification channel is /v1/teams/werkzeugschmiede/channels/autoscaler
autoscaler/1466583476 [StatusCheck] CheckID is f85ded4d-9ad0-4375-81b4-5989964e8ed5
autoscaler/1466583476 Deployment successful
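The repeated "Waiting for service ..." lines in the deploy output suggest a polling loop with a 180-second budget and a 10-second interval. The following is a rough sketch of such a loop, not the Deployer's real code: `wait_for_rolling_update`, its parameters, and the timeout message are assumptions, and the check for completion is passed in as a callable so no AWS call is implied.

```python
import time


def wait_for_rolling_update(is_complete, timeout=180, interval=10,
                            sleep=time.sleep, log=print):
    """Poll is_complete() until it returns True or the timeout is used up."""
    remaining = timeout
    while True:
        if is_complete():
            log("Rolling update completed successfully.")
            return True
        if remaining <= 0:
            # Hypothetical message; the slides only show the success path.
            log("Rolling update timed out.")
            return False
        log(f"Waiting for service to complete rolling update "
            f"(timeout in {remaining}s)")
        sleep(interval)
        remaining -= interval


if __name__ == "__main__":
    # Simulate a service that finishes its update on the third check.
    answers = iter([False, False, True])
    wait_for_rolling_update(lambda: next(answers), sleep=lambda seconds: None)
```

Injecting `sleep` and `log` keeps the loop testable without real waiting, which matters if deployments are exercised in automated tests.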
Monitor it!
$ wl status autoscaler
Current deployment: 1466583491
Desired scale: 2

Machine     Component   Status   Started               Deployment  ELB
-------     ---------   ------   -------               ----------  ---
i-7db992f7  autoscaler  RUNNING  22 Jun 16 11:14 CEST  1466583491  InService
i-fb2f5b77  autoscaler  RUNNING  24 Jun 16 01:13 CEST  1466583491  InService

$ wl logs -f autoscaler
...
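Output like that of `wl status` can be checked by a script, for example in a smoke test after a deployment. This is a minimal sketch under the assumption that the table keeps the column layout shown on the slide; `parse_status` and `is_healthy` are hypothetical helpers, not part of the wl tool.

```python
SAMPLE_OUTPUT = """\
Current deployment: 1466583491
Desired scale: 2

Machine     Component   Status   Started               Deployment  ELB
-------     ---------   ------   -------               ----------  ---
i-7db992f7  autoscaler  RUNNING  22 Jun 16 11:14 CEST  1466583491  InService
i-fb2f5b77  autoscaler  RUNNING  24 Jun 16 01:13 CEST  1466583491  InService
"""


def parse_status(output):
    """Parse the status table into a list of dicts, one per instance row."""
    rows = []
    for line in output.splitlines():
        fields = line.split()
        # Data rows start with an EC2 instance ID such as i-7db992f7.
        if len(fields) < 5 or not fields[0].startswith("i-"):
            continue
        rows.append({
            "machine": fields[0],
            "component": fields[1],
            "status": fields[2],
            # The start time contains spaces, so take everything in between.
            "started": " ".join(fields[3:-2]),
            "deployment": fields[-2],
            "elb": fields[-1],
        })
    return rows


def is_healthy(rows, desired_scale):
    """True if enough tasks are running and in service behind the ELB."""
    good = [r for r in rows
            if r["status"] == "RUNNING" and r["elb"] == "InService"]
    return len(good) >= desired_scale


if __name__ == "__main__":
    print(is_healthy(parse_status(SAMPLE_OUTPUT), desired_scale=2))
```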
The Future
● Persistent disk storage
● Dynamic load balancing
● Long-running / memory hungry jobs
● Speed up ECS cluster rotation
● Make crons more reliable
● Outsource Docker registry
Improvements
Twitter: @seiffertp / @mlafeldt
https://medium.com/production-ready
Questions?
Thank you.