ci and cd at scale: scaling jenkins with docker and apache mesos

63
CI AND CD AT SCALE SCALING JENKINS WITH DOCKER AND APACHE MESOS Carlos Sanchez @csanchez csanchez.org Watch online at carlossg.github.io/presentations

Upload: carlos-sanchez

Post on 07-Jan-2017

292 views

Category:

Software


3 download

TRANSCRIPT

Page 1: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos

CI AND CD AT SCALE

SCALING JENKINS WITH DOCKERAND APACHE MESOS

Carlos Sanchez

@csanchez csanchez.org

Watch online at carlossg.github.io/presentations

Page 2: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos

ABOUT MESenior So�ware Engineer @ CloudBees

Contributor to the Jenkins Mesos plugin and the JavaMarathon client

Author of Jenkins Kubernetes plugin

Long time OSS contributor at Apache, Eclipse, Puppet,…

Page 3: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos

OUR USE CASE

Scaling JenkinsYour mileage may vary

Page 4: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos

SCALING JENKINSTwo options:

More build agents per masterMore masters

Page 5: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos

SCALING JENKINS: MORE BUILDAGENTS

Pros

Multiple plugins to add more agents, even dynamically

Cons

The master is still a SPOFHandling multiple configurations, plugin versions,...There is a limit on how many build agents can beattached

Page 6: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos

SCALING JENKINS: MORE MASTERSPros

Different sub-organizations can self service and operateindependently

Cons

Single Sign-OnCentralized configuration and operation

Page 7: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos

CLOUDBEES JENKINS ENTERPRISE EDITIONCloudBees Jenkins Operations Center

Page 8: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos

CLOUDBEES JENKINS PLATFORM - PRIVATESAAS EDITION

The best of both worlds

CloudBees Jenkins Operations Center with multiple masters

Dynamic build agent creation in each master

ElasticSearch for Jenkins metrics and Logstash

Page 9: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos

BUT IT IS NOT TRIVIAL

Page 10: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos

A 2000 JENKINS MASTERS CLUSTER

Page 11: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos
Page 12: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos
Page 13: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos
Page 14: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos
Page 15: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos

A 2000 JENKINS MASTERS CLUSTER3 Mesos masters (m3.xlarge: 4 vCPU, 15GB, 2x40 SSD)317 Mesos slaves (c3.2xlarge, m3.xlarge, m4.4xlarge)7 Mesos slaves dedicated to ElasticSearch: (c3.8xlarge: 32vCPU, 60GB)

12.5 TB - 3748 CPU

Running 2000 masters and ~8000 concurrent jobs

Page 16: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos

ARCHITECTUREDocker Docker Docker

Page 17: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos

Isolated Jenkins masters

Isolated build agents and jobs

Memory and CPU limits

Page 18: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos

How would you design your infrastructure ifyou couldn't login? Ever.

Kelsey Hightower

Page 19: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos

EMBRACE FAILURE!

Page 20: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos

CLUSTER SCHEDULINGRunning in public cloud, private cloud, VMs or bare metal

Starting with AWS and OpenStackHA and fault tolerantWith Docker support of course

Page 21: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos

APACHE MESOS

A distributed systems kernel

Page 22: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos

ALTERNATIVES

Docker Swarm / Kubernetes

Page 23: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos

MESOSPHERE MARATHON

For long running Jenkins masters

<1.4 does not scale with the number of apps

App definitions hit the ZooKeeper node limit

Page 24: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos

TERRAFORM

Page 25: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos

TERRAFORMresource "aws_instance" "worker" { count = 1 instance_type = "m3.large" ami = "ami-xxxxxx" key_name = "tiger-csanchez" security_groups = ["sg-61bc8c18"] subnet_id = "subnet-xxxxxx" associate_public_ip_address = true tags { Name = "tiger-csanchez-worker-1" "cloudbees:pse:cluster" = "tiger-csanchez" "cloudbees:pse:type" = "worker" } root_block_device { volume_size = 50 } }

Page 26: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos

TERRAFORMState is managedRuns are idempotentterraform apply

Sometimes it is too automaticChanging image id will restart all instances

Had to fix a number of bugs, ie. retry AWS calls

Page 27: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos
Page 28: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos

Preinstall packages: Mesos, Marathon, DockerCached docker imagesOther drivers: XFS, NFS,...Enhanced networking driver (AWS)

Page 29: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos

MESOS FRAMEWORKStarted with Jenkins Mesos plugin

Means one framework per Jenkins master, does not scale

If master is restarted all jobs running get killed

Page 30: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos

OUR NEW MESOS FRAMEWORKUsing Netflix Fenzo

Runs under Marathon, exposes REST API that Jenkinsmasters call

Reduce number of frameworksFaster to spawn new build agents because framework isnot startedPipeline durable builds, can survive a restart of the masterDedicated workers for buildsAffinity

Page 31: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos

STORAGEHandling distributed storage

Servers can start in any host of the cluster

And they can move when they are restarted

Jenkins masters need persistent storage, agents (typically)don't

Supporting EBS (AWS) and external NFS

Page 32: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos

SIDEKICK CONTAINERA privileged container that manages mounting for other

containers

Can execute commands in the host and other containers

Page 33: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos

SIDEKICK CONTAINER CASTLERunning in Marathon in each host

"constraints": [ [ "hostname", "UNIQUE" ] ]

Page 34: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos

A lot of magic happening with nsenter

both in host and other containers

Page 35: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos

Jenkins master container requests data on startup usingentrypoint

REST call to CastleCastle checks authenticationCreates necessary storage in the backend

EBS volumes from snapshotsDirectories in NFS backend

Page 36: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos

Mounts storage in requesting containerEBS is mounted to host, then bind mounted intocontainerNFS is mounted directly in container

Listens to Docker event stream for killed containers

Page 37: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos

CASTLE: BACKUPS AND CLEANUPPeriodically takes snapshots from EBS volumes in AWS

Cleanups happening at different stages and periodically

EMBRACE FAILURE!

Page 38: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos

PERMISSIONSContainers should not run as root

Container user id != host user id

i.e. jenkins user in container is always 1000 but matchesubuntu user in host

Page 39: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos

CAVEATSOnly a limited number of EBS volumes can be mounted

Docs say /dev/sd[f-p], but /dev/sd[q-z] seem towork too

Sometimes the device gets corrupt and no more EBSvolumes can be mounted there

NFS users must be centralized and match in cluster and NFSserver

Page 40: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos

MEMORYScheduler needs to account for container memory

requirements and host available memory

Prevent containers for using more memory than allowed

Memory constrains translate to Docker --memory

Page 41: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos

WHAT DO YOU THINK HAPPENSWHEN?

Your container goes over memory quota?

Page 42: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos
Page 43: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos

WHAT ABOUT THE JVM?

Page 44: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos

WHAT ABOUT THE CHILDPROCESSES?

Page 45: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos

CPUScheduler needs to account for container CPU requirements

and host available CPUs

WHAT DO YOU THINK HAPPENSWHEN?

Your container tries to access more than one CPU

Your container goes over CPU limits

Page 46: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos

Totally different from memory

CPU translates into Docker \-\-cpu-shares

Page 47: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos

OTHERCONSIDERATIONS

ZOMBIE REAPING PROBLEM

Page 48: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos

ZOMBIE REAPING PROBLEM

Zombie processes are processes that have terminated buthave not (yet) been waited for by their parent processes.

The init process -- PID 1 -- task is to "adopt" orphaned childprocesses

source

Page 49: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos

THIS IS A PROBLEM IN DOCKERJenkins build agent run multiple processes

But Jenkins masters too, and they are long running

Page 50: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos

TINISystemd or SysV init is too heavyweight for containers

All Tini does is spawn a single child (Tini ismeant to be run in a container), and waitfor it to exit all the while reaping zombies

and performing signal forwarding.

PROCESS REAPINGDocker 1.9 gave us trouble at scale, rolled back to 1.8

Lots of defunct processes

Page 51: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos

NETWORKINGJenkins masters open several ports

HTTPJNLP Build agentSSH server (Jenkins CLI type operations)

Page 52: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos

NETWORKING: HTTPWe use a simple nginx reverse proxy for

MesosMarathonElasticSearchCJOCJenkins masters

Gets destination host and port from Marathon

Page 53: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos

NETWORKING: HTTPDoing both

domain based routing master1.pse.example.compath based routing pse.example.com/master1

because not everybody can touch the DNS or get awildcard SSL certificate

Page 54: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos

NETWORKING: JNLPBuild agents started dynamically in Mesos cluster can

connect to masters internally

Build agents manually started outside cluster get host andport destination from HTTP, then connect directly

Page 55: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos

NETWORKING: SSHSSH Gateway Service

Tunnel SSH requests to the correct host

Simple configuration needed in clientHost=*.ci.cloudbees.com ProxyCommand=ssh -q -p 22 ssh.ci.cloudbees.com tunnel %h

allows to runssh master1.ci.cloudbees.com

Page 56: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos

SCALINGNew and interesting problems

Hitler uses Docker

Page 57: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos
Page 58: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos

TERRAFORM AWSInstancesKeypairsSecurity GroupsS3 bucketsELBVPCs

Page 59: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos

AWSResource limits: VPCs, S3 snapshots, some instance sizes

Rate limits: affect the whole account

Retrying is your friend, but with exponential backoff

Page 60: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos

AWSRunning with a patched Terraform to overcome timeouts

and AWS eventual consistency<?xml version="1.0" encoding="UTF-8"?> <DescribeVpcsResponse xmlns="http://ec2.amazonaws.com/doc/2015-10-01/" <requestId>8f855bob-3421-4cff-8c36-4b517eb0456c</requestld> <vpcSet> <item> <vpcId>vpc-30136159</vpcId> <state>available</state> <cidrBlock>10.16.0.0/16</cidrBlock> ... </DescribeVpcsResponse> 2016/05/18 12:55:57 [DEBUG] [aws-sdk-go] DEBUG: Response ec2/DescribeVpcAttribute Details:--[ RESPONSE] ------------------------------------ HTTP/1.1 400 Bad Request<Response><Errors><Error><Code>InvalidVpcID.NotFound</Code><Message> The vpc ID 'vpc-30136159‘ does not exist</Message></Error></Errors>

Page 61: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos

TERRAFORM OPENSTACKInstancesKeypairsSecurity GroupsLoad BalancerNetworks

Page 62: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos

OPENSTACKCustom flavors

Custom images

Different CLI commands

There are not two OpenStack installations that are the same