Container Orchestration: From Theory to Practice
TRANSCRIPT
Orchestration: A control system for your cluster

[Diagram: a feedback control loop in which the Orchestrator (O) observes the Cluster (C) state (S), compares it against the Desired State (D), and emits operations (Δ)]

D = Desired State
O = Orchestrator
C = Cluster
S = State
Δ = Operations to converge S to D

https://en.wikipedia.org/wiki/Control_theory
Convergence: A functional view

D = Desired State
O = Orchestrator
C = Cluster
S_t = State at time t

f(D, S_{t-1}, C) → S_t, minimizing |S_t − D|
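The functional view above can be sketched as a reconciliation loop. This is a hypothetical illustration: the `State` type and `converge` function are invented for the example and are not SwarmKit's actual API.

```go
package main

import "fmt"

// State is the desired or observed cluster state, reduced to a single
// replica count for illustration.
type State struct{ Replicas int }

// converge computes the next state by moving the observed state one
// step toward the desired state: f(D, S_t-1) -> S_t.
func converge(desired, observed State) State {
	switch {
	case observed.Replicas < desired.Replicas:
		observed.Replicas++ // start one more task
	case observed.Replicas > desired.Replicas:
		observed.Replicas-- // shut one task down
	}
	return observed
}

func main() {
	d := State{Replicas: 3}
	s := State{Replicas: 0}
	// Repeatedly applying f drives |S_t - D| to zero.
	for s != d {
		s = converge(d, s)
		fmt.Println("state:", s.Replicas)
	}
}
```

Each iteration shrinks the difference between observed and desired state, which is exactly the convergence property the formula expresses.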
Observability and Controllability: The Problem

Low Observability <------------------> High Observability
     Failure        Process State        User Input
Data Model Requirements
● Represent difference in cluster state
● Maximize observability
● Support convergence
● Do this while being extensible and reliable
Services
● Express desired state of the cluster
● Abstraction to control a set of containers
● Enumerates resources, network availability, placement
● Leave the details of runtime to the container process
● Implement these services by distributing processes across a cluster
[Diagram: a service's tasks distributed across Node 1, Node 2, and Node 3]
Service Spec: Protobuf Example

message ServiceSpec {
    // Task defines the task template this service will spawn.
    TaskSpec task = 2 [(gogoproto.nullable) = false];

    // UpdateConfig controls the rate and policy of updates.
    UpdateConfig update = 6;

    // Service endpoint specifies the user provided configuration
    // to properly discover and load balance a service.
    EndpointSpec endpoint = 8;
}
Service: Protobuf Example

message Service {
    ServiceSpec spec = 3;

    // UpdateStatus contains the status of an update, if one is in
    // progress.
    UpdateStatus update_status = 5;

    // Runtime state of service endpoint. This may be different
    // from the spec version because the user may not have entered
    // the optional fields like node_port or virtual_ip and it
    // could be auto allocated by the system.
    Endpoint endpoint = 4;
}
Task Model: Runtime

Prepare: set up resources
Start: start the task
Wait: wait until the task exits
Shutdown: stop the task cleanly
Task Spec: Protobuf Example

message TaskSpec {
    oneof runtime {
        NetworkAttachmentSpec attachment = 8;
        ContainerSpec container = 1;
    }

    // Resource requirements for the container.
    ResourceRequirements resources = 2;

    // RestartPolicy specifies what to do when a task fails or finishes.
    RestartPolicy restart = 4;

    // Placement specifies node selection constraints.
    Placement placement = 5;

    // Networks specifies the list of network attachment
    // configurations (which specify the network and per-network
    // aliases) that this task spec is bound to.
    repeated NetworkAttachmentConfig networks = 7;
}
Task: Protobuf Example

message Task {
    // Spec defines the desired state of the task as specified by the user.
    // The system will honor this and will *never* modify it.
    TaskSpec spec = 3 [(gogoproto.nullable) = false];

    // DesiredState is the target state for the task. It is set to
    // TaskStateRunning when a task is first created, and changed to
    // TaskStateShutdown if the manager wants to terminate the task. This field
    // is only written by the manager.
    TaskState desired_state = 10;

    TaskStatus status = 9 [(gogoproto.nullable) = false];
}
Task Model: Atomic Scheduling Unit of SwarmKit

Spec   = Desired State
Object = Current State

[Diagram: the Orchestrator creates Task_0, Task_1, …, Task_n, which flow to the Scheduler]
Kubernetes

type Deployment struct {
    // Specification of the desired behavior of the Deployment.
    // +optional
    Spec DeploymentSpec

    // Most recently observed status of the Deployment.
    // +optional
    Status DeploymentStatus
}
Parallel Concepts
Declarative

$ docker network create -d overlay backend
vpd5z57ig445ugtcibr11kiiz

$ docker service create -p 6379:6379 --network backend redis
pe5kyzetw2wuynfnt7so1jsyl

$ docker service scale serene_euler=3
serene_euler scaled to 3

$ docker service ls
ID            NAME          REPLICAS  IMAGE  COMMAND
pe5kyzetw2wu  serene_euler  3/3       redis
Task: Protobuf Example

message Task {
    TaskSpec spec = 3;
    string service_id = 4;
    uint64 slot = 5;
    string node_id = 6;
    TaskStatus status = 9;
    TaskState desired_state = 10;
    repeated NetworkAttachment networks = 11;
    Endpoint endpoint = 12;
    Driver log_driver = 13;
}

[On the slide, each field is color-coded by its owner: User, Orchestrator, Allocator, Scheduler, or Shared]
Task State

[Diagram: task state machine. Manager-side states: New → Allocated → Assigned. Worker-side (pre-run) states: Preparing → Ready → Starting, then Running. Terminal states: Complete, Shutdown, Failed, Rejected]
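Since a task only moves forward through these states, the machine can be modeled with ordered constants. The ordering and the helper functions below are assumptions made for illustration, not SwarmKit code.

```go
package main

import "fmt"

// TaskState mirrors the states on the slide. The concrete ordering of
// values is an assumption so that comparisons can enforce forward-only
// movement.
type TaskState int

const (
	New TaskState = iota
	Allocated
	Assigned
	Preparing
	Ready
	Starting
	Running
	Complete
	Shutdown
	Failed
	Rejected
)

// terminal reports whether a state can never transition again.
func terminal(s TaskState) bool {
	switch s {
	case Complete, Shutdown, Failed, Rejected:
		return true
	}
	return false
}

// advance allows only forward movement and refuses to leave a terminal
// state.
func advance(cur, next TaskState) (TaskState, error) {
	if terminal(cur) || next <= cur {
		return cur, fmt.Errorf("illegal transition %d -> %d", cur, next)
	}
	return next, nil
}

func main() {
	s := New
	for _, next := range []TaskState{Allocated, Assigned, Preparing, Ready, Starting, Running, Complete} {
		s, _ = advance(s, next)
	}
	fmt.Println("terminal:", terminal(s)) // terminal: true
}
```

Forward-only states are what make task status safe to report asynchronously: a stale status can never be confused with a later one.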
Observability and Controllability: The Problem

Low Observability <------------------> High Observability
     Failure        Task/Pod State        User Input
We’ve got a bunch of nodes… now what?
Task management is only half the fun. These tasks can be scheduled across a cluster for HA!
Manager <-> Worker Communication
● Not the mechanism, but the sequence and frequency
● Three scenarios important to us now
    ○ Node registration
    ○ Workload dispatching
    ○ State reporting
● Most approaches take the form of two patterns
    ○ Push, where the Managers push to the Workers
    ○ Pull, where the Workers pull from the Managers
Push Model
    Pros
    - Provides better control over communication rate
    - Managers decide when to contact Workers
    Cons
    - Requires a discovery mechanism
    - More failure scenarios
    - Harder to troubleshoot

Pull Model
    Pros
    - Simpler to operate: Workers connect to Managers and don't need to bind
    - Can easily traverse networks
    - Easier to secure
    - Fewer moving parts
    Cons
    - Workers must maintain connection to Managers at all times
Rate Control: Heartbeats
● Manager dictates heartbeat rate to Workers
● Rate is configurable (not by end user)
● Managers agree on the same rate by consensus via Raft
● Managers add jitter so pings are spread over time (avoid bursts)

Worker: "Ping?"  Manager: "Pong! Ping me back in 5.2 seconds."
Rate Control: Workloads
● Worker opens a gRPC stream to receive workloads
● Manager can send data whenever it wants to
● Manager will send data in batches
● Changes are buffered and sent in batches of 100 or every 100 ms, whichever occurs first
● Adds little delay (at most 100 ms) but drastically reduces the amount of communication

Worker: "Give me work to do."

100ms - [Batch of 12 ]
200ms - [Batch of 26 ]
300ms - [Batch of 32 ]
340ms - [Batch of 100]
360ms - [Batch of 100]
460ms - [Batch of 42 ]
560ms - [Batch of 23 ]
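The batching rule (flush at 100 changes or after 100 ms, whichever comes first) can be sketched with a channel loop. The `batcher` function and its channel wiring are invented for this example; SwarmKit's dispatcher implements the same idea differently.

```go
package main

import (
	"fmt"
	"time"
)

// batcher collects changes from in and flushes a batch to out when 100
// have accumulated or 100ms have passed since the first unflushed
// change, whichever happens first.
func batcher(in <-chan int, out chan<- []int) {
	const maxBatch = 100
	const maxDelay = 100 * time.Millisecond
	var buf []int
	var timer <-chan time.Time // nil channel: blocks until armed
	for {
		select {
		case v, ok := <-in:
			if !ok { // input closed: flush the remainder and stop
				if len(buf) > 0 {
					out <- buf
				}
				close(out)
				return
			}
			buf = append(buf, v)
			if len(buf) == 1 {
				timer = time.After(maxDelay) // arm timer on first item
			}
			if len(buf) >= maxBatch {
				out <- buf
				buf, timer = nil, nil
			}
		case <-timer: // deadline reached: flush whatever we have
			out <- buf
			buf, timer = nil, nil
		}
	}
}

func main() {
	in := make(chan int)
	out := make(chan []int)
	go batcher(in, out)
	go func() {
		for i := 0; i < 250; i++ {
			in <- i
		}
		close(in)
	}()
	for b := range out {
		fmt.Println("batch of", len(b))
	}
}
```

The worst case a change waits is one `maxDelay` period, matching the "at most 100ms" delay bound on the slide.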
The Raft Consensus Algorithm
● This talk is about SwarmKit, but Raft is also used by Kubernetes
● The implementation is slightly different but the behavior is the same
● SwarmKit and Kubernetes don't make the decision about how to handle leader election and log replication; Raft does!

github.com/kubernetes/kubernetes/tree/master/vendor/github.com/coreos/etcd/raft
github.com/docker/swarmkit/tree/master/vendor/github.com/coreos/etcd/raft

No really...
Single Source of Truth
● SwarmKit implements Raft directly, instead of via etcd
● The logs live in /var/lib/docker/swarm
● Or, for etcd, in /var/lib/etcd/member/wal
● Easy to observe (and even read)
Communication
● Worker can connect to any reachable Manager
● Followers will forward traffic to the Leader

[Diagram: one Leader and two Followers, with workers attached to each]

Reducing Network Load
● Followers multiplex all workers to the Leader using a single connection
● Backed by gRPC channels (HTTP/2 streams)
● Reduces Leader networking load by spreading the connections evenly

Example: On a cluster with 10,000 workers and 5 managers, each will only have to handle about 2,000 connections. Each follower will forward its 2,000 workers using a single socket to the leader.
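The arithmetic in that example can be checked directly; `connectionsPerManager` is just an illustrative helper.

```go
package main

import "fmt"

// connectionsPerManager assumes worker connections are spread evenly
// across managers; each follower then forwards its whole share to the
// leader over one multiplexed gRPC (HTTP/2) connection, so the leader
// sees only a handful of sockets instead of thousands.
func connectionsPerManager(workers, managers int) int {
	return workers / managers
}

func main() {
	fmt.Println(connectionsPerManager(10000, 5), "worker connections per manager")
}
```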
Leader Failure and Election (Raft!)
● Upon Leader failure, a new one is elected
● All managers start redirecting worker traffic to the new one
● Transparent to workers

[Diagram: the failed Leader steps down and one of the Followers is elected Leader]
Manager Failure (Worker POV)
● Manager sends the list of all managers' addresses to Workers
    - Manager 1 Addr
    - Manager 2 Addr
    - Manager 3 Addr
● When a new manager joins, all workers are notified
● Upon manager failure, workers will reconnect to a different manager

[Diagram: the worker's manager fails and the worker reconnects to a random manager from its list]
Presence
● Swarm still handles node management, even if you use the Kubernetes scheduler
● Manager Leader commits Worker state (Up or Down) into Raft
    ○ Propagates to all managers via Raft
    ○ Recoverable in case of leader re-election

Presence
● Heartbeat TTLs kept in Manager Leader memory
    ○ Otherwise, every ping would result in a quorum write
    ○ Leader keeps worker<->TTL in a heap (time.AfterFunc)
    ○ Upon leader failover, workers are given a grace period to reconnect
        ■ Workers considered Unknown until they reconnect
        ■ If they do, they move back to Up
        ■ If they don't, they move to Down
Sequencer
● Every object in the store has a Version field
● Version stores the Raft index when the object was last updated
● Updates must provide a base Version; they are rejected if it is out of date
● Provides CAS (compare-and-swap) semantics
● Also exposed through API calls that change objects in the store

Versioned Updates: Consistency

service := getCurrentService()
spec := service.Spec
spec.Image = "my.serv/myimage:mytag"
update(spec, service.Version)
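On the store side, the version check amounts to a compare-and-swap. The `store` and `Object` types below are a simplified sketch invented for this example, not SwarmKit's actual store code.

```go
package main

import (
	"errors"
	"fmt"
)

// Object pairs a spec (here just an image name) with the version at
// which it was last written; in SwarmKit this is the Raft index.
type Object struct {
	Version uint64
	Image   string
}

var errOutOfDate = errors.New("update out of sequence")

// store enforces CAS semantics: an update carries the base Version the
// client read, and is rejected if the stored object has moved on.
type store struct{ obj Object }

func (s *store) update(newImage string, baseVersion uint64) error {
	if baseVersion != s.obj.Version {
		return errOutOfDate // another writer got there first
	}
	s.obj.Image = newImage
	s.obj.Version++ // in SwarmKit this becomes the new Raft index
	return nil
}

func main() {
	s := &store{obj: Object{Version: 189, Image: "registry:2.3.0"}}
	fmt.Println(s.update("registry:2.4.0", 189)) // accepted: versions match
	fmt.Println(s.update("registry:2.5.0", 189)) // rejected: store is now at 190
}
```

The losing writer is expected to re-read the object, reapply its change on top, and retry with the new base Version.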
Sequencer

Original object (Version = Raft index when it was last updated):

    Service ABC, Version = 189
    Spec: Replicas = 4, Image = registry:2.3.0, ...

Update request (base Version = 189, changing the image):

    Service ABC, Version = 189
    Spec: Replicas = 4, Image = registry:2.4.0, ...

The base Version matches the stored object, so the update is accepted.

Updated object:

    Service ABC, Version = 190
    Spec: Replicas = 4, Image = registry:2.4.0, ...

A concurrent update request still based on Version = 189 (changing Replicas to 5):

    Service ABC, Version = 189
    Spec: Replicas = 5, Image = registry:2.3.0, ...

This request is rejected: its base Version (189) no longer matches the stored object's Version (190).