Running & Monitoring Docker at Scale

© 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.

November 12th, 2014 | Las Vegas

Monitoring and Running Docker Containers at Scale
Alexis Lê-Quôc, Datadog

Upload: datadogslides

Post on 09-Jul-2015

DESCRIPTION

Containerization (à la Docker) is increasing the elastic nature of cloud infrastructure by an order of magnitude. If you have adopted Docker, or are considering it, you are probably facing questions like:

- How many containers can you run on a given Amazon EC2 instance type?
- Which metric should you look at to measure contention?
- How do you manage fleets of containers at scale?

Datadog’s CTO, Alexis Lê-Quôc, presents the challenges and benefits of running Docker containers at scale. Alexis explains how to use quantitative performance patterns to monitor your infrastructure at the new level of magnitude and increased complexity introduced by containerization.

TRANSCRIPT

Page 1: Running & Monitoring Docker at Scale

© 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.

November 12th, 2014 | Las Vegas

Monitoring and Running Docker Containers at Scale Alexis Lê-Quôc, Datadog

Page 2: Running & Monitoring Docker at Scale

@alq — CTO at Datadog

Page 3: Running & Monitoring Docker at Scale

Datadog

•  Monitoring service
•  Made for the cloud
•  Aggregates everything
•  Support for Docker (since 1.0)

Page 4: Running & Monitoring Docker at Scale

Goals

1. Present key Docker metrics
2. Explain operational complexity
3. Rethink monitoring of Docker containers

Page 5: Running & Monitoring Docker at Scale

Agenda

•  A (very) brief history of containers
•  Docker containers on AWS
•  Key Docker metrics
•  Operational complexity
•  Monitoring Docker effectively
•  Demo

Page 6: Running & Monitoring Docker at Scale

A brief history of containers

Page 7: Running & Monitoring Docker at Scale

Containers in a nutshell

•  Been around for a long time
   –  jails, zones, cgroups
•  No full-virtualization overhead
•  Used for runtime isolation (e.g. jails)
•  Docker: escape from dependency hell

Page 8: Running & Monitoring Docker at Scale

Escape from dependency hell

(Diagram, in order: a.out, shared libs, packages, omnibus, Docker)

Page 9: Running & Monitoring Docker at Scale

Container ~ single static binary

Process    Container          Host
Source     Dockerfile         Chef/Puppet, Kickstart
.TEXT      /var/lib/docker    Full distro
PID        Name/ID            Hostname

Page 10: Running & Monitoring Docker at Scale

Docker on AWS: some numbers

Page 11: Running & Monitoring Docker at Scale

(Some) Docker use cases

•  Continuous integration
   –  eliminate dependency variance
   –  same code from dev laptop to production
   –  git-like workflow
•  Continuous delivery
   –  (quasi) stateless components
   –  web workers, video encoders, etc.
   –  not for data stores (Amazon RDS is a better fit)

Page 12: Running & Monitoring Docker at Scale

Instance types

Instance type    Share
c3.2xl           20%
m3.medium        20%
m3.large         19%
m3.xlarge        13%
m1.large         8%
the rest         21%

Source: Datadog, October 2014

Page 13: Running & Monitoring Docker at Scale

Containers per instance

•  Average: 5 (October 2014)
•  Highly dependent on the workload
•  This is just the beginning…
•  Expect higher container density going forward

Source: Datadog, October 2014

Page 14: Running & Monitoring Docker at Scale

Key Docker metrics

Page 15: Running & Monitoring Docker at Scale

Monitoring fundamentals

Work
•  Measures the amount of value created
•  What your customers care about
•  Database: queries answered
•  Web server: requests served
•  Queue: wait time distribution

Resource consumption
•  Measures the amount of resources consumed to create value
•  What your customers don’t care about
•  Database: I/O throughput
•  Web server: active connections
•  OS: CPU utilization
•  Container: memory footprint

Page 16: Running & Monitoring Docker at Scale

Docker containers consume…

•  Memory
•  CPU
•  I/O
•  Network

Page 17: Running & Monitoring Docker at Scale

Memory

Name                       Why it matters
pgmajfault                 Paging to/from disk is slow
pgfault                    Context switches hurt application performance
resident set size (rss)    Too much RSS causes paging and swapping
swap                       Swapping in/out is slow
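As a rough illustration of where these four values come from: on a host using cgroup v1, each container's memory counters are typically exposed in a `memory.stat` file of `key value` lines. The sketch below parses such text. The path mentioned in the comment and the sample values are assumptions for illustration, and `swap` only appears when swap accounting is enabled.

```python
# Assumed location (cgroup v1, cgroupfs driver); it varies by distribution and Docker version:
#   /sys/fs/cgroup/memory/docker/<container id>/memory.stat

KEY_MEMORY_METRICS = ("rss", "swap", "pgfault", "pgmajfault")


def parse_memory_stat(text):
    """Parse 'key value' lines into a dict of integers, ignoring anything else."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def key_memory_metrics(text):
    """Keep only the metrics listed on the slide; missing ones map to None."""
    stats = parse_memory_stat(text)
    return {name: stats.get(name) for name in KEY_MEMORY_METRICS}


# Made-up sample content, not measured data.
SAMPLE_MEMORY_STAT = """\
cache 1048576
rss 52428800
swap 0
pgfault 12345
pgmajfault 7
"""

print(key_memory_metrics(SAMPLE_MEMORY_STAT))
```

`pgfault` and `pgmajfault` are cumulative counters, so in practice they are usually turned into rates by differencing two samples.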

Page 18: Running & Monitoring Docker at Scale

CPU

Name      Why it matters
user      Measures work being done
system    System calls, a necessary evil
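A minimal sketch of how user and system time could be derived, assuming the cgroup v1 `cpuacct.stat` file, which reports cumulative `user` and `system` ticks. The tick rate of 100 per second is an assumption (a common Linux value), and the two samples are invented.

```python
USER_HZ = 100  # assumption: ticks per second; os.sysconf("SC_CLK_TCK") reports the real value


def parse_cpuacct_stat(text):
    """Parse 'user <ticks>' and 'system <ticks>' lines into a dict."""
    ticks = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            ticks[parts[0]] = int(parts[1])
    return ticks


def cpu_seconds_between(before, after, user_hz=USER_HZ):
    """CPU seconds spent in user and system mode between two samples."""
    b = parse_cpuacct_stat(before)
    a = parse_cpuacct_stat(after)
    return {mode: (a[mode] - b[mode]) / user_hz for mode in ("user", "system")}


# Invented samples taken some interval apart.
BEFORE = "user 1000\nsystem 400\n"
AFTER = "user 1600\nsystem 450\n"

print(cpu_seconds_between(BEFORE, AFTER))
```

Dividing each value by the wall-clock interval between the samples gives the fraction of a core used in each mode.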

Page 19: Running & Monitoring Docker at Scale

Block I/O

Name                      Why it matters
blkio.io_service_bytes    I/O is (often) the bottleneck
blkio.io_queued           Measures saturation
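For illustration, the cgroup v1 `blkio.io_service_bytes` file lists one `<major>:<minor> <operation> <bytes>` line per device and operation, followed by a grand total. The parser below is a sketch under that assumption; the sample content is invented.

```python
from collections import defaultdict


def parse_io_service_bytes(text):
    """Return {device: {operation: bytes}} from blkio.io_service_bytes content."""
    per_device = defaultdict(dict)
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skips the trailing 'Total <bytes>' summary line
        device, operation, value = parts
        per_device[device][operation.lower()] = int(value)
    return dict(per_device)


# Invented sample for a single block device (major:minor 8:0).
SAMPLE_IO_SERVICE_BYTES = """\
8:0 Read 4096
8:0 Write 8192
8:0 Sync 8192
8:0 Async 4096
8:0 Total 12288
Total 12288
"""

print(parse_io_service_bytes(SAMPLE_IO_SERVICE_BYTES))
```

These are cumulative byte counts, so throughput comes from the difference between two samples divided by the elapsed time.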

Page 20: Running & Monitoring Docker at Scale

Network

Name             Why it matters
tx/rx_errors     Because… errors are bad.
tx/rx_dropped    Measures contention
tx/rx_bytes      Measures traffic
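One place these counters can be read is `/proc/net/dev`, as seen from inside the container's network namespace. The sketch below parses that format (two header lines, then one line per interface with 8 receive and 8 transmit fields). The sample content is invented, and reading it per container is an assumption about the collection setup.

```python
def parse_proc_net_dev(text):
    """Return {interface: {metric: value}} for the counters named on the slide."""
    stats = {}
    for line in text.splitlines()[2:]:  # the first two lines are column headers
        name, sep, rest = line.partition(":")
        fields = rest.split()
        if not sep or len(fields) < 16:
            continue
        values = [int(f) for f in fields[:16]]
        stats[name.strip()] = {
            "rx_bytes": values[0],
            "rx_errors": values[2],
            "rx_dropped": values[3],
            "tx_bytes": values[8],
            "tx_errors": values[10],
            "tx_dropped": values[11],
        }
    return stats


# Invented sample content.
SAMPLE_PROC_NET_DEV = """\
Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
  eth0: 1000 10 1 2 0 0 0 0 2000 20 3 4 0 0 0 0
    lo: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
"""

print(parse_proc_net_dev(SAMPLE_PROC_NET_DEV)["eth0"])
```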

Page 21: Running & Monitoring Docker at Scale

How to collect metrics

•  https://github.com/google/cadvisor

Page 22: Running & Monitoring Docker at Scale

Operational complexity

Page 23: Running & Monitoring Docker at Scale

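Combinatorial multiplication

(Diagram: three stacks side by side. First: hardware, OS, off-the-shelf software, your application. Second: hardware, hypervisor, several OSes, each with off-the-shelf software and an app. Third: hardware, hypervisor, several OSes, each running multiple containers, labeled A and O.)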

Page 24: Running & Monitoring Docker at Scale

Operational complexity

•  Average containers per instance: N (N=5, 10/2014)
•  N-times as many “hosts” to manage
•  Affects
   –  provisioning: prep’ing & building containers
   –  configuration: passing config to containers
   –  orchestration: deciding where/when containers run
   –  monitoring: making sure containers run properly

Page 25: Running & Monitoring Docker at Scale

Monitoring: metric counts on Amazon EC2

•  1 Amazon EC2 instance
   –  10 CloudWatch metrics
•  1 operating system (e.g. Linux)
   –  100 metrics
•  1 container
   –  50 metrics
•  1 off-the-shelf application
   –  ~50 metrics

Page 26: Running & Monitoring Docker at Scale

Combinatorial multiplication

100 instances → 500 containers

Assuming only 5 containers per instance

Page 27: Running & Monitoring Docker at Scale

Combinatorial multiplication

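160 metrics per instance without containers (10 CloudWatch + 100 OS + 50 application)

410 metrics per instance with containers (160 + 5 × 50 container metrics)

Assuming only 5 containers per instance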
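The arithmetic behind these figures can be sketched as follows. It is one reading of the slide, assuming the per-layer counts given earlier (10 CloudWatch, 100 OS, about 50 per application, 50 per container) and one off-the-shelf application per instance; the constant and function names are made up for this example.

```python
CLOUDWATCH_METRICS = 10   # per EC2 instance
OS_METRICS = 100          # per operating system
APP_METRICS = 50          # per off-the-shelf application (approximate)
CONTAINER_METRICS = 50    # per container


def metrics_per_instance(containers):
    """Metrics emitted by one instance running the given number of containers."""
    return CLOUDWATCH_METRICS + OS_METRICS + APP_METRICS + containers * CONTAINER_METRICS


print(metrics_per_instance(0))        # 160: no containers
print(metrics_per_instance(5))        # 410: 5 containers per instance
print(100 * metrics_per_instance(5))  # 41000: a fleet of 100 such instances
```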

Page 28: Running & Monitoring Docker at Scale

Velocity

EC2 instance half-life: hours, days, months
Container half-life: minutes, hours, days

Page 29: Running & Monitoring Docker at Scale

Aggravating factors

•  Hub-based provisioning
   –  new images every day
•  Autonomic orchestration
   –  from imperative to declarative
   –  automated
   –  individual containers don’t matter
   –  e.g. Kubernetes, Mesos

Page 30: Running & Monitoring Docker at Scale

A lot more, A lot faster.

Page 31: Running & Monitoring Docker at Scale

If your monitoring is still centered on individual hosts or instances…

Page 32: Running & Monitoring Docker at Scale

Host-centric monitoring

(Diagram: a stack of hypervisor, OSes, and containers labeled A and O, with two monitors and a GAP between them.)

Page 33: Running & Monitoring Docker at Scale

A lot more pain, A lot faster.

Page 34: Running & Monitoring Docker at Scale

Monitoring containers effectively

Page 35: Running & Monitoring Docker at Scale

A new approach to container monitoring

Page 36: Running & Monitoring Docker at Scale

Layers + Tags

Page 37: Running & Monitoring Docker at Scale

Layers of monitoring

(Diagram: one monitor alongside the stack of hypervisor, OSes, and containers labeled A and O.)

Page 38: Running & Monitoring Docker at Scale

Layers of monitoring

(Diagram: the same stack with three monitoring layers: CloudWatch, infrastructure monitoring, APM.)

Page 39: Running & Monitoring Docker at Scale

Layers of monitoring

(Diagram: the same stack and monitoring layers (CloudWatch, infrastructure monitoring, APM), annotated with example metrics: cpu/net/io, filesystem, docker mem, docker cpu, db queries, web requests, app throughput.)

Page 40: Running & Monitoring Docker at Scale

Layers of monitoring

•  Access to metrics from all the layers
•  Amazon CloudWatch, OS metrics, Docker metrics, app metrics in 1 place
•  Shared timeline

Page 41: Running & Monitoring Docker at Scale

If your monitoring does not cover all layers, pain.

Page 42: Running & Monitoring Docker at Scale

Tags

You use them already

Page 43: Running & Monitoring Docker at Scale

Tags

•  Monitoring is like Auto-Scaling Groups
•  Monitoring is like Docker orchestration
•  From imperative to declarative
•  Query-based
•  Queries operate on tags

Page 44: Running & Monitoring Docker at Scale

Monitoring with tags and queries

“Monitor all Docker containers running image web”
“… in region us-west-2 across all availability zones”
“… and make sure resident set size < 1GB on c3.xl”

Page 45: Running & Monitoring Docker at Scale

Monitoring with tags and queries

“Monitor all Docker containers running image web”
“… in region us-west-2 across all availability zones”
“… and make sure resident set size < 1GB on c3.xl”
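A minimal sketch of what such a tag-based query could look like in code. It is not Datadog's query language: the `Sample` type, the tag names (`image:web`, `region:us-west-2`, `instance-type:c3.xlarge`) and the sample data are all hypothetical. The point is that containers are selected by tags, never by host name.

```python
from dataclasses import dataclass

GB = 1024 ** 3


@dataclass(frozen=True)
class Sample:
    name: str          # container name, used only for reporting
    tags: frozenset    # e.g. {"image:web", "region:us-west-2"}
    rss_bytes: int     # resident set size


def select(samples, required_tags):
    """Keep samples that carry every required tag; other tags are ignored."""
    required = frozenset(required_tags)
    return [s for s in samples if required <= s.tags]


def over_limit(samples, required_tags, limit_bytes):
    """Names of selected containers whose RSS has reached the limit."""
    return [s.name for s in select(samples, required_tags) if s.rss_bytes >= limit_bytes]


# Hypothetical fleet spread over two regions and two images.
FLEET = [
    Sample("web-1", frozenset({"image:web", "region:us-west-2", "az:us-west-2a", "instance-type:c3.xlarge"}), GB // 2),
    Sample("web-2", frozenset({"image:web", "region:us-west-2", "az:us-west-2b", "instance-type:c3.xlarge"}), 3 * GB // 2),
    Sample("web-3", frozenset({"image:web", "region:us-east-1", "az:us-east-1a", "instance-type:c3.xlarge"}), 2 * GB),
    Sample("db-1", frozenset({"image:db", "region:us-west-2", "az:us-west-2a", "instance-type:c3.xlarge"}), 3 * GB),
]

QUERY = {"image:web", "region:us-west-2", "instance-type:c3.xlarge"}
print(over_limit(FLEET, QUERY, 1 * GB))
```

Leaving the availability-zone tag out of the query is what makes it apply "across all availability zones".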

Page 46: Running & Monitoring Docker at Scale

Monitoring with tags and queries

“Monitor all Docker containers running image web”
“… in region us-west-2 across all availability zones”
“… that use more than 1.5x the average on c3.xl”
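The relative condition ("more than 1.5x the average") can be sketched like this, assuming the containers have already been narrowed down by tags and that the metric is resident set size in MB. The function name and the numbers are invented for illustration.

```python
from statistics import mean


def above_average(rss_by_container, factor=1.5):
    """Names of containers whose value exceeds factor times the group average."""
    if not rss_by_container:
        return []
    threshold = factor * mean(rss_by_container.values())
    return sorted(name for name, rss in rss_by_container.items() if rss > threshold)


# Invented RSS values in MB for containers matching the tag query.
RSS_MB = {"web-1": 400, "web-2": 420, "web-3": 380, "web-4": 1200}

print(above_average(RSS_MB))
```

Note that the average here includes the outlier itself, so a single very large container raises the threshold for the whole group; a median would be less sensitive to that.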

Page 47: Running & Monitoring Docker at Scale

“Dude, where’s my server?”

Page 48: Running & Monitoring Docker at Scale

“Dude, where’s my container?”

Page 49: Running & Monitoring Docker at Scale

If your monitoring is not tag-based, pain.

Page 50: Running & Monitoring Docker at Scale

Demo

Page 51: Running & Monitoring Docker at Scale

Take-aways

1. Docker increases operational complexity by an order of magnitude unless…
2. You have layered monitoring, from the instance to the container and to the application, and…
3. You monitor using tags and queries

Page 52: Running & Monitoring Docker at Scale

Please give us your feedback on this presentation

Join the conversation on Twitter with #reinvent