
Page 1

Keeping an Eye on the PE Stack
An Introduction to Measuring and Tuning PE Performance
Charlie Sharpsteen, Puppet Inc.

Page 2

Overview

• How do I measure PE performance? What sources of data are available?

• What numbers are actually important?

• What settings can I adjust when important metrics start showing unhealthy trends?

Page 3


Gathering Data From PE Services
JVM Logging and Metrics

Page 4

PE Server Components

• TrapperKeeper services (running in the JVM): Puppet Server, PuppetDB, Console Services, Orchestration Services

• Other JVM services: ActiveMQ

• Non-JVM services: PostgreSQL, NGINX

Mostly Java based, with shared logging and metrics interfaces.

Page 5

TrapperKeeper Logging

• Configuration for main logs can be found in: /etc/puppetlabs/<service name>/logback.xml

• Controls output destinations, log levels and message formatting.

• Ship to a log aggregator to provide context for investigations.

• Default log pattern is: Date Level [Java Namespace] message

• Puppet Server also includes thread ID: Date Level [thread] [Java Namespace] message

• Thread ID is useful for grouping activity related to a single request.

Page 6

TrapperKeeper Logging

• Configuration for HTTP access logs can be found in: /etc/puppetlabs/<service name>/request-logging.xml

• Default format is Apache Combined Log + request duration

• Easily parsed by most log processors.

• Can add additional bits of information such as request headers.

Page 7

TrapperKeeper Metrics

• Metrics are recorded using JMX MBeans.

• Metrics that measure activity over time are weighted to represent the last 5 minutes.

• Metrics can be retrieved via the JMX protocol.

  • Full access to all available metrics and all available measurements.

  • Tools such as JConsole and JVisualVM can be attached.

  • Requires additional ports to be opened, and configuration can be complex. Java tools only.

• Metrics can be retrieved as JSON over HTTP:

  • For a curated set of common metrics: status/v1?level=debug

  • For access to all available metrics: metrics/v1/mbeans
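
• A minimal sketch of pulling both endpoints from Puppet Server with curl, run on the master itself. The full URL paths, port 8140, and the use of the master's own certificate for client authentication are assumptions about a default monolithic PE install:

  SSLDIR=/etc/puppetlabs/puppet/ssl
  HOST="$(puppet config print certname)"

  # Curated set of common metrics from the status service
  curl --cacert "$SSLDIR/certs/ca.pem" \
       --cert "$SSLDIR/certs/$HOST.pem" \
       --key "$SSLDIR/private_keys/$HOST.pem" \
       "https://$HOST:8140/status/v1/services?level=debug"

  # Every registered MBean and its current values (a large response).
  # On PE 2016.4.0 this endpoint must first be enabled; see the later slide.
  curl --cacert "$SSLDIR/certs/ca.pem" \
       --cert "$SSLDIR/certs/$HOST.pem" \
       --key "$SSLDIR/private_keys/$HOST.pem" \
       "https://$HOST:8140/metrics/v1/mbeans"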

Page 8

TrapperKeeper Configuration

• Configuration files are stored under: /etc/puppetlabs/<service name>/conf.d

• Most important settings are managed by puppet_enterprise::profile classes and are tunable via the Console and Hiera.

• JVM settings are specified in /etc/sysconfig or /etc/default

• The JVM memory limit, -Xmx, is the primary tunable setting. Enable the G1 garbage collector when using limits higher than 10 GB: -XX:+UseG1GC

• These flags are configurable via the java_args parameter on profile classes.
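
• As a point of reference, a sketch of where these flags end up on disk. The file name below is for RedHat-family systems (Debian-family systems use /etc/default/pe-puppetserver) and the values shown are purely illustrative:

  grep JAVA_ARGS /etc/sysconfig/pe-puppetserver
  # e.g. JAVA_ARGS="-Xms2g -Xmx2g"
  #
  # An illustrative target for a large master: a 12 GB heap with the G1 collector.
  # Make the change through the java_args profile parameter rather than by editing
  # this file, since PE manages it:
  # JAVA_ARGS="-Xms12g -Xmx12g -XX:+UseG1GC"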

Page 9

Puppet Server
It’s all about the JRubies.

Page 10

Puppet Server Metrics Overview

● JVM resource usage: status-service

  ● JMX namespace: java.lang:*

● HTTP request times per endpoint: pe-master

  ● JMX namespace: puppetserver:name=puppetlabs.<fqdn>.http.*

● Catalog compilation metrics: pe-puppet-profiler

  ● JMX namespaces:
    puppetserver:name=puppetlabs.<fqdn>.compiler.*
    puppetserver:name=puppetlabs.<fqdn>.functions.*
    puppetserver:name=puppetlabs.<fqdn>.puppetdb.*

● JRuby metrics: pe-jruby-metrics

  ● JMX namespace: puppetserver:name=puppetlabs.<fqdn>.jruby.*

Page 11

New PE 2016.4.0 Features

● The metrics/v1/mbeans endpoint has been added to Puppet Server. Must be enabled via Hiera: puppet_enterprise::master::puppetserver::metrics_webservice_enabled: true

● The Graphite metrics reporter has been optimized and extended:

● Only a subset of available metrics are reported by default.

● Reported metrics can be customized using the metrics_puppetserver_metrics_allowed parameter of the puppet_enterprise::profile::master class.

Page 12

JRuby Metrics

● Almost all Puppet Server requests must be handled by a JRuby instance — this makes JRuby availability the primary performance bottleneck.

● num-free-jrubies

  ● Measures spare capacity for incoming requests.

● average-wait-time

  ● Should never grow to a significant fraction of HTTP request times.

● Impacted by agent checkin distribution, resource availability, Puppet plugins and code.
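
● A minimal sketch of watching these two numbers from the curated status output. The JSON path under pe-jruby-metrics is an assumption that may shift between PE versions, and the jq utility is not installed by default:

  SSLDIR=/etc/puppetlabs/puppet/ssl
  HOST="$(puppet config print certname)"

  curl --silent \
       --cacert "$SSLDIR/certs/ca.pem" \
       --cert "$SSLDIR/certs/$HOST.pem" \
       --key "$SSLDIR/private_keys/$HOST.pem" \
       "https://$HOST:8140/status/v1/services?level=debug" |
    jq '."pe-jruby-metrics".status.experimental.metrics | {free: ."num-free-jrubies", wait: ."average-wait-time"}'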

Page 13

Agent Checkin Activity

● Agents check in one runinterval after the start of their last run — this can lead to pile-ups or “thundering herds”. Be careful of:

● Starting or re-starting a group of agents without the splay setting enabled.

● Triggering a group of agent runs via: mco puppet runonce

● Monitor average-requested-jrubies and Puppet Server access logs for spikes in agent activity.

● Use PostgreSQL to pull a histogram of agent start times from report data:

  sudo su - pe-postgres -s /bin/bash -c "psql -d pe-puppetdb"

  SELECT date_part('minute', start_time), count(*)
  FROM reports
  WHERE start_time BETWEEN '2016-10-20 13:30:00' AND '2016-10-20 14:30:00'
  GROUP BY date_part('minute', start_time)
  ORDER BY date_part('minute', start_time) ASC;

Page 14

Re-balancing Agent Checkins

● Use MCollective to orchestrate a batched re-start:

  su - peadmin -c "mco rpc service stop service=puppet"
  su - peadmin -c "mco rpc service start service=puppet --batch 1 \
    --batch-sleep <runinterval in seconds / #nodes>"

● Batching is not necessary if the agents have splay enabled.

● For a stable distribution that isn’t affected by re-starts, puppet agent -t can be run on a schedule determined by the fqdn_rand() function instead of using the service.

● Load due to agent activity can be cut dramatically by shifting to the Direct Puppet workflow where Orchestrator or MCollective are used to push catalog updates.
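
● A sketch of deriving the batch-sleep value used in the first bullet, assuming it runs on a monolithic master where PuppetDB listens on localhost:8080 and the jq utility is available. The arithmetic simply spreads the restarts across one full run interval:

  RUNINTERVAL="$(puppet config print runinterval)"   # agent run interval, in seconds
  NODES="$(curl -s http://localhost:8080/pdb/query/v4/nodes | jq 'length')"

  su - peadmin -c "mco rpc service start service=puppet --batch 1 --batch-sleep $(( RUNINTERVAL / NODES ))"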

Page 15

Adding More JRuby Capacity

● JRuby count is set via jruby_max_active_instances, constrained by available CPU and RAM:

● Compile masters tend to top out around NCPU - 1. Monolithic masters need to share with PuppetDB and tend more towards (NCPU / 2 - 1).

● RAM requirements are 512 MB per JRuby, but may need to be increased if catalog compilation uses large datasets or dozens of environments are in use.

● The environment_timeout setting can be used to reduce the CPU requirements of catalog compilation. Set to 0 globally and unlimited for long-lived environments with lots of agents.

● Each environment using an unlimited timeout will add to the per-JRuby RAM requirements. Monitor memory usage of pre-2016.4.0 installations closely when using unlimited timeouts.

● Code Manager should be enabled when an unlimited timeout is used so that caches are flushed when new code is deployed.
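
● A sketch of that split, with the global default checked via puppet config and a per-environment override in environment.conf. The environment name, path, and values are illustrative and assume the standard code directory:

  # Global default: 0, i.e. recompile the environment on every request
  puppet config print environment_timeout --section master

  # Long-lived environment used by many agents: cache until code is redeployed
  grep environment_timeout /etc/puppetlabs/code/environments/production/environment.conf
  # environment_timeout = unlimited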

Page 16

Investigating Compile Times

● PE Puppet Server tracks compilation time on several different levels: per-node, per-environment, per-resource, per-function, and more.

● Top 10 resources and functions are available via the status API and Puppet Server performance dashboard: https://<puppetmaster>:8140/puppet/experimental/dashboard.html

● Full access available through JMX and the metrics API.

● Detailed timing on catalog compilation can be obtained by setting the Puppet Server log level to DEBUG and running puppet agent -t --profile on nodes of interest.

Page 17

Investigating Agent Run Times

● Agent run summaries are stored at: /opt/puppetlabs/puppet/cache/state/last_run_summary.yaml

● Summaries are also stored by PuppetDB and can be viewed from the PE Console, or queried: reports[metrics] { latest_report? = true and certname = '<node name>' }

● The time section shows the amount of time taken per resource type, along with config_retrieval, which measures how long it took to receive a catalog.

● Per-resource timing can be logged by running: puppet agent -t --evaltrace
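
● The reports query above can also be run from the command line against PuppetDB's query API. A minimal sketch, assuming it runs on the PuppetDB host (plain-HTTP listener on localhost:8080) and using a hypothetical certname:

  curl -G http://localhost:8080/pdb/query/v4 \
    --data-urlencode 'query=reports[metrics] { latest_report? = true and certname = "agent01.example.com" }'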

Page 18

PuppetDB
Processing Time and Storage Space

Page 19

PuppetDB Storage Usage

● Monitor disk space!
  /opt/puppetlabs/server/data/postgresql/
  /opt/puppetlabs/server/data/puppetdb/

● If disk space runs out, there are two options for returning space to the operating system:

● The existing volume can be enlarged so that a VACUUM FULL can be run.

● Alternately, a new volume can be attached for a database backup and restore.

● The primary source of disk usage is report storage; this can be tuned via the report-ttl setting.

● For infrastructure with high node turnover, consider setting node-purge-ttl to remove data related to decommissioned nodes.
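
● Both TTLs live in PuppetDB's database configuration. A sketch of what that looks like on disk, with purely illustrative values; in PE, change these through the Console or Hiera so the settings persist across puppet runs:

  grep -E 'report-ttl|node-purge-ttl' /etc/puppetlabs/puppetdb/conf.d/database.ini
  # report-ttl = 14d
  # node-purge-ttl = 14d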

Page 20

PuppetDB Command Processing

● Every PuppetDB operation, aside from queries, is executed by an asynchronous command processing queue. This queue is managed by an internal ActiveMQ server:

  org.apache.activemq:type=Broker,brokerName=localhost,destinationType=Queue,destinationName=puppetlabs.puppetdb.commands

● Important metrics:

● Backlog of commands waiting for processing: QueueSize

● Largest command seen: MaxMessageSize

● Available memory for in-flight commands: MemoryPercentUsage

● Increase PuppetDB heap size along with the command-processing.memory-usage setting if the percentage spikes close to 100%. This will prevent ActiveMQ from paging commands to disk.
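
● The queue MBean named above can be read without JMX through PuppetDB's own metrics API. A sketch assuming the default localhost:8080 listener on the PuppetDB host; drop the jq pipe to see the full JSON:

  curl -s 'http://localhost:8080/metrics/v1/mbeans/org.apache.activemq:type=Broker,brokerName=localhost,destinationType=Queue,destinationName=puppetlabs.puppetdb.commands' |
    jq '{QueueSize, MaxMessageSize, MemoryPercentUsage}'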

Page 21

PuppetDB Command Processing

● Command processing rates:
  puppetlabs.puppetdb.mq:name=global.processing-time
  puppetlabs.puppetdb.storage:name=replace-facts-time
  puppetlabs.puppetdb.storage:name=replace-catalog-time
  puppetlabs.puppetdb.storage:name=store-report-time

● Additional processing threads can be added using the command-processing.threads setting.

● On a monolithic install, PuppetDB processing threads must be balanced against Puppet Server JRubies and the number of CPU cores available.
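
● The thread count lives in PuppetDB's config.ini. A quick sketch of checking the current value on disk; the number shown is illustrative, and in PE the setting should be managed via the Console or Hiera rather than edited directly:

  grep -A1 '\[command-processing\]' /etc/puppetlabs/puppetdb/conf.d/config.ini
  # [command-processing]
  # threads = 2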

Page 22

PostgreSQL Query Performance

● PostgreSQL configuration can be found in: /opt/puppetlabs/server/data/postgresql/9.4/data/postgresql.conf

● Add settings to improve logging around slow queries:
  log_min_duration_statement = 3000ms
  log_temp_files = 0

● If a temp file shows up in the logs, it means Postgres had to perform an operation outside of RAM, which is slow. Consider increasing the work_mem setting so that it is larger than the size of the temp files being logged.

● If query performance has been dropping over time, a database VACUUM may be needed:

  su - pe-postgres -s /bin/bash -c "vacuumdb --analyze --verbose --all"

Page 23

Resources

This Slide Deck: https://goo.gl/ytzCA5

Page 24

Resources

Logging:

• Directing Output: http://logback.qos.ch/manual/appenders.html

• Formatting Main Logs: http://logback.qos.ch/manual/layouts.html

• Formatting Access Logs: http://logback.qos.ch/manual/layouts.html#logback-access

JMX:

• Configuration: https://docs.oracle.com/javase/8/docs/technotes/guides/management/agent.html

• Metric Polling Tool: https://github.com/jmxtrans/jmxtrans

Page 25

Resources

Puppet Server:

• Metrics Reference: https://docs.puppet.com/pe/2016.4/puppet_server_metrics.html

• Configuration Reference: https://docs.puppet.com/puppetserver/2.6/configuration.html

• Direct Puppet Workflow: https://docs.puppet.com/pe/2016.4/direct_puppet_workflow.html

PuppetDB:

• Metrics Reference: https://docs.puppet.com/puppetdb/4.2/api/metrics/v1/mbeans.html

• Configuration Reference: https://docs.puppet.com/puppetdb/4.2/configure.html

• Backup Procedures: https://docs.puppet.com/pe/2016.4/maintain_console-db.html

• PostgreSQL Maintenance: https://github.com/npwalker/pe_databases
