
Page 1

Keeping an Eye on the PE Stack
An Introduction to Measuring and Tuning PE Performance
Charlie Sharpsteen, Puppet Inc.

Page 2

Overview

• How do I measure PE performance? What sources of data are available?

• What numbers are actually important?

• What settings can I adjust when important metrics start showing unhealthy trends?

Page 3


Gathering Data From PE Services
JVM Logging and Metrics

Page 4

PE Server Components

• TrapperKeeper services (running in the JVM): Puppet Server, PuppetDB, Console Services, Orchestration Services

• Other JVM services: ActiveMQ

• Non-JVM services: PostgreSQL, NGINX

Mostly Java based, with shared logging and metrics interfaces.

Page 5

TrapperKeeper Logging

• Configuration for main logs can be found in: /etc/puppetlabs/<service name>/logback.xml

• Controls output destinations, log levels and message formatting.

• Ship to a log aggregator to provide context for investigations.

• Default log pattern is: Date Level [Java Namespace] message

• Puppet Server also includes thread ID: Date Level [thread] [Java Namespace] message

• Thread ID is useful for grouping activity related to a single request.

Page 6

TrapperKeeper Logging

• Configuration for HTTP access logs can be found in: /etc/puppetlabs/<service name>/request-logging.xml

• Default format is Apache Combined Log + request duration

• Easily parsed by most log processors.

• Can add additional bits of information such as request headers.

Page 7

TrapperKeeper Metrics

• Metrics are recorded using JMX MBeans.

• Metrics that measure activity over time are weighted to represent the last 5 minutes.

• Metrics can be retrieved via the JMX protocol.

  • Full access to all available metrics and all available measurements.

  • Tools such as JConsole and JVisualVM can be attached.

  • Requires additional ports to be opened, and configuration can be complex. Java tools only.

• Metrics can be retrieved as JSON over HTTP:

  • For a curated set of common metrics: status/v1?level=debug

  • For access to all available metrics: metrics/v1/mbeans
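
• A minimal sketch of pulling both endpoints from Puppet Server with curl, run on the master itself. The full URL paths, port 8140, and the use of the master's own certificate for client authentication are assumptions about a default monolithic PE install:

  SSLDIR=/etc/puppetlabs/puppet/ssl
  HOST="$(puppet config print certname)"

  # Curated set of common metrics from the status service
  curl --cacert "$SSLDIR/certs/ca.pem" \
       --cert "$SSLDIR/certs/$HOST.pem" \
       --key "$SSLDIR/private_keys/$HOST.pem" \
       "https://$HOST:8140/status/v1/services?level=debug"

  # Every registered MBean and its current values (a large response).
  # On PE 2016.4.0 this endpoint must first be enabled; see the later slide.
  curl --cacert "$SSLDIR/certs/ca.pem" \
       --cert "$SSLDIR/certs/$HOST.pem" \
       --key "$SSLDIR/private_keys/$HOST.pem" \
       "https://$HOST:8140/metrics/v1/mbeans"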

Page 8

TrapperKeeper Configuration

• Configuration files are stored under: /etc/puppetlabs/<service name>/conf.d

• Most important settings are managed by puppet_enterprise::profile classes and are tunable via the Console and Hiera.

• JVM settings are specified in /etc/sysconfig or /etc/default

• The JVM memory limit, -Xmx, is the primary tunable setting. Enable the G1 garbage collector when using limits higher than 10 GB: -XX:+UseG1GC

• These flags are configurable via the java_args parameter on profile classes.
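
• As a point of reference, a sketch of where these flags end up on disk. The file name below is for RedHat-family systems (Debian-family systems use /etc/default/pe-puppetserver) and the values shown are purely illustrative:

  grep JAVA_ARGS /etc/sysconfig/pe-puppetserver
  # e.g. JAVA_ARGS="-Xms2g -Xmx2g"
  #
  # An illustrative target for a large master: a 12 GB heap with the G1 collector.
  # Make the change through the java_args profile parameter rather than by editing
  # this file, since PE manages it:
  # JAVA_ARGS="-Xms12g -Xmx12g -XX:+UseG1GC"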

Page 9

Puppet Server
It’s all about the JRubies.

Page 10

Puppet Server Metrics Overview

● JVM resource usage: status-service

  ● JMX namespace: java.lang:*

● HTTP request times per endpoint: pe-master

  ● JMX namespace: puppetserver:name=puppetlabs.<fqdn>.http.*

● Catalog compilation metrics: pe-puppet-profiler

  ● JMX namespaces:
    puppetserver:name=puppetlabs.<fqdn>.compiler.*
    puppetserver:name=puppetlabs.<fqdn>.functions.*
    puppetserver:name=puppetlabs.<fqdn>.puppetdb.*

● JRuby metrics: pe-jruby-metrics

  ● JMX namespace: puppetserver:name=puppetlabs.<fqdn>.jruby.*

Page 11

New PE 2016.4.0 Features

● The metrics/v1/mbeans endpoint has been added to Puppet Server. Must be enabled via Hiera: puppet_enterprise::master::puppetserver::metrics_webservice_enabled: true

● The Graphite metrics reporter has been optimized and extended:

● Only a subset of available metrics are reported by default.

● Reported metrics can be customized using the metrics_puppetserver_metrics_allowed parameter of the puppet_enterprise::profile::master class.

Page 12

JRuby Metrics

● Almost all Puppet Server requests must be handled by a JRuby instance — this makes JRuby availability the primary performance bottleneck.

● num-free-jrubies

  ● Measures spare capacity for incoming requests.

● average-wait-time

  ● Should never grow to a significant fraction of HTTP request times.

● Impacted by agent checkin distribution, resource availability, Puppet plugins and code.
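
● A minimal sketch of watching these two numbers from the curated status output. The JSON path under pe-jruby-metrics is an assumption that may shift between PE versions, and the jq utility is not installed by default:

  SSLDIR=/etc/puppetlabs/puppet/ssl
  HOST="$(puppet config print certname)"

  curl --silent \
       --cacert "$SSLDIR/certs/ca.pem" \
       --cert "$SSLDIR/certs/$HOST.pem" \
       --key "$SSLDIR/private_keys/$HOST.pem" \
       "https://$HOST:8140/status/v1/services?level=debug" |
    jq '."pe-jruby-metrics".status.experimental.metrics | {free: ."num-free-jrubies", wait: ."average-wait-time"}'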

Page 13

Agent Checkin Activity

● Agents check in one runinterval after the start of their last run — this can lead to pile-ups or “thundering herds”. Be careful of:

● Starting or re-starting a group of agents without the splay setting enabled.

● Triggering a group of agent runs via: mco puppet runonce

● Monitor average-requested-jrubies and Puppet Server access logs for spikes in agent activity.

● Use PostgreSQL to pull a histogram of agent start times from report data:

  sudo su - pe-postgres -s /bin/bash -c "psql -d pe-puppetdb"

  SELECT date_part('minute', start_time), count(*)
  FROM reports
  WHERE start_time BETWEEN '2016-10-20 13:30:00' AND '2016-10-20 14:30:00'
  GROUP BY date_part('minute', start_time)
  ORDER BY date_part('minute', start_time) ASC;

Page 14

Re-balancing Agent Checkins

● Use MCollective to orchestrate a batched re-start:

  su - peadmin -c "mco rpc service stop service=puppet"
  su - peadmin -c "mco rpc service start service=puppet --batch 1 \
    --batch-sleep <runinterval in seconds / #nodes>"

● Batching is not necessary if the agents have splay enabled.

● For a stable distribution that isn’t affected by re-starts, puppet agent -t can be run on a schedule determined by the fqdn_rand() function instead of using the service.

● Load due to agent activity can be cut dramatically by shifting to the Direct Puppet workflow where Orchestrator or MCollective are used to push catalog updates.
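
● A sketch of deriving the batch-sleep value used in the first bullet, assuming it runs on a monolithic master where PuppetDB listens on localhost:8080 and the jq utility is available. The arithmetic simply spreads the restarts across one full run interval:

  RUNINTERVAL="$(puppet config print runinterval)"   # agent run interval, in seconds
  NODES="$(curl -s http://localhost:8080/pdb/query/v4/nodes | jq 'length')"

  su - peadmin -c "mco rpc service start service=puppet --batch 1 --batch-sleep $(( RUNINTERVAL / NODES ))"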

Page 15

Adding More JRuby Capacity

● JRuby count is set via jruby_max_active_instances, constrained by available CPU and RAM:

● Compile masters tend to top out around NCPU - 1. Monolithic masters need to share with PuppetDB and tend more towards (NCPU / 2 - 1).

● RAM requirements are 512 MB per JRuby, but may need to be increased if catalog compilation uses large datasets or dozens of environments are in use.

● The environment_timeout setting can be used to reduce the CPU requirements of catalog compilation. Set to 0 globally and unlimited for long-lived environments with lots of agents.

● Each environment using an unlimited timeout will add to the per-JRuby RAM requirements. Monitor memory usage of pre-2016.4.0 installations closely when using unlimited timeouts.

● Code Manager should be enabled when an unlimited timeout is used so that caches are flushed when new code is deployed.
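
● A sketch of that split, with the global default checked via puppet config and a per-environment override in environment.conf. The environment name, path, and values are illustrative and assume the standard code directory:

  # Global default: 0, i.e. recompile the environment on every request
  puppet config print environment_timeout --section master

  # Long-lived environment used by many agents: cache until code is redeployed
  grep environment_timeout /etc/puppetlabs/code/environments/production/environment.conf
  # environment_timeout = unlimited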

Page 16

Investigating Compile Times

● PE Puppet Server tracks compilation time on several different levels: per-node, per-environment, per-resource, per-function, and more.

● Top 10 resources and functions are available via the status API and Puppet Server performance dashboard: https://<puppetmaster>:8140/puppet/experimental/dashboard.html

● Full access available through JMX and the metrics API.

● Detailed timing on catalog compilation can be obtained by setting the Puppet Server log level to DEBUG and running puppet agent -t --profile on nodes of interest.

Page 17

Investigating Agent Run Times

● Agent run summaries are stored at: /opt/puppetlabs/puppet/cache/state/last_run_summary.yaml

● Summaries are also stored by PuppetDB and can be viewed from the PE Console, or queried: reports[metrics] { latest_report? = true and certname = '<node name>' }

● The time section shows the amount of time taken per resource type, along with config_retrieval, which measures how long it took to receive a catalog.

● Per-resource timing can be logged by running: puppet agent -t --evaltrace
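
● The reports query above can also be run from the command line against PuppetDB's query API. A minimal sketch, assuming it runs on the PuppetDB host (plain-HTTP listener on localhost:8080) and using a hypothetical certname:

  curl -G http://localhost:8080/pdb/query/v4 \
    --data-urlencode 'query=reports[metrics] { latest_report? = true and certname = "agent01.example.com" }'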

Page 18

PuppetDB
Processing Time and Storage Space

Page 19

PuppetDB Storage Usage

● Monitor disk space!
  /opt/puppetlabs/server/data/postgresql/
  /opt/puppetlabs/server/data/puppetdb/

● If disk space runs out, there are two options for returning space to the operating system:

● The existing volume can be enlarged so that a VACUUM FULL can be run.

● Alternately, a new volume can be attached for a database backup and restore.

● The primary source of disk usage is report storage; this can be tuned via the report-ttl setting.

● For infrastructure with high node turnover, consider setting node-purge-ttl to remove data related to decommissioned nodes.
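
● Both TTLs live in PuppetDB's database configuration. A sketch of what that looks like on disk, with purely illustrative values; in PE, change these through the Console or Hiera so the settings persist across puppet runs:

  grep -E 'report-ttl|node-purge-ttl' /etc/puppetlabs/puppetdb/conf.d/database.ini
  # report-ttl = 14d
  # node-purge-ttl = 14d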

Page 20

PuppetDB Command Processing

● Every PuppetDB operation, aside from queries, is executed by an asynchronous command processing queue. This queue is managed by an internal ActiveMQ server:

  org.apache.activemq:type=Broker,brokerName=localhost,destinationType=Queue,destinationName=puppetlabs.puppetdb.commands

● Important metrics:

● Backlog of commands waiting for processing: QueueSize

● Largest command seen: MaxMessageSize

● Available memory for in-flight commands: MemoryPercentUsage

● Increase PuppetDB heap size along with the command-processing.memory-usage setting if the percentage spikes close to 100%. This will prevent ActiveMQ from paging commands to disk.
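
● The queue MBean named above can be read without JMX through PuppetDB's own metrics API. A sketch assuming the default localhost:8080 listener on the PuppetDB host; drop the jq pipe to see the full JSON:

  curl -s 'http://localhost:8080/metrics/v1/mbeans/org.apache.activemq:type=Broker,brokerName=localhost,destinationType=Queue,destinationName=puppetlabs.puppetdb.commands' |
    jq '{QueueSize, MaxMessageSize, MemoryPercentUsage}'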

Page 21

PuppetDB Command Processing

● Command processing rates:
  puppetlabs.puppetdb.mq:name=global.processing-time
  puppetlabs.puppetdb.storage:name=replace-facts-time
  puppetlabs.puppetdb.storage:name=replace-catalog-time
  puppetlabs.puppetdb.storage:name=store-report-time

● Additional processing threads can be added using the command-processing.threads setting.

● On a monolithic install, PuppetDB processing threads must be balanced against Puppet Server JRubies and the number of CPU cores available.
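
● The thread count lives in PuppetDB's config.ini. A quick sketch of checking the current value on disk; the number shown is illustrative, and in PE the setting should be managed via the Console or Hiera rather than edited directly:

  grep -A1 '\[command-processing\]' /etc/puppetlabs/puppetdb/conf.d/config.ini
  # [command-processing]
  # threads = 2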

Page 22

PostgreSQL Query Performance

● PostgreSQL configuration can be found in: /opt/puppetlabs/server/data/postgresql/9.4/data/postgresql.conf

● Add settings to improve logging around slow queries:
  log_min_duration_statement = 3000ms
  log_temp_files = 0

● If a temp file shows up in the logs, it means Postgres had to perform an operation outside of RAM, which is slow. Consider increasing the work_mem setting so that it is larger than the size of the temp files being logged.

● If query performance has been dropping over time, a database VACUUM may be needed:

  su - pe-postgres -s /bin/bash -c "vacuumdb --analyze --verbose --all"

Page 23

Resources

This Slide Deck: https://goo.gl/ytzCA5

Page 24

Resources

Logging:

• Directing Output: http://logback.qos.ch/manual/appenders.html

• Formatting Main Logs: http://logback.qos.ch/manual/layouts.html

• Formatting Access Logs: http://logback.qos.ch/manual/layouts.html#logback-access

JMX:

• Configuration: https://docs.oracle.com/javase/8/docs/technotes/guides/management/agent.html

• Metric Polling Tool: https://github.com/jmxtrans/jmxtrans

Page 25

Resources

Puppet Server:

• Metrics Reference: https://docs.puppet.com/pe/2016.4/puppet_server_metrics.html

• Configuration Reference: https://docs.puppet.com/puppetserver/2.6/configuration.html

• Direct Puppet Workflow: https://docs.puppet.com/pe/2016.4/direct_puppet_workflow.html

PuppetDB:

• Metrics Reference: https://docs.puppet.com/puppetdb/4.2/api/metrics/v1/mbeans.html

• Configuration Reference: https://docs.puppet.com/puppetdb/4.2/configure.html

• Backup Procedures: https://docs.puppet.com/pe/2016.4/maintain_console-db.html

• PostgreSQL Maintenance: https://github.com/npwalker/pe_databases
