Practical Monitoring Techniques
Today's Talk
● Our Mission
● Current Tools
● Increasing Coverage
● PD Schedules
● Automatic Self Healing
● Bots And Alerts channels
● Events Dashboard
● Dashboard Accessibility
● Best Practices Summary
Our Mission
Back up culture with the proper tools to support it
Current Tools
● Metrics collection: Collectd, statsd, CloudWatch
● Monitoring: Sensu, NewRelic
● Alert channels: PagerDuty, emails, Slack
● Dashboards: Grafana, CloudWatch, NewRelic
● Application testing: E2E Testing System
● Internal tools: Sensu mobile, events system, Sensu bar and more
Increasing Coverage
● Automatic collection of basic system and 3rd party metrics for new instances
● Alerts are added automatically for new instances of an existing subscriber
● Each developer / DevOps engineer is responsible for monitoring their own application / infrastructure
● Easy method to add new alerts and dashboards
● Automatic events flow
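
A minimal sketch of the "alerts added automatically" idea, assuming a hypothetical table of check templates keyed by subscription name. `CHECK_TEMPLATES`, `Check`, and `checks_for_instance` are illustrative names, not part of Sensu or of the tooling described in the talk.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical check templates keyed by subscription name (illustrative only).
CHECK_TEMPLATES = {
    "base": ["cpu", "memory", "disk"],
    "webserver": ["http_5xx_rate", "response_time"],
}


@dataclass(frozen=True)
class Check:
    instance: str
    name: str


def checks_for_instance(instance: str, subscriptions: list[str]) -> list[Check]:
    """Return the checks a new instance inherits from its subscriptions."""
    checks = []
    # Every instance gets the base metrics; dict.fromkeys drops duplicates
    # while keeping the order of the subscriptions.
    for sub in dict.fromkeys(["base", *subscriptions]):
        for name in CHECK_TEMPLATES.get(sub, []):
            checks.append(Check(instance, name))
    return checks
```

With this shape, registering a new instance under an existing subscription is enough to give it the same alerts as its peers, and unknown subscriptions fall back to the base checks only.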
Pager Schedules
● Divided into logical groups of ownership
● Each schedule has an escalation point
● The on call should be able to connect and respond to issues in their area
● Easy method to override a schedule
● Ability to contact the relevant on call
● Ability to page the relevant on call
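
One way to picture a schedule with an escalation point and an override is the sketch below. The `Schedule` class and its fields are hypothetical and do not mirror the PagerDuty data model.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Schedule:
    group: str                   # logical ownership group, e.g. "database"
    primary: str                 # person scheduled to be on call
    escalation: str              # paged next if the on call does not ack
    override: str | None = None  # temporary replacement for the primary

    def on_call(self) -> str:
        """The override wins over the scheduled primary when one is set."""
        return self.override or self.primary

    def page_order(self) -> list[str]:
        """People to page, in order, until someone acknowledges."""
        return [self.on_call(), self.escalation]
```

For example, `Schedule("database", "alice", "bob").page_order()` gives `["alice", "bob"]`; after setting `override = "carol"` it gives `["carol", "bob"]`.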
Automatic Self Healing
● Better MTTR
● Avoid waking the on call if possible
● Log activity to surface recurring issues
● Limit the healing to avoid restart loops
● Make sure to sync Healer Alert↔
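
The "limit the healing" and "log activity" points can be sketched as a healer that allows only a fixed number of attempts inside a sliding time window. This is an assumption about how such a limiter could look, not the healer used in the talk; `Healer`, `max_attempts`, and `window_seconds` are illustrative names.

```python
from __future__ import annotations

import logging
import time
from collections import deque
from typing import Callable

log = logging.getLogger("healer")


class Healer:
    """Run a remediation action at most `max_attempts` times per window."""

    def __init__(self, action: Callable[[], None], max_attempts: int = 3,
                 window_seconds: float = 600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.action = action
        self.max_attempts = max_attempts
        self.window_seconds = window_seconds
        self.clock = clock
        self.attempts: deque[float] = deque()  # timestamps of recent attempts

    def heal(self, alert_name: str) -> bool:
        """Return True if healing ran, False if the limit was hit."""
        now = self.clock()
        # Forget attempts that have fallen out of the sliding window.
        while self.attempts and now - self.attempts[0] >= self.window_seconds:
            self.attempts.popleft()
        if len(self.attempts) >= self.max_attempts:
            # Stop here instead of looping on restarts; a human should look.
            log.warning("healing limit reached for %s; escalating", alert_name)
            return False
        self.attempts.append(now)
        log.info("healing %s (attempt %d in window)", alert_name, len(self.attempts))
        self.action()
        return True
```

A `False` return is the signal to fall back to paging the on call, and the log lines give a record from which recurring issues can be spotted.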
Bots, Integrations and Alerts Channels
● Alert channels: emails, Slack, PD mobile, SMS, calls
● Integrations: Sensu to PD/Slack, CloudWatch to PD, 3rd party (e.g. CouchBase, NewRelic) to PD
● Slack Bot:
Events Dashboard
● Simple REST API for sending events
● Clean timeline view to spot production events
● Connections between events (“depends on” and “dependents”)
● Detailed view for each event
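
Sending an event over a simple REST API could look roughly like the sketch below. The endpoint URL and the payload fields (`name`, `description`, `depends_on`) are assumptions, since the talk does not give the real schema. The function only builds the request, so nothing is sent until it is passed to `urllib.request.urlopen`.

```python
from __future__ import annotations

import json
import urllib.request

# Hypothetical endpoint; the real URL and schema are not given in the talk.
EVENTS_URL = "https://events.example.com/api/events"


def build_event_request(name: str, description: str,
                        depends_on: list[str] | None = None) -> urllib.request.Request:
    """Build a POST request describing one production event."""
    payload = {
        "name": name,
        "description": description,
        # Names of events this one depends on, used to link events together.
        "depends_on": depends_on or [],
    }
    return urllib.request.Request(
        EVENTS_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```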
Accessibility
● Available from everywhere by mobile
● Easy to ack, resolve, and mute alerts
● Slack bots to reach help
● Automatically get a graph with the alert
● Ability to search, edit, copy, etc. alerts
● Treat alerts management as code (SVC, DB, backups, etc.)
Best Practices Summary
● Share the pain
● Automate base metrics
● Automate healing
● Make help reachable
● Make it easy to add alerts and dashboards
● Use warning levels as soft events to avoid phone calls at night
● Automate graphs in alerts
● Positive alerting system check each day
● Dependencies between alerts
● Postmortems
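
The "warning levels as soft events" practice amounts to routing by severity, so that only critical alerts reach the pager. A minimal sketch, assuming two severity levels and hypothetical channel names:

```python
from __future__ import annotations

# Hypothetical routing table; adjust to the integrations actually in use.
ROUTES = {
    "warning": ["slack", "email"],                # soft event, nobody is called
    "critical": ["slack", "email", "pagerduty"],  # pages the on call
}


def channels_for(severity: str) -> list[str]:
    """Return the channels an alert of this severity is sent to."""
    try:
        return list(ROUTES[severity])
    except KeyError:
        raise ValueError(f"unknown severity: {severity!r}") from None
```

Rejecting unknown severities, rather than silently dropping them, keeps a mistyped level from hiding an alert.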
Questions