Making Your Logs Work for You: Drupal Escalation and Disaster Recovery
When sh*t hits the fan, what do you do?
How to make your logs work for you in times of website sadness
DrupalCon LA 2015
Pantheon.io
Most common causes of downtime
30% Weather/Environment
33% IT/Equipment
34% Cyber Attack
The top cause of unplanned outages?
48% Human Error
52% of those surveyed believe all or most unplanned outages can be avoided.
What does downtime cost you and your business?
How does downtime affect your bottom line?
37%
22%
15%
10%
9%
7%
Cost associated with reputation and brand damage
Revenues lost because of system availability problems
Loss of user productivity and increased frustration
Cost associated with compliance or regulatory failure
Cost of forensics to determine the root causes of disruptions
Cost of technical support to restore systems to an operational state
Meet the Speaker
Timani Tunduwani
Customer Support Manager
PANTHEON RUNS 100,000 WEBSITES
WE DO THIS ALL THE TIME
1. Overview
2. Logging in Drupal & PHP
3. Incident Planning & Management
4. Live Demo
5. Questions
Agenda
Website is down. Why?
How do we get it back up?
Is the infrastructure down?
Is Drupal Sad and Broken again??
What’s going on?
Do I fix it or Pantheon?
WTF?! FIX IT?
What happens when your website is down?
EVERY SECOND YOU DON’T KNOW IS ANOTHER SECOND YOUR WEBSITE IS DOWN
What would you do if your website went down… right now?
Website Owner
Who do I contact?
YOU NEED A PLAN! SERIOUSLY!
Website Developer
Project Manager
Drupal Support & Maintenance Team
Application log management
<?php watchdog($type, $message, $variables = array(), $severity = WATCHDOG_NOTICE, $link = NULL); ?>
1. Standardize
2. Centralize
3. Aggregate
4. Analyze
5. Alert
A 5-step plan for success
1. Semi-arbitrary log format
2. Drupal 8 using PSR-3
3. Cannot have saved searches beyond sticky search
4. No reporting dashboard for post-mortems
5. No stack traces
6. Not portable. Have you tried to export the watchdog table?
Current limitations of watchdog
MariaDB [pantheon]> select wid, type, message, variables from watchdog limit 100 \G
*************************** 1. row ***************************
wid: 1830682
type: php
message: %type: !message in %function (line %line of %file).
variables: a:6:{s:5:"%type";s:6:"Notice";s:8:"!message";s:26:"Undefined index: authorize";s:9:"%function";s:40:"FeedsEntityProcessor->entitySaveAccess()";s:5:"%file";s:108:"/srv/bindings/aa7491e7ef954a8fb4f9dc41abccab80/code/sites/all/modules/feeds/plugins/FeedsEntityProcessor.inc";s:5:"%line";i:77;s:14:"severity_level";i:5;}
Drupal watchdog table
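The `variables` column in the row above is a PHP-serialized array of placeholder values for the `message` template, which is part of why the table is hard to read or export. A minimal sketch of decoding one row outside Drupal follows. It assumes plain `strtr()` substitution is good enough for reading logs; Drupal's own `t()` also escapes and wraps `%` and `@` placeholders, which this sketch skips. The helper name `render_watchdog_row()` is hypothetical, and the serialized value is rebuilt with `serialize()` only to keep the example short.

```php
<?php
// Hypothetical helper: turn one watchdog row into a readable line.
function render_watchdog_row(string $message, string $serialized): string {
    // Never instantiate objects from serialized data you did not write.
    $variables = unserialize($serialized, ['allowed_classes' => false]);
    if (!is_array($variables)) {
        return $message; // No placeholders stored for this row.
    }
    // Keep scalar values only and cast them to strings for strtr().
    $replacements = array_map('strval', array_filter($variables, 'is_scalar'));
    return strtr($message, $replacements);
}

// Values copied from the row above; a real script would read them from the table.
$message = '%type: !message in %function (line %line of %file).';
$serialized = serialize([
    '%type' => 'Notice',
    '!message' => 'Undefined index: authorize',
    '%function' => 'FeedsEntityProcessor->entitySaveAccess()',
    '%file' => '/srv/bindings/aa7491e7ef954a8fb4f9dc41abccab80/code/sites/all/modules/feeds/plugins/FeedsEntityProcessor.inc',
    '%line' => 77,
    'severity_level' => 5,
]);

echo render_watchdog_row($message, $serialized), PHP_EOL;
```

The rendered line starts with "Notice: Undefined index: authorize in FeedsEntityProcessor->entitySaveAccess()" and ends with the line number and file path.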
1. PHP Framework Interop Group (PHP-FIG)
2. Monolog
2.1. Chain of responsibility logging pattern
2.2. Core concepts
Overview
Proposing a Standards Recommendation (PSR)
❏ PSR-0: Autoloading Standard
❏ PSR-1: Basic Coding Standard
❏ PSR-2: Coding Style Guide
❏ PSR-3: Logger Interface
❏ PSR-4: Autoloader
PSR-3 : A common interface for logging libraries
The goal is to allow libraries to receive a Psr\Log\LoggerInterface object and write logs to it in a simple and universal way.
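To make that goal concrete, here is a rough sketch of a library class that only knows about a logger interface. So that it runs without Composer, it declares a small stand-in, `SketchLoggerInterface`, in place of the real `Psr\Log\LoggerInterface` (which also declares `debug()`, `info()`, `warning()` and the other level methods). `ArrayLogger` and `FeedImporter` are hypothetical names used only for illustration.

```php
<?php
// Stand-in for Psr\Log\LoggerInterface (assumption: only log() is needed here).
interface SketchLoggerInterface {
    public function log(string $level, string $message, array $context = []): void;
}

// Hypothetical logger that keeps rendered lines in memory.
final class ArrayLogger implements SketchLoggerInterface {
    public array $lines = [];

    public function log(string $level, string $message, array $context = []): void {
        // PSR-3 placeholders are written as {key} and filled from $context.
        $replace = [];
        foreach ($context as $key => $value) {
            if (is_scalar($value)) {
                $replace['{' . $key . '}'] = (string) $value;
            }
        }
        $this->lines[] = strtoupper($level) . ': ' . strtr($message, $replace);
    }
}

// Hypothetical library class: it depends on the interface, not on a backend.
final class FeedImporter {
    private SketchLoggerInterface $logger;

    public function __construct(SketchLoggerInterface $logger) {
        $this->logger = $logger;
    }

    public function import(string $url): void {
        $this->logger->log('notice', 'Importing feed {url}', ['url' => $url]);
    }
}

$logger = new ArrayLogger();
(new FeedImporter($logger))->import('https://example.com/feed.xml');
print_r($logger->lines);
```

Swapping `ArrayLogger` for any other implementation of the interface changes where the logs go without touching `FeedImporter`.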
Logging Levels - RFC 5424
Level      Code (Monolog)  Description
DEBUG      100             Detailed debug information.
INFO       200             Interesting events. Examples: User logs in, SQL logs.
NOTICE     250             Normal but significant events.
WARNING    300             Exceptional occurrences that are not errors.
ERROR      400             Runtime errors that do not require immediate action.
CRITICAL   500             Critical conditions.
ALERT      550             Action must be taken immediately.
EMERGENCY  600             Emergency: system is unusable.
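The codes in the table are Monolog's numbering for the RFC 5424 level names, while Drupal 7's watchdog severities use the RFC 5424 numbers 0 (emergency) through 7 (debug) directly. A small sketch of bridging the two follows. The constant and function names are hypothetical, and the fallback to ERROR for an unknown severity is an arbitrary choice for the example.

```php
<?php
// Drupal 7 watchdog severity (RFC 5424 number) => Monolog level code.
const WATCHDOG_TO_MONOLOG = [
    0 => 600, // WATCHDOG_EMERGENCY -> EMERGENCY
    1 => 550, // WATCHDOG_ALERT     -> ALERT
    2 => 500, // WATCHDOG_CRITICAL  -> CRITICAL
    3 => 400, // WATCHDOG_ERROR     -> ERROR
    4 => 300, // WATCHDOG_WARNING   -> WARNING
    5 => 250, // WATCHDOG_NOTICE    -> NOTICE
    6 => 200, // WATCHDOG_INFO      -> INFO
    7 => 100, // WATCHDOG_DEBUG     -> DEBUG
];

// Hypothetical helper; unknown severities fall back to ERROR (assumption).
function monolog_code_for(int $severity): int {
    return WATCHDOG_TO_MONOLOG[$severity] ?? 400;
}

// The watchdog row shown earlier has severity_level 5, a notice.
echo monolog_code_for(5), PHP_EOL; // prints 250
```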
Monolog + Composer + Drupal
Monolog sends your logs to files, sockets, inboxes, databases and various web services
Chain of responsibility pattern
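As a rough illustration of the pattern, the sketch below offers each log record to a list of handlers in order; a handler that accepts the record can either let it continue down the chain or stop it. This is not Monolog's API, only the same idea in miniature: `SketchHandler`, `dispatch()` and the `bubble` flag as written here are hypothetical, and the level numbers follow the codes in the levels table.

```php
<?php
// Hypothetical handler: accepts records at or above a minimum level.
final class SketchHandler {
    public array $received = [];
    private int $minLevel;
    private bool $bubble;

    public function __construct(int $minLevel, bool $bubble = true) {
        $this->minLevel = $minLevel;
        $this->bubble = $bubble;
    }

    // Returns true when the record should stop travelling down the chain.
    public function handle(int $level, string $message): bool {
        if ($level < $this->minLevel) {
            return false; // Not interested; pass it on.
        }
        $this->received[] = $message;
        return !$this->bubble;
    }
}

// Offer a record to each handler in order until one claims it exclusively.
function dispatch(array $handlers, int $level, string $message): void {
    foreach ($handlers as $handler) {
        if ($handler->handle($level, $message)) {
            break;
        }
    }
}

$pager = new SketchHandler(550, false); // ALERT and above, stops the chain
$file = new SketchHandler(100);         // everything that reaches it

dispatch([$pager, $file], 600, 'Database unreachable');
dispatch([$pager, $file], 250, 'Cache rebuilt');

print_r($pager->received); // only the emergency
print_r($file->received);  // only the notice
```

Ordering the handlers from most to least urgent keeps pages for real emergencies while routine messages fall through to cheaper storage.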
Core Concepts
1. Logger
2. Handler
3. Log Levels
4. Formatter
5. Processor
6. Utilities
Log system overview
Application Performance Monitoring
Centralizing application logs
Monolog
• On-Call Scheduling
• Auto-Escalation
• International Reach
• Collaboration
• Advanced Analytics
Features
• Reliability
• Monitoring Aggregation
• Easy Setup
• Effective Alerting
• Full stack visibility
Cloud-based centralized log management
Centralizing application logs
Monolog
• Built-in alerting
• Customized dashboards
• Persistent workspaces
• Multiple integrations available
• Advanced Analytics
• Overage protection
• Agentless log collection
• Centralized logging
• Supports multiple log formats
• Automated event parsing
• Powerful search capabilities
• Unlimited saved searches
Features
Incident Management
Incident Response Goals
1. Verify that an incident occurred.
2. Maintain or restore business continuity.
3. Reduce the incident impact.
4. Determine how the attack was done or the incident happened.
5. Prevent future attacks or incidents.
6. Improve security and incident response.
7. Prosecute illegal activity.
8. Keep management informed of the situation & response.
Incident planning
Step 1: Form a Collaborative Planning Team
Step 2: Understand the Situation
Step 3: Determine Goals and Objectives
Step 4: Plan Development
Step 5: Plan Prep, Review & Approval
Step 6: Plan Implementation & Maintenance
Incident management system
IT incident management platform
1. Reliability
2. Monitoring Aggregation
3. Easy Setup
4. Effective Alerting
5. Mobile Incident Management
6. Escalation Policies

1. On-Call Scheduling
2. Auto-Escalation
3. International Reach
4. Collaboration
5. Advanced Analytics
Features
Slack HQ communication platform
Live Demo #1: Time to break something. YAY!
Fin!