Server administration in Python with Fabric, Cuisine and Watchdog
ffunctioninc.
Fabric, Cuisine & Watchdog
Sébastien Pierre, ffunction inc. @ Montréal Python, February 2011
www.ffctn.com

How to use Python for Server Administration
Thanks to Fabric, Cuisine* & Watchdog*
* custom tools

The era of dedicated servers
WEB SERVER, DATABASE SERVER, EMAIL SERVER
Hosted in your server room or in colocation
Sysadmins typically SSH and configure the servers live
The servers are conservatively managed, updates are risky

The era of slices/VPS
[Diagram: numbered slices (SLICE 1, SLICE 6, SLICE 9, SLICE 10, SLICE 11, ...) hosted at Linode.com and Amazon EC2, plus DEDICATED SERVER 1 and DEDICATED SERVER 2 at IWeb.com]
We now have multiple small virtual servers (slices/VPS)
Often located in different data-centers
...and sometimes with different providers
We even sometimes still have physical, dedicated servers

The challenge
ORDER SERVER, then SETUP SERVER, then DEPLOY APPLICATION
SETUP SERVER: create users and groups, customize config files, install base packages
DEPLOY APPLICATION: install app-specific packages, deploy the application, start services
MAKE THIS PROCESS AS FAST (AND SIMPLE) AS POSSIBLE
Quickly integrate your new server in the existing architecture

Today's menu
FABRIC: Interact with your remote machines as if they were local
CUISINE: Takes care of users, groups, packages and configuration of your new machine
WATCHDOG: Ensures that your servers and services are up and running

Part 1
Fabric - http://fabfile.org
application deployment & systems administration tasks

Fabric is a Python library and command-line tool for streamlining the use of SSH for application deployment or systems administration tasks.
Wait... what does that mean?

Streamlining SSH
By hand:
    import os
    version = os.popen("ssh myserver 'cat /proc/version'").read()
Using Fabric:
    from fabric.api import *
    env.hosts = ["myserver"]
    version = run("cat /proc/version")
You can specify multiple hosts and run the same commands across them
Connections will be lazily created and pooled
Failures ($STATUS) will be handled just like in Make
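
The Make-like failure handling can be pictured with a small local sketch. It uses only the Python 3 standard library and runs commands on the local machine through subprocess instead of over SSH; run_local and CommandFailed are hypothetical names chosen for illustration, not part of Fabric.

```python
import subprocess


class CommandFailed(Exception):
    """Raised when a command exits with a non-zero status (hypothetical name)."""


def run_local(command):
    """Run a shell command locally and return its stripped standard output.

    Local stand-in for a remote run(): as with a Make rule, a non-zero
    exit status stops the sequence of commands by raising an exception.
    """
    proc = subprocess.run(command, shell=True, capture_output=True, text=True)
    if proc.returncode != 0:
        raise CommandFailed("%r exited with status %d" % (command, proc.returncode))
    return proc.stdout.strip()


print(run_local("echo hello"))
try:
    run_local("exit 3")
    print("never reached")
except CommandFailed as error:
    print("aborted:", error)
```

Because the failure surfaces as an exception, the commands that follow a failed one are skipped unless the caller decides otherwise.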

Example: Installing packages
    sudo("aptitude install nginx")

    if run("dpkg -s %s | grep 'Status:' ; true" % package).find("installed") == -1:
        sudo("aptitude install '%s'" % (package))
It's easy to take action depending on the result
Note that we add true so that the run() always succeeds*
* there are other ways...
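
One way to make the check on the dpkg output a little stricter is to look at the last word of the Status: line instead of searching for the substring "installed", which also matches states such as half-installed. The helper below is a hypothetical sketch that works on text that was already retrieved; the sample status lines are illustrative.

```python
def is_installed(dpkg_status_output):
    """Return True if a `dpkg -s` status line reports an installed package."""
    for line in dpkg_status_output.splitlines():
        if line.startswith("Status:"):
            # The status line holds three words (want, error flag, state);
            # the package is usable only when the state is "installed".
            return line.split()[-1] == "installed"
    # No status line at all: dpkg does not know the package.
    return False


for sample in (
    "Status: install ok installed",
    "Status: install ok half-installed",
    "Status: deinstall ok config-files",
    "",
):
    print(repr(sample), is_installed(sample))
```

In a fabfile, the string passed to is_installed would be the result of the run() call shown on the slide.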

Example: retrieving system status
    disk_usage = run("df -kP")
    mem_usage = run("cat /proc/meminfo")
    cpu_usage = run("cat /proc/stat")
    print disk_usage, mem_usage, cpu_usage
Very useful for getting live information from many different servers
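
Once the raw text is back on the local side, it can be turned into numbers. As a sketch, the hypothetical helper parse_df below converts df -kP output into a mapping from mount point to usage ratio; the sample text is made up for illustration and follows the POSIX column layout that the -P flag requests.

```python
def parse_df(output):
    """Map each mount point to its usage ratio (0.0 to 1.0) from `df -kP` output."""
    usage = {}
    for line in output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        # Columns: filesystem, 1024-blocks, used, available, capacity, mount point
        mount_point = " ".join(fields[5:])
        usage[mount_point] = int(fields[4].rstrip("%")) / 100.0
    return usage


# Illustrative output only; real values depend on the server.
SAMPLE = """\
Filesystem     1024-blocks    Used Available Capacity Mounted on
/dev/sda1         20511356 8204544  11241852      43% /
tmpfs               509092       0    509092       0% /dev/shm
"""

print(parse_df(SAMPLE))
```

The same function could be applied to the disk_usage string collected from each host.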

Fabfile.py
    from fabric.api import *
    from mysetup import *

    env.hosts = ["server1.myapp.com"]

    def setup():
        install_packages("...")
        update_configuration()
        create_users()
        start_daemons()

    $ fab setup
Just like Make, you write rules that do something
...and you can specify on which servers the rules will run

Multiple hosts
    @hosts("db1.myapp")
    def backup_db():
        run(...)

    env.hosts = [
        "db1.myapp.com",
        "db2.myapp.com",
        "db3.myapp.com"
    ]

Roles
    env.roledefs = {
        'web': ['www1', 'www2', 'www3'],
        'dns': ['ns1', 'ns2']
    }

    $ fab -R web setup
Will run the setup rule only on hosts that are members of the web role.
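
The idea behind role selection is a plain dictionary lookup. The sketch below is not Fabric's implementation; hosts_for_roles is a hypothetical helper that only illustrates how a list of role names maps to a list of hosts.

```python
def hosts_for_roles(roledefs, roles):
    """Return the hosts belonging to the given roles, in order, without duplicates."""
    hosts = []
    for role in roles:
        for host in roledefs[role]:
            if host not in hosts:
                hosts.append(host)
    return hosts


roledefs = {
    "web": ["www1", "www2", "www3"],
    "dns": ["ns1", "ns2"],
}

print(hosts_for_roles(roledefs, ["web"]))
print(hosts_for_roles(roledefs, ["dns", "web"]))
```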

Some facts about Fabric
Fabric 1.0 just released! On March 4th, 2011
3 years of development: first commit 1161 days ago (on March 10th, 2011)
Related projects: Opscode's Chef, and Puppet

What's good about Fabric?
Low-level: basically an ssh() command that returns the result
Simple primitives: run(), sudo(), get(), put(), local(), prompt(), reboot()
No magic: no DSL, no abstraction, just a remote command API

What could be improved?
Ease common admin tasks: user and group creation, file and directory operations
Abstract primitives: like install package, so that it works with different OSes
Templates: to make creating/updating configuration files easy

What is Opscode's Chef?
Recipes: scripts/packages to install and configure services and applications
API: a DSL-like Ruby API to interact with the OS (create users, groups, install packages, etc.)
Architecture: client-server or “solo” mode to push and deploy your new configurations
http://wiki.opscode.com/display/chef/Home

What I liked about Chef
Flexible: you can use the API or shell commands
Structured: helped me have a clear decomposition of the services installed per machine
Community: lots of recipes already available from http://cookbooks.opscode.com/

What I didn't like
Too many files and directories: code is spread out, hard to get the big picture
Abstraction overload: API not very well documented, frequent fallbacks to plain shell scripts within the recipe
No “smart” recipe: recipes are applied all the time, even when it's not necessary

The question that kept coming...
Django recipe: 5 files, 2 directories
What it does, in essence:
    sudo aptitude install apache2 python python-django
Is this really necessary for what I want to do?

What I loved about Fabric
Bare metal: ssh() function, simple and elegant set of primitives
No magic: no abstraction, no model, no compilation
Two-way communication: easy to change the rule's behaviour according to the output (ex: do not install something that's already installed)

What I needed
Fabric, plus:
File I/O
Package Management
User/Group Management
Text processing & Templates

How I wanted it
Simple “flat” API: [object]_[operation] where operation is something in “create”, “read”, “update”, “write”, “remove”, “ensure”, etc.
Driven by need: only implement a feature if I have a real need for it
No magic: everything is implemented using sh-compatible commands
No unnecessary structure: everything fits in one file, no imposed file layout

Cuisine: Example fabfile.py
    from cuisine import *

    env.hosts = ["server1.myapp.com"]

    def setup():
        package_ensure("python", "apache2", "python-django")
        user_ensure("admin", uid=2000)
        upstart_ensure("django")

    $ fab setup
Fabric's core functions are already imported
package_ensure, user_ensure and upstart_ensure are Cuisine's API calls

Cuisine: File I/O
● file_exists: does the remote file exist?
● file_read: reads the remote file
● file_write: writes data to the remote file
● file_append: appends data to the remote file
● file_attribs: chmod & chown
● file_remove
Supports owner/group and mode change

Cuisine: File I/O (directories)
● dir_exists: does the remote directory exist?
● dir_ensure: ensures that a directory exists
● dir_attribs: chmod & chown
● dir_remove

Cuisine: File I/O +
● file_update(location, updater=lambda _:_)
    package_ensure("mongodb-snapshot")
    def update_configuration( text ):
        res = []
        for line in text.split("\n"):
            if line.strip().startswith("dbpath="):
                res.append("dbpath=/data/mongodb")
            elif line.strip().startswith("logpath="):
                res.append("logpath=/data/logs/mongodb.log")
            else:
                res.append(line)
        return "\n".join(res)
    file_update("/etc/mongodb.conf", update_configuration)
This replaces the values for configuration entries dbpath and logpath
The remote file will only be changed if the content is different
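
The "only changed if the content is different" behaviour can be sketched on a local file. This is not Cuisine's code: file_update_local and set_dbpath are hypothetical names, and the sketch reads and writes a local temporary file instead of a remote one.

```python
import os
import tempfile


def file_update_local(path, updater=lambda text: text):
    """Apply updater to the file's text and write it back only when it changed.

    Returns True when the file was rewritten, False when it was left alone.
    """
    with open(path) as f:
        old = f.read()
    new = updater(old)
    if new == old:
        return False
    with open(path, "w") as f:
        f.write(new)
    return True


def set_dbpath(text):
    """Point any dbpath= entry at /data/mongodb, keeping other lines as they are."""
    return "\n".join(
        "dbpath=/data/mongodb" if line.strip().startswith("dbpath=") else line
        for line in text.split("\n")
    )


with tempfile.NamedTemporaryFile("w", suffix=".conf", delete=False) as conf:
    conf.write("dbpath=/var/lib/mongodb\nport=27017\n")
print(file_update_local(conf.name, set_dbpath))  # first pass rewrites the file
print(file_update_local(conf.name, set_dbpath))  # second pass has nothing to change
os.remove(conf.name)
```

Running the same rule twice is therefore harmless, which is what makes it safe to re-run a whole setup.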

Cuisine: User Management
● user_exists: does the user exist?
● user_create: creates the user
● user_ensure: creates the user if it doesn't exist

Cuisine: Group Management
● group_exists: does the group exist?
● group_create: creates the group
● group_ensure: creates the group if it doesn't exist
● group_user_exists: does the user belong to the group?
● group_user_add: adds the user to the group
● group_user_ensure

Cuisine: Package Management
● package_exists: is the package available?
● package_installed: is it installed?
● package_install: installs the package
● package_ensure: ... only if it's not installed
● package_upgrade: upgrades the/all package(s)

Cuisine: Text transformation
text_ensure_line(text, lines)
    file_update("/home/user/.profile", lambda _: text_ensure_line(_,
        "PYTHONPATH=/opt/lib/python:${PYTHONPATH};"
        "export PYTHONPATH"
    ))
Ensures that the PYTHONPATH variable is set and exported. If not, these lines will be appended.
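
The "append only if missing" semantics can be sketched in a few lines. ensure_lines is a hypothetical stand-in written for this illustration; Cuisine's own text_ensure_line may compare or append lines differently.

```python
def ensure_lines(text, *lines):
    """Append each given line to text unless an identical line is already there."""
    existing = text.split("\n")
    missing = [line for line in lines if line not in existing]
    if not missing:
        return text
    # Make sure the appended lines start on a line of their own.
    separator = "" if text == "" or text.endswith("\n") else "\n"
    return text + separator + "\n".join(missing) + "\n"


profile = "export EDITOR=vi\n"
updated = ensure_lines(profile, "export PYTHONPATH=/opt/lib/python")
print(updated)
# Applying the same call again leaves the text untouched.
print(ensure_lines(updated, "export PYTHONPATH=/opt/lib/python") == updated)
```

Combined with a file update that writes only on change, this keeps repeated runs from piling up duplicate lines in the profile.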

Cuisine: Text transformation
text_replace_line(text, old, new, find=.., process=...)
    configuration = local_read("server.conf")
    for key, value in variables.items():
        configuration, replaced = text_replace_line(
            configuration,
            key + "=",
            key + "=" + repr(value),
            process=lambda text: text.split("=")[0].strip()
        )
Replaces lines that look like VARIABLE=VALUE with the actual values from the variables dictionary.
The process lambda transforms input lines before comparing them. Here the lines are stripped of spaces and of their value.
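
To see what the process argument buys, here is a minimal sketch of a line replacement with a normalizing function. replace_line and key_of are hypothetical names, and applying process to the old pattern as well as to each line is an assumption made for this sketch, not a statement about Cuisine's internals.

```python
def replace_line(text, old, new, process=lambda line: line):
    """Replace every line whose processed form equals the processed `old` with `new`.

    Returns a tuple (new_text, number_of_replaced_lines).
    """
    target = process(old)
    result, replaced = [], 0
    for line in text.split("\n"):
        if process(line) == target:
            result.append(new)
            replaced += 1
        else:
            result.append(line)
    return "\n".join(result), replaced


def key_of(line):
    """Keep only what precedes the first '=' sign, without surrounding spaces."""
    return line.split("=")[0].strip()


configuration = "port = 8000\nhost=localhost"
variables = {"port": 9000}
for key, value in variables.items():
    configuration, replaced = replace_line(
        configuration, key + "=", key + "=" + repr(value), process=key_of
    )
print(configuration)
print(replaced)
```

Thanks to key_of, "port = 8000" and "port=" compare equal, so the spacing and the old value of the original line do not matter.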

Cuisine: Text transformation
text_strip_margin(text)
    file_write(".profile", text_strip_margin(
        """
        |export PATH="$HOME/bin":$PATH
        |set -o vi
        """
    ))
Everything after the | separator will be output as content.
It makes it easy to embed text templates within functions.
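
A margin stripper of this kind fits in a few lines. The function below is a hypothetical re-implementation for illustration, assuming that lines without the | marker are dropped; Cuisine's text_strip_margin may treat such lines differently.

```python
def strip_margin(text, margin="|"):
    """Keep, for each line that contains the margin marker, the text after it."""
    kept = []
    for line in text.split("\n"):
        before, found, after = line.partition(margin)
        if found:
            kept.append(after)
    return "\n".join(kept)


template = """
    |export PATH="$HOME/bin":$PATH
    |set -o vi
"""
print(strip_margin(template))
```

Only the first | on a line is treated as the margin, so shell pipes later in the line are preserved.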

Cuisine: Text transformation
text_template(text, variables)
    text_template(text_strip_margin(
        """
        |cd ${DAEMON_PATH}
        |exec ${DAEMON_EXEC_PATH}
        """
    ), dict(
        DAEMON_PATH="/opt/mongodb",
        DAEMON_EXEC_PATH="/opt/mongodb/mongod"
    ))
This is a simple wrapper around Python's (safe) string.Template substitution
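
For reference, this is what the standard library's string.Template does on its own. The ${PORT} placeholder is an addition made for this example, to show that safe_substitute leaves unknown variables in place instead of raising an error.

```python
from string import Template

job = Template("cd ${DAEMON_PATH}\nexec ${DAEMON_EXEC_PATH} --port ${PORT}")

# safe_substitute fills in the variables it knows about and keeps the rest.
rendered = job.safe_substitute(
    DAEMON_PATH="/opt/mongodb",
    DAEMON_EXEC_PATH="/opt/mongodb/mongod",
)
print(rendered)
```

The stricter substitute() method would raise a KeyError for the missing PORT value, which can be preferable when a half-filled configuration file would be worse than a failed run.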

Cuisine: Goodies
● ssh_keygen: generates DSA keys
● ssh_authorize: authorizes your key on the remote server
● mode_sudo: run() always uses sudo
● upstart_ensure: ensures the given daemon is running
& more!

Cuisine Tips: Structuring your rules
BOOTSTRAP, SETUP, UPDATE

BOOTSTRAP: You just received your new VPS, and you want to set it up so that you have a base system that you can access without typing a password
    def bootstrap():
        # Secure SSH, create admin user
        # Authorize SSH public keys
        # Remove unwanted packages

SETUP: You install your users, groups, preferred packages and configuration. You also install your applications.
    def setup():
        # Create directories (ex: /opt/data, /opt/services, etc)
        # Create user/groups (ex: apps, services, etc)
        # Install base tools (ex: screen, fail2ban, zsh, etc)
        # Edit configuration (ex: profile, inputrc, etc)
        # Install and run your application

UPDATE: You want to deploy the new version of the application you just built
    def update():
        # Download your application update
        # Freeze/stop the running application
        # Install the update
        # Reload/restart your application
        # Test that everything is OK

Why use Cuisine?
● Simple API for remote-server manipulation: files, users, groups, packages
● Shell commands for specific tasks only: avoid problems with your shell commands by only using run() for very specific tasks
● Cuisine tasks are not stupid: *_ensure() commands won't do anything if it's not necessary

Limitations
● Limited to sh-shells: operations will not work under csh
● Only written/tested for Ubuntu Linux: contributors could easily port commands

Get started!
On Github: http://github.com/sebastien/cuisine
1 short Python file
Documented API

(Some of the) existing solutions
Monit, God, Supervisord, Upstart: focus on starting/restarting daemons and services
Munin, Cacti: focus on visualization of RRDTool data
Collectd: focus on collecting and publishing data

The ideal tool
Wide spectrum: data collection, service monitoring, actions
Easy setup and deployment: no complex installation or configuration
Flexible server architecture: can monitor local or remote processes
Customizable and extensible: from restarting daemons to monitoring whole servers

Hello, Watchdog!
SERVICE: a service is a collection of RULES
RULE: HTTP request; CPU, disk, mem %; process status; I/O bandwidth
Each rule retrieves data and processes it. Rules can SUCCEED or FAIL
ACTION: logging; XMPP, email notifications; start/stop process; ...
Actions are bound to a rule, triggered on rule SUCCESS or FAILURE

Execution Model
[Diagram: a MONITOR runs each RULE (frequency in ms) of a SERVICE DEFINITION; each rule leads to ACTIONs on SUCCESS or FAILURE]
Services are registered in the monitor
Rules defined in the service are executed every N ms (frequency)
If the rule SUCCEEDS, actions will be sequentially executed
If the rule FAILS, failure actions will be sequentially executed
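
The execution model can be sketched without Watchdog itself. The Rule class below is a hypothetical, simplified stand-in: it assumes that a check signals failure by raising an exception, and it leaves out the periodic scheduling that the monitor performs.

```python
class Rule:
    """A check plus the actions to run on success and on failure (hypothetical)."""

    def __init__(self, check, success=(), fail=()):
        self.check = check            # callable returning a value, raising on failure
        self.success = list(success)  # actions called with the result
        self.fail = list(fail)        # actions called with the error

    def run_once(self):
        """Run the check, then run the matching actions one after the other."""
        try:
            result = self.check()
        except Exception as error:
            for action in self.fail:
                action(error)
            return False
        for action in self.success:
            action(result)
        return True


def unreachable():
    raise IOError("timed out")


log = []
up = Rule(lambda: 200, success=[log.append])
down = Rule(unreachable, fail=[lambda e: log.append("failed: %s" % e),
                               lambda e: log.append("notified")])
print(up.run_once(), down.run_once())
print(log)
```

A monitor would then call run_once() on every registered rule each time the rule's frequency has elapsed.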

Monitoring a remote machine
    #!/usr/bin/env python
    from watchdog import *
    Monitor(
        Service(
            name = "google-search-latency",
            monitor = (
                HTTP(
                    GET="http://www.google.ca/search?q=watchdog",
                    freq=Time.s(1),
                    timeout=Time.ms(80),
                    fail=[
                        Print("Google search query took more than 50ms")
                    ]
                )
            )
        )
    ).run()
A monitor is like the “main” for Watchdog. It actively monitors services.
Don't forget to call run() on it
The service monitors the rules
The HTTP rule lets you test a URL
And we display a message in case of failure
If there is a 4XX or it times out, the rule will fail and display an error message

Monitoring a remote machine
    $ python example-service-monitoring.py
    2011-02-27T22:33:18 watchdog --- #0 (runners=1,threads=2,duration=0.57s)
    2011-02-27T22:33:18 watchdog [!] Failure on HTTP(GET="www.google.ca:80/search?q=watchdog",timeout=0.08) : Socket error: timed out
    Google search query took more than 50ms
    2011-02-27T22:33:19 watchdog --- #1 (runners=1,threads=2,duration=0.73s)
    2011-02-27T22:33:20 watchdog --- #2 (runners=1,threads=2,duration=0.54s)
    2011-02-27T22:33:21 watchdog --- #3 (runners=1,threads=2,duration=0.69s)
    2011-02-27T22:33:22 watchdog --- #4 (runners=1,threads=2,duration=0.77s)
    2011-02-27T22:33:23 watchdog --- #5 (runners=1,threads=2,duration=0.70s)

Sending Email Notification
    send_email = Email(
        "[email protected]",
        "[Watchdog]Google Search Latency Error", "Latency was over 80ms",
        "smtp.gmail.com", "myusername", "mypassword"
    )
    […]
    HTTP(
        GET="http://www.google.ca/search?q=watchdog",
        freq=Time.s(1),
        timeout=Time.ms(80),
        fail=[
            send_email
        ]
    )
The Email action will send an email to [email protected] when triggered
This is how we bind the action to the rule failure

Sending Email+Jabber Notification
    send_xmpp = XMPP(
        "[email protected]",
        "Watchdog: Google search latency over 80ms",
        "[email protected]", "mypassword"
    )
    […]
    HTTP(
        GET="http://www.google.ca/search?q=watchdog",
        freq=Time.s(1),
        timeout=Time.ms(80),
        fail=[
            send_email, send_xmpp
        ]
    )

Monitoring incident: when something fails repeatedly during a given period of time
You don't want to be notified all the time, only when it really matters.

Detecting incidents
    HTTP(
        GET="http://www.google.ca/search?q=watchdog",
        freq=Time.s(1),
        timeout=Time.ms(80),
        fail=[
            Incident(
                errors = 5,
                during = Time.s(10),
                actions = [send_email,send_xmpp]
            )
        ]
    )
An incident is a “smart” action: it will only do something when the condition is met
When at least 5 errors...
...happen over a 10-second period
The Incident action will trigger the given actions
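
The "N errors within T seconds" condition is a sliding-window count. The class below is a hypothetical sketch of that idea, not Watchdog's implementation: the injectable clock and the reset after the actions have fired are assumptions made to keep the example small and testable.

```python
import time
from collections import deque


class Incident:
    """Run the actions once `errors` failures occur within `during` seconds."""

    def __init__(self, errors, during, actions, clock=time.time):
        self.errors = errors
        self.during = during
        self.actions = actions
        self.clock = clock       # injectable so the sketch can be tested
        self.failures = deque()  # timestamps of the recent failures

    def __call__(self, *args):
        now = self.clock()
        self.failures.append(now)
        # Forget the failures that fall outside the time window.
        while self.failures and now - self.failures[0] > self.during:
            self.failures.popleft()
        if len(self.failures) >= self.errors:
            for action in self.actions:
                action(*args)
            self.failures.clear()  # assumption: start counting again afterwards
            return True
        return False


# Simulated failure times in seconds: five close together, then a late one.
times = iter([0, 1, 2, 3, 4, 20])
sent = []
incident = Incident(errors=5, during=10,
                    actions=[lambda: sent.append("alert")],
                    clock=lambda: next(times))
print([incident() for _ in range(6)])
print(sent)
```

Only the fifth failure inside the window triggers the actions; the isolated failure at 20 seconds does not.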

Example: Ensuring a service is running
    from watchdog import *
    Monitor(
        Service(
            name="myservice-ensure-up",
            monitor=(
                HTTP(
                    GET="http://localhost:8000/",
                    freq=Time.ms(500),
                    fail=[
                        Incident(
                            errors=5,
                            during=Time.s(5),
                            actions=[
                                Restart("myservice-start.py")
                            ]
                        )
                    ]
                )
            )
        )
    ).run()
We test if we can GET http://localhost:8000 every 500ms
If it fails 5 times within 5 seconds
We kill and restart myservice-start.py
ffunctioninc.
Example: Monitoring system health
from watchdog import *Monitor (
Service(name = "system-health",monitor = (
SystemInfo(freq=Time.s(1),success = (
LogResult("myserver.system.mem", extract=lambda r,_:r["memoryUsage"]),LogResult("myserver.system.disk", extract=lambda
r,_:reduce(max,r["diskUsage"].values())),LogResult("myserver.system.cpu", extract=lambda r,_:r["cpuUsage"]),
)),Delta(
Bandwidth("eth0", freq=Time.s(1)),extract = lambda v:v["total"]["bytes"]/1000.0/1000.0,success = [LogResult("myserver.system.eth0.sent")]
),SystemHealth(
cpu=0.90, disk=0.90, mem=0.90,freq=Time.s(60),fail=[Log(path="watchdog-system-failures.log")]
),)
)).run()
ffunctioninc.
Monitoring system health
from watchdog import *Monitor (
Service(name = "system-health",monitor = (
SystemInfo(freq=Time.s(1),success = (
LogResult("myserver.system.mem", extract=lambda r,_:r["memoryUsage"]),LogResult("myserver.system.disk", extract=lambda
r,_:reduce(max,r["diskUsage"].values())),LogResult("myserver.system.cpu", extract=lambda r,_:r["cpuUsage"]),
)),Delta(
Bandwidth("eth0", freq=Time.s(1)),extract = lambda v:v["total"]["bytes"]/1000.0/1000.0,success = [LogResult("myserver.system.eth0.sent")]
),SystemHealth(
cpu=0.90, disk=0.90, mem=0.90,freq=Time.s(60),fail=[Log(path="watchdog-system-failures.log")]
),)
)).run()
ffunctioninc.
Monitoring system health
from watchdog import *Monitor (
Service(name = "system-health",monitor = (
SystemInfo(freq=Time.s(1),success = (
LogResult("myserver.system.mem", extract=lambda r,_:r["memoryUsage"]),LogResult("myserver.system.disk", extract=lambda
r,_:reduce(max,r["diskUsage"].values())),LogResult("myserver.system.cpu", extract=lambda r,_:r["cpuUsage"]),
)),Delta(
Bandwidth("eth0", freq=Time.s(1)),extract = lambda v:v["total"]["bytes"]/1000.0/1000.0,success = [LogResult("myserver.system.eth0.sent")]
),SystemHealth(
cpu=0.90, disk=0.90, mem=0.90,freq=Time.s(60),fail=[Log(path="watchdog-system-failures.log")]
),)
)).run()
SystemInfo will retrievesystem information andreturn it as a dictionary
SystemInfo will retrievesystem information andreturn it as a dictionary
We log each result by extracting the given value from the result dictionary (memoryUsage, diskUsage, cpuUsage):
    LogResult("myserver.system.mem=", extract=lambda r,_:r["memoryUsage"]),
    LogResult("myserver.system.disk=", extract=lambda r,_:reduce(max,r["diskUsage"].values())),
    LogResult("myserver.system.cpu=", extract=lambda r,_:r["cpuUsage"]),
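To make the extract step concrete, here is a minimal sketch in plain Python (it does not import watchdog) of what those three lambdas compute. Only the key names memoryUsage, diskUsage and cpuUsage come from the slide; the sample values, the mount points and the format_entry helper are assumptions made for illustration, and how watchdog itself formats the logged line is not shown in the talk. Note that reduce is a builtin in Python 2 but has to be imported from functools in Python 3.

```python
from functools import reduce  # builtin in Python 2, in functools in Python 3

# Hypothetical sample of what SystemInfo might return. Only the key names
# come from the slide; the values and mount points are invented.
sample_result = {
    "memoryUsage": 0.42,
    "diskUsage": {"/": 0.61, "/var": 0.87},
    "cpuUsage": 0.15,
}

def format_entry(name, extract, result):
    """Hypothetical helper: append the extracted value to the entry name."""
    # The second lambda argument is ignored, as in the slide's lambdas.
    return "%s%s" % (name, extract(result, None))

# Same names and extract functions as on the slide.
entries = [
    ("myserver.system.mem=", lambda r, _: r["memoryUsage"]),
    ("myserver.system.disk=", lambda r, _: reduce(max, r["diskUsage"].values())),
    ("myserver.system.cpu=", lambda r, _: r["cpuUsage"]),
]

lines = [format_entry(name, extract, sample_result) for name, extract in entries]
for line in lines:
    print(line)
```

The disk entry reports the fullest partition, since reduce(max, ...) keeps the largest value of the diskUsage dictionary.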
Bandwidth collects live traffic information from the network interface:
    Bandwidth("eth0", freq=Time.s(1)),
But we don't want the total amount, we just want the difference. Delta does just that:
    Delta(
        Bandwidth("eth0", freq=Time.s(1)),
        extract = lambda _:_["total"]["bytes"]/1000.0/1000.0,
        success = [LogResult("myserver.system.eth0.sent")]
    ),
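The idea behind Delta can be pictured with a small stand-alone sketch: keep the previous sample and report only the difference. DeltaTracker is a hypothetical class written for this illustration, not watchdog's implementation, and the byte counts are invented. The conversion to megabytes is the same as the extract lambda on the slide.

```python
class DeltaTracker(object):
    """Hypothetical helper: turns a cumulative counter into per-sample deltas."""

    def __init__(self):
        self.previous = None

    def update(self, value):
        """Return value minus the previous sample, or None for the first sample."""
        delta = None if self.previous is None else value - self.previous
        self.previous = value
        return delta

def to_megabytes(sample):
    """Same conversion as the slide's extract lambda: total bytes to megabytes."""
    return sample["total"]["bytes"] / 1000.0 / 1000.0

# Invented cumulative byte counts, as an interface counter might grow over time.
samples = [
    {"total": {"bytes": 1000000}},
    {"total": {"bytes": 3500000}},
    {"total": {"bytes": 4000000}},
]

tracker = DeltaTracker()
deltas = [tracker.update(to_megabytes(s)) for s in samples]
print(deltas)
```

The first sample has nothing to be compared with, so this sketch returns None for it; a real monitor could just as well skip it.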
We print the result as before:
    success = [LogResult("myserver.system.eth0.sent=")]
SystemHealth will fail whenever the usage is above the given thresholds:
    SystemHealth(
        cpu=0.90, disk=0.90, mem=0.90,
        freq=Time.s(60),
We'll log failures in a log file:
        fail=[Log(path="watchdog-system-failures.log")]
    ),
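As a rough illustration of a threshold check followed by a log-file action, here is a sketch that uses only the standard library. The exceeded function, the sample stats and the temporary directory are assumptions for this example; the thresholds and the log file name are the ones on the slide. It is not how watchdog implements SystemHealth or Log.

```python
import logging
import os
import tempfile

def exceeded(stats, thresholds):
    """Return the sorted names of the stats that are above their threshold."""
    return sorted(
        name for name, limit in thresholds.items()
        if stats.get(name, 0.0) > limit
    )

thresholds = {"cpu": 0.90, "disk": 0.90, "mem": 0.90}  # values from the slide
stats = {"cpu": 0.35, "disk": 0.95, "mem": 0.50}       # invented sample

# Write to a temporary directory so the sketch does not touch the current one.
log_path = os.path.join(tempfile.mkdtemp(), "watchdog-system-failures.log")
logger = logging.getLogger("system-health-sketch")
handler = logging.FileHandler(log_path)
logger.addHandler(handler)
logger.setLevel(logging.WARNING)

failed = exceeded(stats, thresholds)
if failed:
    # The "fail" action: record which stats crossed their threshold.
    logger.warning("system health check failed: %s", ", ".join(failed))

logger.removeHandler(handler)
handler.close()
```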
Watchdog: Decentralized architecture
[Diagram: APP SERVER, STATIC FILE SERVER, DB SERVER, each running its own watchdog (W)]
APP SERVER watchdog: ensures the app is running (pid & HTTP test)
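A pid test and an HTTP test can each be sketched in a few lines of standard-library Python 3. The function names, the timeout and the accepted status range are assumptions chosen for this illustration; the talk does not show how watchdog performs these checks. The pid test relies on POSIX semantics for signal 0.

```python
import errno
import os
import urllib.request

def pid_is_running(pid):
    """Return True if a process with this pid exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence, nothing is sent
    except OSError as e:
        # EPERM means the process exists but belongs to another user.
        return e.errno == errno.EPERM
    return True

def http_is_up(url, timeout=5.0):
    """Return True if the URL answers with a 2xx or 3xx status in time."""
    try:
        response = urllib.request.urlopen(url, timeout=timeout)
        try:
            return 200 <= response.getcode() < 400
        finally:
            response.close()
    except Exception:
        # Connection refused, timeout, HTTP error, malformed URL: all count as down.
        return False

print(pid_is_running(os.getpid()))
```

A monitor would typically combine both: the pid test catches a crashed process, while the HTTP test catches a process that is alive but no longer answering.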
STATIC FILE SERVER watchdog: ensures the static file server is running and has low latency
DB SERVER watchdog: ensures the DB is running and that queries are not too slow
Watchdog: Centralized architecture
[Diagram: APP SERVER, STATIC FILE SERVER, DB SERVER, plus a PLATFORM SERVER running the watchdog (W)]
PLATFORM SERVER watchdog: does high-level (HTTP, SQL) queries on the servers and executes actions remotely when problems are detected
Watchdog: Deploying on Ubuntu
# upstart - Watchdog Configuration File
# =====================================
# updated: 2011-02-28

description "Watchdog - service monitoring daemon"
author      "Sebastien Pierre <[email protected]>"

start on (net-device-up and local-filesystems)
stop on runlevel [016]

respawn

script
    # NOTE: Change this to wherever the watchdog is installed
    WATCHDOG_HOME=/opt/services/watchdog
    cd $WATCHDOG_HOME
    # NOTE: Change this to wherever your custom watchdog script is installed
    python watchdog.py
end script

console output
# EOF
Save this file as /etc/init/watchdog.conf
Watchdog: Overview
Monitoring DSL: declarative programming to define a monitoring strategy
Wide spectrum: from data collection to incident detection
Flexible: does not impose a specific architecture
Watchdog: Use cases
Ensure service availability: test and stop/restart when problems occur
Collect system statistics: log or send data through the network
Alert on system or service health: take action when system stats are above a threshold
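The first use case, test then restart, boils down to a small control loop. The sketch below is a generic illustration with hypothetical names (ensure_running, max_restarts, the simulated service); it does not use or reproduce watchdog's API.

```python
def ensure_running(check, restart, max_restarts=3):
    """Call check(); while it fails, call restart() and check again.

    Returns True once check() passes, False if it still fails after
    max_restarts restarts.
    """
    restarts = 0
    while not check():
        if restarts >= max_restarts:
            return False
        restart()
        restarts += 1
    return True

# Simulated service: down at first, comes back up after one restart.
service = {"up": False, "restarts": 0}

def check_service():
    return service["up"]

def restart_service():
    service["restarts"] += 1
    service["up"] = True

recovered = ensure_running(check_service, restart_service)
print(recovered, service["restarts"])
```

Capping the number of restarts keeps a permanently broken service from being restarted forever; at that point alerting is the more useful action.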
Watchdog: What's coming?
ZeroMQ channels: data streaming and inter-watchdog communication
Documentation: only the basics for now, needs more love!
Contributors? The codebase is small and clear, start hacking!
Get started!
On GitHub: http://github.com/sebastien/watchdog
1 Python file, documented API