Nick Galbreath http://client9.com/20130501 @ngalbreath
Care and Feeding of Large Scale
Graphite Installations
Nick Galbreath ★ IPONWEB ★ DevOpsDays ★ Austin Texas ★ 2013-04-30
http://client9.com/20130501
Who is nickg?
(that's online advertising infrastructure)
• Over One Billion Points collected daily.
• In "many" independent Graphite clusters.
so, graphite?
Who cares?
• Making it easy to create, analyze and share data can change your organization
• Making a data-driven culture
• Empowering developers, operations, QA, security and business to be more confident in the changes they make.
What is it you say you do here?
• Your job is likely invisible to the rest of the organization
• invisible things aren't valued
• so make what you do visible
Why Graphite?
• Many innovations in each part of the stack
• But, it's the Full Stack that really makes it special.
• On-disk layout to UI to API to... the community around it.
Sharing is Caring
• Allows data to be easily accessed
• And easily shared. This makes it different than many monitoring solutions.
• It's your own in-house mashup generator.
What is it?
The Big Picture
• It's a database
• But not ACID
• And without all the database tools
Installation
• 3 python-twistd servers
• "carbon-cache"
• "carbon-aggregator"
• "carbon-relay"
• Apache / Django Web UI and API
• Uses SQLite3/MySQL for dashboards / events
Be Current
• Don't use the OS default version.
• Newer point releases of graphite have significant improvements in storage engine and webui/api
• It's 100% Python so "building it yourself" shouldn't be too hard.
• pip install works and is current.
The Documentation
• everyone complains about it
• historically bad, but getting a lot better
• Switched locations, but not all search engines are updated to use: http://graphite.readthedocs.org/
• Source code is quite good, so RTFS
Storage Engine
What is it?
• Storage engine
• Handles reads, writes and creates of a single metric to a fixed size file.
• One file, kinda dumb (good).
• Here's the API: https://github.com/graphite-project/whisper/blob/master/whisper.py
Graphite Math
• About 12 bytes per point.
• Store 1 minute points for 1 month and 15 minutes for 11 months.
• (60×24×30 + 4×24×30×11) ×12 = 878kB
• If you can keep all your points in memory, then magic!
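The arithmetic above can be checked with a few lines of Python. This is only a sketch of the slide's estimate: it takes the 12 bytes per point figure as given, treats a month as 30 days, and ignores the small Whisper file header. The helper name is made up for illustration.

```python
# Rough Whisper file-size estimate, using the slide's ~12 bytes per point.
BYTES_PER_POINT = 12

def retention_points(seconds_per_point, days):
    """Number of points needed to cover `days` at the given resolution."""
    return (24 * 60 * 60 // seconds_per_point) * days

# 1-minute points for 30 days, then 15-minute points for 11 months of 30 days.
points = retention_points(60, 30) + retention_points(15 * 60, 30 * 11)
size_bytes = points * BYTES_PER_POINT

print(points)             # 74880
print(size_bytes)         # 898560
print(size_bytes / 1024)  # 877.5, i.e. the slide's ~878kB per metric
```

Multiply by your metric count to see whether the whole working set fits in the OS disk cache.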
Disk Layout
• Each metric creates a directory tree
server123.myapp.logins.failed
• Makes 3 directories
• This creates a very branchy directory structure
• This has good and bad points.
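A minimal sketch of the mapping from metric name to file, to make the "3 directories" concrete. The storage root below is an assumption (a common default install location); the function name is made up for illustration and is not Graphite's own code.

```python
import posixpath

def metric_to_path(metric, root="/opt/graphite/storage/whisper"):
    """Map a dotted metric name to the .wsp file that would hold it.

    `root` is an assumed storage directory; adjust it to your install.
    """
    parts = metric.split(".")
    # Every component but the last becomes a directory; the last is the file.
    return posixpath.join(root, *parts) + ".wsp"

path = metric_to_path("server123.myapp.logins.failed")
print(path)
# /opt/graphite/storage/whisper/server123/myapp/logins/failed.wsp
```

One metric per file also means that "how many metrics do I have" becomes a filesystem walk.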
Middleware
carbon-cache
Part 1
Carbon Cache
• Metrics go in
• plain-text lines ("metric value timestamp") or Python pickle format
• TCP
• Metrics go to disk
• whisper
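A minimal sketch of getting a metric in over TCP with Carbon's plain-text line protocol (one "path value timestamp" line per point). The host and port are assumptions: 2003 is the usual default for the plain-text listener, so check your carbon.conf. The function names are made up for illustration.

```python
import socket
import time

def format_metric(path, value, timestamp=None):
    """Build one plain-text protocol line: 'path value timestamp\\n'."""
    if timestamp is None:
        timestamp = int(time.time())
    return "%s %s %d\n" % (path, value, timestamp)

def send_metric(line, host="localhost", port=2003, timeout=5.0):
    """Send one formatted line to carbon-cache (host and port are assumptions)."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(line.encode("ascii"))

line = format_metric("server123.myapp.logins.failed", 3, 1367280000)
print(repr(line))  # 'server123.myapp.logins.failed 3 1367280000\n'
# send_metric(line)  # uncomment when a carbon-cache is listening
```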
Write Buffer
• Most important feature is write buffering to protect the disk.
• Data is buffered and written out once per minute (or so).
• But
The Cache
• It's a write cache.
• Once data is written, it's out of the cache
• In other words, the cache is metrics not on disk.
• If the cache dies, you lose metrics
• (btw: the read cache is the OS disk cache)
New Metrics
• New metrics are created automatically
• But, it is very expensive.
• MAX_CREATES_PER_MINUTE = 50
• Saves your disk, but new metrics will "pile up" in cache.
• May take 10m+ for your metrics to start flowing....
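The "10m+" delay follows directly from the create limit. A back-of-the-envelope sketch, assuming creates are the only bottleneck and the backlog drains at exactly the configured rate (the function name is made up for illustration):

```python
import math

def minutes_until_created(new_metrics, max_creates_per_minute=50):
    """Worst-case minutes before the last of `new_metrics` gets its file."""
    return math.ceil(new_metrics / max_creates_per_minute)

print(minutes_until_created(500))   # 10
print(minutes_until_created(5000))  # 100
```

Deploying a new server or app that emits a few thousand new metric names is enough to hit this.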
FALLOCATE
• WHISPER_FALLOCATE_CREATE = True
• Linux Kernel >= 2.6.23
• fallocate is used to preallocate blocks to a file. For filesystems which support the fallocate system call, this is done quickly by allocating blocks and marking them as uninitialized, requiring no IO to the data blocks. This is much faster than creating a file by filling it with zeros.
• https://bugs.launchpad.net/whisper/+bug/957827
Limit the Size
Limit the size of the cache to avoid swapping or becoming CPU bound.
Sorts and serving cache queries gets more expensive as the cache grows.
Use the value "inf" (infinity) for an unlimited cache size.
MAX_CACHE_SIZE = inf
No! Infinity does not exist on your system!
Graphite for Graphite
By default, carbon itself will log statistics (such as a count, metricsReceived) with the top level prefix of 'carbon' at an interval of 60 seconds. Set CARBON_METRIC_INTERVAL to 0 to disable instrumentation
CARBON_METRIC_PREFIX = carbon
CARBON_METRIC_INTERVAL = 60
Stats on Stats!
Middleware
carbon-aggregator
Part 2
Pre-Aggregation
• Sum or Average metrics based on wildcards and regexps
• Helps eliminate very slow queries on webui
• You can emit the final sum & all the individual components or just the final sum (via blacklists)
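To illustrate the idea only: the sketch below sums every metric matching a wildcard into one output series, so the web UI never has to expand the wildcard at query time. This is not carbon-aggregator's code (the real thing is driven by its rules configuration); the function name and sample metrics are made up for illustration.

```python
from collections import defaultdict
from fnmatch import fnmatchcase

def aggregate_sum(points, pattern, output_name):
    """Sum values of metrics matching `pattern`, per timestamp.

    points: iterable of (metric_name, timestamp, value) tuples.
    Returns {output_name: {timestamp: total}}.
    """
    totals = defaultdict(float)
    for name, timestamp, value in points:
        if fnmatchcase(name, pattern):
            totals[timestamp] += value
    return {output_name: dict(totals)}

points = [
    ("server001.myapp.logins.failed", 60, 2),
    ("server002.myapp.logins.failed", 60, 5),
    ("server001.myapp.logins.ok", 60, 40),
]
result = aggregate_sum(points, "server*.myapp.logins.failed", "all.myapp.logins.failed")
print(result)  # {'all.myapp.logins.failed': {60: 7.0}}
```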
destination
• r/w to localhost
• split metrics to other aggregators
• Design your own system
Along the way
• renaming of metrics
• whitelist and blacklist of aggregation and metrics
Also...
• has support for broadcasting data to multiple downstream caches
• but.. never used it.. and seems at odds with the next middleware
Middleware
carbon-relay
Part 3
It's a Router!
• Consistent Hashing (Sharding)
• Or more rule-based routing
• Output to multiple carbon servers
have not really used it much, but should work similarly to scale outs of memcache, redis
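For readers who have not met consistent hashing, here is a generic sketch of the technique, in the style used for memcache-like scale-outs. It is not carbon-relay's implementation; the class, the replica count, the hash choice and the node names are all assumptions for illustration.

```python
import bisect
import hashlib

class HashRing:
    """Generic consistent-hash ring (illustrative, not carbon-relay's code)."""

    def __init__(self, nodes, replicas=100):
        # Each node gets `replicas` points on the ring to even out the load.
        self._ring = sorted(
            (self._hash("%s:%d" % (node, i)), node)
            for node in nodes
            for i in range(replicas)
        )
        self._keys = [key for key, _node in self._ring]

    @staticmethod
    def _hash(text):
        return int(hashlib.sha256(text.encode("utf-8")).hexdigest(), 16)

    def get_node(self, metric):
        """Return the node owning the first ring point after the metric's hash."""
        idx = bisect.bisect(self._keys, self._hash(metric)) % len(self._keys)
        return self._ring[idx][1]

ring = HashRing(["carbon-a:2004", "carbon-b:2004", "carbon-c:2004"])  # hypothetical nodes
print(ring.get_node("server123.myapp.logins.failed"))
```

The property that matters: when a node is added or removed, only the metrics owned by that node move; everything else keeps landing on the same server (and the same Whisper files).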
Middleware
StatsD
Part 4
StatsD
• https://github.com/etsy/statsd/
• nodejs based but lots of other implementations
• Receives UDP, sends Graphite-compatible output, flushed periodically.
• Aggregation for all by default
• Besides sum, it can also compute other basic statistics (mean, 90th percentile), do sampling, have counters, etc.
StatsD use case
• It's UDP based, so it excels at embedding a client inside the application
• UDP can't block or break the sending application
• Not so good for bulk metrics
• Use both! Can work together with aggregator.
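A minimal sketch of an embedded StatsD client, to show why UDP is so forgiving for the sending application. It uses the StatsD counter line format ("name:value|c"); the host and port are assumptions (8125 is the usual default), and the function names are made up for illustration.

```python
import socket

def statsd_counter(name, value=1):
    """Format a StatsD counter datagram: 'name:value|c'."""
    return ("%s:%d|c" % (name, value)).encode("ascii")

def send_udp(payload, host="127.0.0.1", port=8125):
    """Fire-and-forget: no connection to set up, no reply to wait for."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(payload, (host, port))
    finally:
        sock.close()

payload = statsd_counter("myapp.logins.failed")
print(payload)  # b'myapp.logins.failed:1|c'
# send_udp(payload)  # uncomment to emit the datagram
```

The trade-off is the flip side of the same property: a dropped datagram is silently lost, which is why bulk loads are better sent over TCP to carbon.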
Of Note
• https://github.com/armon/statsite
• Need to look at this more
• C + libev based
• modern time series algorithms
• very flexible output
Backups
Do you really need them?
http://bit.ly/11sPhNz
Backup
• A naive backup makes Graphite performance go to crap.
• File system cache is trashed
• Metrics are not written to disk (lag)
• If OOM occurs then you lose metrics.
Do you need to save everything?
nice is good
http://www.beenthereyet.net/nice-france
If you are doing your own backup....
ionice is better
IONICE(1) User Commands IONICE(1)
NAME
ionice - set or get process I/O scheduling class and priority
SYNOPSIS
ionice [-c class] [-n level] [-t] -p PID...
ionice [-c class] [-n level] [-t] command [argument...]
DESCRIPTION
This program sets or gets the I/O scheduling class and priority for a program. If no arguments or just -p is given, ionice will query the current I/O scheduling class and priority for that process.
When command is given, ionice will run this command with the given arguments. If no class is specified, then command will be executed with the "best-effort" scheduling class. The default priority level is 4.
NOTES
Linux supports I/O scheduling priorities and classes since 2.6.13 with the CFQ I/O scheduler.
util-linux July 2011 IONICE(1)
Even Better
• Just write the metrics to two graphite servers in your client
• Script to copy / resync "holes" when restoring.
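A minimal sketch of the dual-write idea: the client sends each plain-text line to every Graphite server and remembers which ones failed, so the "holes" can be resynced later. The server names in the usage comment are hypothetical and the function name is made up for illustration.

```python
import socket

def send_to_all(line, servers, timeout=2.0):
    """Send one plain-text metric line to every server.

    servers: list of (host, port) pairs.
    Returns the list of servers that could not be reached.
    """
    failed = []
    for host, port in servers:
        try:
            with socket.create_connection((host, port), timeout=timeout) as sock:
                sock.sendall(line.encode("ascii"))
        except OSError:
            # Do not let one dead server stop the write to the other.
            failed.append((host, port))
    return failed

# Hypothetical pair of independent Graphite servers:
# send_to_all("myapp.logins.failed 3 1367280000\n",
#             [("graphite-a.example.com", 2003), ("graphite-b.example.com", 2003)])
```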
Monitoring
WebUI
• Hey, it's a web server
• do all the usual stuff
• Ask for known stats,
• check for 200
• check for valid json output
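The checks above can be sketched as a small probe against the render API with format=json. The base URL and the target in the usage comment are assumptions (carbon's own stats are a handy "known stat" if instrumentation is on); the function names are made up for illustration.

```python
import json
import urllib.parse
import urllib.request

def is_valid_series(body):
    """True if body is a non-empty JSON list of series, each with 'datapoints'."""
    try:
        data = json.loads(body)
    except ValueError:
        return False
    return (isinstance(data, list) and len(data) > 0
            and all(isinstance(s, dict) and "datapoints" in s for s in data))

def check_render(base_url, target, timeout=10):
    """Ask for a known stat; pass only on HTTP 200 plus valid JSON output."""
    query = urllib.parse.urlencode(
        {"target": target, "from": "-10min", "format": "json"})
    url = "%s/render?%s" % (base_url.rstrip("/"), query)
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8")
            return resp.status == 200 and is_valid_series(body)
    except (OSError, ValueError):
        # Connection errors, timeouts and HTTP 4xx/5xx all count as a failure.
        return False

# check_render("http://graphite.example.com", "carbon.agents.*.metricsReceived")
```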
Mistakes in the URL or in the use of functions cause a Server 500
Old Stats
• Don't forget to kill off old metrics
• no updates in X days? kill.
• Exercise in "find" left to reader
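For those who would rather not do the exercise in "find", a Python sketch of the same idea: list Whisper files whose modification time is older than X days. It only reports candidates; deleting is left to the caller. The storage path in the usage comment is an assumption and the function name is made up for illustration.

```python
import os
import time

def stale_whisper_files(root, max_age_days, now=None):
    """Yield .wsp files under `root` not modified in the last `max_age_days`."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            if filename.endswith(".wsp"):
                path = os.path.join(dirpath, filename)
                if os.path.getmtime(path) < cutoff:
                    yield path

# Review the list before removing anything:
# for path in stale_whisper_files("/opt/graphite/storage/whisper", 30):
#     print(path)
```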
MySQL
• If you use SQLite3 -- uhh nothing to monitor
• If you use MySQL -- use the regular suspects
• And don't forget to back up!!
CPU is stable
• ... except for apache usage
• consider moving apache to separate machine
Disk is Sensitive
Competing with other processes for disk does this
Means less written to disk
Metrics updated
Dangerous build up in cache
Cache Size
Rendering
Tune Apache
• By default, your Apache install is likely to be "unlimited" in CPU and Memory usage.
• Selecting a wildcard metric over a long time period can easily turn an httpd process into 1GB. (this seems like a bug actually)
• OOM death.
/version/
• Yes, ending "/" is required.
• Ok not that exciting but easy check
/metrics/expand/
• /metrics/expand/?query=server*
• {"results": ["server001", "server002", ... ]}
/events/
• Ad-Hoc Events that don't deserve their own metric type.
• has tags, time, and text
• Stored in SQLite3 by default by the webapp.
• Rest UI is primitive
The WebUI
• it's "ok".. good for experiments
• You will want to make your own dashboard.
• Good news! The API is a URL, so it's very easy.
WebUI Dashboards
• The WebUI has a dashboard feature for loading and saving graphs
• It saves data in SQLite3 by default
• Since it's there people will use it
• So hack to remove it, or switch to MySQL.
Granularity
• Like RRDTool, the resolution of the graph depends on number of pixels used. No sub-pixel rendering!
• Rapid spikes can be "averaged away" in week-long views in small graphs.
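A toy illustration of how a spike gets "averaged away" when many points share one pixel. This is a sketch of the effect, not Graphite's rendering code; the function names and data are made up. Where your Graphite version offers it, consolidating by max instead of average (see the consolidateBy function in the docs) keeps spikes visible.

```python
def consolidate(values, points_per_pixel, func):
    """Collapse each run of `points_per_pixel` values into one drawn value."""
    return [func(values[i:i + points_per_pixel])
            for i in range(0, len(values), points_per_pixel)]

def average(chunk):
    return sum(chunk) / len(chunk)

# One hour of 1-minute points: flat at 0 with a single spike of 600.
values = [0] * 59 + [600]

print(consolidate(values, 60, average))  # [10.0]  spike is nearly invisible
print(consolidate(values, 60, max))      # [600]   spike survives
```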
Vertical Line Technology
• Easy to make horizontal lines
• Not so clear how to make ad-hoc vertical lines
Turn "time since" into events
drawAsInfinite( removeAboveValue( keepLastValue( YOURMETRIC ), 120 ))
Turn Version Numbers into Events
drawAsInfinite( removeBelowValue( derivative( keepLastValue( YOURMETRIC )), 0.1 ))
Arbitrary Lines
drawAsInfinite( removeBelowValue( removeAboveValue( time("time"), timestamp), timestamp))
Really Long URLs
• Making a graph but the URL is so long that browsers clip it?
• Send query string data as a POST
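A minimal sketch of moving the query string into a POST body. It builds the request without sending it; the host is hypothetical, the target reuses the YOURMETRIC placeholder from the earlier slides, and the function name is made up for illustration.

```python
import urllib.parse
import urllib.request

def build_render_post(base_url, targets, params=None):
    """Build (but do not send) a POST request carrying the render parameters."""
    fields = [("target", target) for target in targets]
    fields += sorted((params or {}).items())
    body = urllib.parse.urlencode(fields).encode("ascii")
    return urllib.request.Request(
        base_url.rstrip("/") + "/render", data=body, method="POST")

request = build_render_post(
    "http://graphite.example.com",  # hypothetical host
    ["drawAsInfinite(removeAboveValue(keepLastValue(YOURMETRIC),120))"],
    {"from": "-1d", "format": "json"},
)
print(request.get_method(), request.full_url)
# urllib.request.urlopen(request) would send it
```

The parameters are the same ones that would have gone in the URL, so a dashboard can switch between GET and POST based on length alone.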
Client Side Rendering
• yeah...
• works ok with a small number of points
• crashes existing browsers with large number of points
• Server side faster in many cases!
• We'll try again in 2014
Colors and ChartJunk
• Default color scheme is gross
• Be kind to the handicapped (uhh, me) http://colorbrewer2.org/
• Good overview here: http://bit.ly/10Hu7zU
Looking for something to do?
Accelerate with PyPy
• JIT for Python
• ~ 5.9x performance improvement
• Actually works and is stable
• Compatible with twisted and Django
Accelerate with numpy
• numpy provides fast vector manipulation (C code)
• graphite web gui does a lot of vector manipulation
• hmmmm.....
Ceres Storage Engine
• "Eventually Fixed Size" storage
• More space efficient == more performance
• see http://blog.sweetiq.com/2013/01/using-ceres-as-the-back-end-database-to-graphite/
OpenTSDB
• Not Graphite, but similar in spirit
• Has "collectors" for basic ops stats
• Used by StumbleUpon, Box.net, Pinterest
• Good: Stores data in HBase/Hadoop
• Bad: Stores data in HBase/Hadoop
Add More Functions
• coarsen (I'm looking at you Ian Malpass, that's useful for client-side rendering)
• Real vertical lines (our hacks are stupid)
• Better operators (would be nice to know easily how many metrics you have, e.g. select count(*))
Mine the Apache Log
• Which stats are used the most?
• What are really slow queries?
• Can you optimize them?
• What time frames are used?
• How much old data do you really need to store?
(it's in the query string)
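A sketch of the first question, "which stats are used the most?", assuming a common/combined-format access log where the request line is quoted and render calls are GETs to /render. The sample lines are made up, and requests sent via POST will not show their targets in the log.

```python
import re
from collections import Counter
from urllib.parse import parse_qs, urlsplit

# Matches the quoted request line of a common/combined log entry.
REQUEST_RE = re.compile(r'"GET (\S+) HTTP/[\d.]+"')

def count_targets(log_lines):
    """Count how often each render target appears in the access log."""
    counts = Counter()
    for line in log_lines:
        match = REQUEST_RE.search(line)
        if not match:
            continue
        url = urlsplit(match.group(1))
        if not url.path.startswith("/render"):
            continue
        for target in parse_qs(url.query).get("target", []):
            counts[target] += 1
    return counts

sample = [
    '10.0.0.1 - - [30/Apr/2013:10:00:00 +0000] "GET /render?target=server001.cpu.user&from=-7d HTTP/1.1" 200 1234',
    '10.0.0.2 - - [30/Apr/2013:10:00:05 +0000] "GET /render?target=server001.cpu.user&target=server002.cpu.user HTTP/1.1" 200 2345',
    '10.0.0.2 - - [30/Apr/2013:10:00:09 +0000] "GET /version/ HTTP/1.1" 200 12',
]
counts = count_targets(sample)
print(counts.most_common(1))  # [('server001.cpu.user', 2)]
```

The same loop can tally the "from" parameter to answer the time-frame and retention questions.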
Add a TinyURL Feature
• The URLs get really long and are hard to put into email, etc.
• Build a tinyurl feature into the Django app and integrate it into the dashboard.
Write Docs
yeah you!
Nick Galbreath
http://www.client9.com/
http://www.iponweb.com/
Let's Make Some Graphs!