MONITORING @ SCALE
CLOUD SCALE
files.meetup.com/10485232/graphite.pdf
Sławomir Skowron Devops @ BaseCRM
Devops Kraków 2015
OUTLINE
• What is Graphite?
• Graphite architecture
• Additional components
• Current production setup
• Writes
• Reads
• BaseCRM graphite evolution
• Data migrations and recovery
• Multi region
• Dashboards management
• Future work
WHAT IS GRAPHITE?
Monitoring system focused on:
• Simple storage of metric time series
• Rendering graphs from time series on demand
• API with functions
• Dashboards
• Huge number of tools and 3rd-party products based on Graphite
WHY ALL THIS?
LET'S LOOK AT AN EXAMPLE IN GRAFANA
GRAPHITE ARCHITECTURE
• Graphite-Web - Django web application with JS frontend
  • Dashboards (DB to save dashboards)
  • API with functions, server-side graph rendering
• Carbon - Twisted daemon
  • carbon-relay - hashes / routes metrics
  • carbon-aggregator - aggregates metrics
  • carbon-cache - "memory cache", persists metrics to disk
• Whisper - simple time series DB
  • resolution given in seconds per point
Data points are sent as: metric name + value + Unix epoch timestamp
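As an illustration of the data point format above, here is a minimal sketch of building one line of Carbon's plaintext protocol (name, value and timestamp separated by spaces, newline-terminated). The metric name and the helper `format_plaintext` are made up for this example and are not from the slides.

```python
import time
from typing import Optional


def format_plaintext(metric: str, value: float, timestamp: Optional[int] = None) -> str:
    """Format one data point as a Carbon plaintext-protocol line.

    The line is "<metric name> <value> <unix epoch timestamp>\n".
    """
    if timestamp is None:
        # Default to "now" as whole seconds since the Unix epoch.
        timestamp = int(time.time())
    return f"{metric} {value} {timestamp}\n"


# Hypothetical metric name, fixed timestamp so the result is reproducible.
line = format_plaintext("servers.web01.cpu.user", 12.5, 1425000000)
print(line, end="")
```

A client would write such lines to the relay's plaintext listener over TCP; the host and port depend on the deployment.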
ADDITIONAL COMPONENTS
• Diamond - https://github.com/BrightcoveOS/Diamond
  • Python daemon
  • over 120 collectors
  • simple collector development
  • used for OS and generic service monitoring
• Statsd - https://github.com/etsy/statsd/
  • Node.js daemon
  • aggregates counters, sets, gauges and timers, then sends them to Graphite
  • many client libraries
  • used for in-app metrics
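To make the metric types above concrete, here is a small sketch of the StatsD line format (`<bucket>:<value>|<type>`) that client libraries emit, usually over UDP. The bucket names and the helper `statsd_line` are hypothetical; real applications would normally use an existing client library rather than hand-rolled formatting.

```python
# Type suffixes used by the StatsD line protocol.
STATSD_TYPES = {
    "counter": "c",
    "gauge": "g",
    "timer": "ms",
    "set": "s",
}


def statsd_line(bucket: str, value, metric_type: str) -> str:
    """Build one StatsD protocol line, e.g. 'app.logins:1|c'."""
    return f"{bucket}:{value}|{STATSD_TYPES[metric_type]}"


# Hypothetical in-app metrics.
examples = [
    statsd_line("app.logins", 1, "counter"),
    statsd_line("app.queue_depth", 42, "gauge"),
    statsd_line("app.request_time", 320, "timer"),
    statsd_line("app.unique_users", "user-17", "set"),
]
for example in examples:
    print(example)
```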
CURRENT PRODUCTION SETUP
WRITES
ALL PROVISIONED BY ANSIBLE
Which components failed to work at scale?
carbon-relay - switch to carbon-c-relay
carbon-cache - switch to PyPy
REPLACEMENT
• carbon-c-relay - https://github.com/grobian/carbon-c-relay
  • written in C
  • replacement for the Python carbon-relay
  • high performance
  • multi-cluster support (traffic replication)
  • traffic load-balancing
  • traffic hashing
  • aggregation and rewrites
IMPROVE
carbon-cache
Switch to PyPy (2.4, currently 2.5)
40-50% less CPU usage on carbon-cache
CURRENT PRODUCTION SETUP - WRITES
VMs report to the ELB
Round-robin to the Top Relays
Consistent hash with replication factor 2
Any carbon-cache on each store instance
Write to the local Whisper store volume
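The "consistent hash with replication 2" step can be sketched as a hash ring that returns two distinct stores per metric. This is a simplified illustration only: it is not the exact algorithm used by carbon or carbon-c-relay, and the class `HashRing`, the store names and the virtual-node count are assumptions made for the example.

```python
import bisect
import hashlib


class HashRing:
    """Simplified consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=100):
        # Each node is placed on the ring many times to even out the load.
        self._ring = sorted(
            (self._hash(f"{node}:{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._keys = [position for position, _ in self._ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest()[:8], 16)

    def get_nodes(self, metric: str, replication: int = 2):
        """Return `replication` distinct nodes, walking clockwise on the ring."""
        start = bisect.bisect(self._keys, self._hash(metric))
        chosen = []
        for offset in range(len(self._ring)):
            node = self._ring[(start + offset) % len(self._ring)][1]
            if node not in chosen:
                chosen.append(node)
                if len(chosen) == replication:
                    break
        return chosen


# Hypothetical store names; the slides mention 5 store instances.
ring = HashRing([f"store{n}" for n in range(1, 6)])
targets = ring.get_nodes("servers.web01.cpu.user", replication=2)
print(targets)
```

The same metric name always maps to the same pair of stores, which is what lets each data point land in two Whisper copies.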
CURRENT PRODUCTION SETUP - WRITES
450+ instances as clients (diamond, statsd, other) report at 30-second intervals
carbon-c-relay as Top Relay
• 3.5-4 million metrics / min
• 20% CPU usage on each
• batch send (20k metrics)
• queue of 10 million metrics
carbon-c-relay as Local Relay
• 7-8 million metrics / min
• batch send (20k metrics)
• queue of 5 million metrics
carbon-cache with PyPy (50% less CPU)
• 7-8 million metrics / min
• each point update takes 0.13-0.15 ms
250K-350K write IOPS
5-6 million Whisper DB files (2 copies)
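One way to read these numbers, assuming the doubling between the Top Relay and Local Relay rates comes from the replication factor of 2 (an interpretation, not something the slides state explicitly), is the following back-of-the-envelope calculation:

```python
# Figures taken from the slide; the replication factor explains the doubling
# only under the assumption stated above.
REPLICATION = 2

top_relay_per_min = (3.5e6, 4.0e6)   # metrics/min entering the Top Relays
whisper_files = (5e6, 6e6)           # Whisper files on disk, both copies

# Every point is forwarded to two stores, so local relays see twice the rate.
local_per_min = tuple(rate * REPLICATION for rate in top_relay_per_min)

# The same rate expressed per second.
local_per_sec = tuple(rate / 60 for rate in local_per_min)

# Two copies of each file implies roughly half as many unique metrics.
unique_metrics = tuple(count / REPLICATION for count in whisper_files)

print(f"local relays: {local_per_min[0]:,.0f} - {local_per_min[1]:,.0f} metrics/min")
print(f"local relays: {local_per_sec[0]:,.0f} - {local_per_sec[1]:,.0f} metrics/s")
print(f"unique metrics: {unique_metrics[0]:,.0f} - {unique_metrics[1]:,.0f}")
```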
CURRENT PRODUCTION SETUP - WRITES
Graphite hashing: usable space and performance are capped by the weakest host in the cluster
CURRENT PRODUCTION SETUP - WRITES
• Minimise the number of other processes and CPU usage
• CPU offload
  • carbon-c-relay - low CPU
  • Batch writes
  • Separate webs for clients from store hosts
  • Focus on carbon-cache (write) + graphite-web (read)
• Leverage OS memory for carbon-cache
• RAID0 for more write performance - we have a replica
• Focus on IOPS - low service time
• Time must always be in sync
CURRENT PRODUCTION SETUP
READS
CURRENT PRODUCTION SETUP - READS
Web front dashboard based on Graphite-Web
Graphite-web Django backend as API for Grafana
Couchbase as cache for graphite-web metrics
Each store exposes an API via graphite-web
Average response <300 ms
Nginx on top, behind the ELB
Webs calculate functions, stores serve raw metrics (CPU offload)
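For readers unfamiliar with the read path, the sketch below builds a request URL for graphite-web's render API using its `target`, `from` and `format` parameters. The base URL, the metric names and the helper `render_url` are placeholders for illustration, not values from the talk.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def render_url(base_url: str, targets, frm: str = "-1h", fmt: str = "json") -> str:
    """Build a graphite-web /render URL for one or more targets."""
    params = [("target", target) for target in targets]
    params += [("from", frm), ("format", fmt)]
    # urlencode percent-escapes characters such as '*' and parentheses.
    return f"{base_url}/render?{urlencode(params)}"


# Hypothetical host and metric names. A function such as sumSeries() is
# evaluated by the web tier, which fetches raw series from the stores.
url = render_url(
    "http://graphite.example.com",
    ["sumSeries(servers.*.cpu.user)", "servers.web01.load.load01"],
)
query = parse_qs(urlsplit(url).query)
print(url)
print(query["target"])
```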
BASECRM GRAPHITE
EVOLUTION
BASECRM GRAPHITE EVOLUTION
• PoC with external EBS and C3 instances
  • graphite-web, carbon-relay, carbon-cache, whisper files on EBS
• Production started on i2.xlarge
  • 5 store instances - 800GB SSD, 30GB RAM, 4 CPUs
  • 4 carbon-caches on each store
  • Same software as in PoC
  • Problems with replacing machines and migrating to a bigger cluster
  • dash-maker to manage complicated dashboards
• Next with i2.4xlarge
  • 5 store instances - 2x800GB in RAID0, 60GB RAM, 8 CPUs
  • 8 carbon-caches on each store
  • carbon-c-relay as Top and Local relay
  • Recovery tool to recover data from the old cluster
  • Grafana as second dashboard interface
• Current with i2.8xlarge - latest bump
  • 5 store instances - 4x800GB in RAID0, 120GB RAM, 16 CPUs
  • 16 carbon-caches on each store
DATA MIGRATION &
RECOVERY
DATA MIGRATION & RECOVERY
Replicate traffic
Copy old Whisper files for the metrics the new cluster has created
5 instances with 1 Gbit/s each - recovery tops out at 4.5 Gbit/s using HTTP
Switch traffic on the ELB
Remove the old cluster
MULTI REGION
METRICS COLLECTING
DASH-MAKER
INTERNAL DASHBOARDS MANAGEMENT
DASH-MAKER
• Manage dashboards like never before
• Template everything with Jinja2 (all Jinja2 features)
• Dashboard config - one YAML file with Jinja2 support
• Reusable graphs - JSON definitions like in graphite-web, with Jinja2 support
• Global key=value pairs for Jinja2
• Dynamic Jinja2 vars expanded from Graphite (last * in metric name)
• Many dashboard variants from one config, based on loop vars
• Supports Graphite 0.9.12, 0.9.12 (evernote), 0.9.13, 0.10.0
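dash-maker is an internal tool and its code is not shown, so the following is only a guess at what "dynamic vars expanded from Graphite (last * in metric name)" could look like: given a wildcard pattern and a list of metric names (as a Graphite metric search might return), collect the distinct values matched by the last `*`. The function `expand_last_wildcard` and the metric names are hypothetical.

```python
import re


def expand_last_wildcard(pattern: str, metric_names):
    """Return sorted distinct values matched by the last '*' in `pattern`.

    Assumes `pattern` contains at least one '*' path component; a pattern
    without one raises ValueError.
    """
    parts = pattern.split(".")
    last_star = max(i for i, part in enumerate(parts) if part == "*")
    # Each '*' matches exactly one dot-separated path component.
    regex = re.compile(
        r"\.".join("[^.]+" if part == "*" else re.escape(part) for part in parts)
        + "$"
    )
    values = set()
    for name in metric_names:
        if regex.match(name):
            values.add(name.split(".")[last_star])
    return sorted(values)


# Hypothetical metric names standing in for a Graphite lookup result.
names = [
    "prod.rabbitmq.mq01.queues.orders",
    "prod.rabbitmq.mq01.queues.emails",
    "prod.rabbitmq.mq02.queues.orders",
    "prod.redis.r01.keys",
]
queues = expand_last_wildcard("prod.rabbitmq.*.queues.*", names)
print(queues)
```

A template could then loop over such a list to emit one graph or dashboard per value.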
DASH-MAKER
$ dash-maker -f rabbitmq-server.yml
23:54:55 - dash-maker - main():Line:292 - INFO - Time [templating: 0.023229 build: 0.000177 save: 2.546407] Dashboard dm.us-east-1.production.rabbitmq.server saved with success
23:54:58 - dash-maker - main():Line:292 - INFO - Time [templating: 0.017746 build: 0.000057 save: 2.549711] Dashboard dm.us-east-1.sandbox.rabbitmq.server saved with success
FUTURE WORK AND PROBLEMS
• Dash-maker with Grafana support
• Out-of-band fast aggregation with anomaly detection
• Graphite with hashing is not elastic - InfluxDB? Production-ready in March?
• In the future, one dashboard - Grafana + InfluxDB?