challenges of monitoring distributed systems - nenad bozic
TRANSCRIPT
Nenad Bozic, Co-founder and Technical Consultant @ SmartCat
Challenges of Monitoring Distributed Systems
Agenda
●Monitoring 101
●Metric data stream and tools
●Log data stream and tools
●Combine metrics and logs for full control
●Real world example
●SMART alerting
Monitoring 101
●Metrics data stream contains real-time data about application
(system components) performances
●Log data stream contains information about different activities and
events within the system
●Alerting is giving you the freedom not to watch graphs and logs all
the time
Metric data stream
●Metrics are indicators that everything is working within expected
boundaries
●Good dashboard has enough information (not too many, not to few)
●Metrics are indicator that something happened and logs provide
context what happened
Distributed system -> many graphs to watch -> information overload
Metric data stream - decision
●Cloud based solutions vs install yourself solutions
●Paying solutions vs free solutions
●Decision based on technical team skillset and security of your data
●Cloud-based: AppDynamics, AWS CloudWatch, DataDog
●Install yourself: Riemann, TICK stack
Metric data stream - stack
●Riemann as sinc that handles events and sends them to Riemann
server
●InfluxDB as NoSQL store which is build for measurements
●Grafana as visualization tool (flexible configurable graphs from
many data sources)
●Custom tools to use Riemann agent and send measurements to
Riemann server
Log data stream
●Log monitoring on single machine require skill and knowledge
●Same challenges as with metrics (not too many, not to few)
●Metrics are indicator that something happened and logs provide
context what happened
Distributed system -> many terminals open -> information overload
Log data stream - decision
●Cloud based solutions and install yourself solutions
●Paying solutions and free solutions
●Decision based on technical team skillset and security of your data
●Cloud-based: Loggly, Splunk and many more
●Install yourself: ELK, Splunk
Log data stream - ELK stack
●ELK (ElasticSearch, LogStash, Kibana) all open source
●ElasticSearch indexes log messages for easier searching
●Logstash can filter, manipulate and transform messages
●Kibana is visualization tool with filtering capabilities
●Filebeat is sending log mesages from instances
Combine logs and metrics
●It is much easier to look at graphs than logs
●Good metric coverage can pinpoint problems
●Usually we need log messages to bring context
●Grafana can combine InfluxDB (measurement data store) and
ElasticSearch (log index)
Real world example
●Provide SLA for 99.999% request that latency will be below some
threshold
●3 node application with 12 node Cassandra cluster
●Lot of metrics transferred to metrics machine
●Whole infrastructure deployed to AWS
●We needed fine grained diagnostics for queries to database both on
cluster and application level among other things
Real world example
Alerting
●Alerting is giving you freedom not to look graphs
●Someone else placed domain knowledge about alerts
●Alerting must not be frequent since you will end up ignoring alerts
Distributed system -> many alerts -> silently ignoring alerts
SMART Alerting
●Have more context when anomaly happens
●Have snapshot of the system at moment something happened
●Let system predict cause of malfunction
Conclusion
●Have right amount of information, not too many, not too few
●Having good selection of metrics and logs is iterative process
●Do not end up fixing monitoring machine instead of fixing application
code (especially in distributed world)
●Be proactive, not reactive
●Tailor metrics by your needs, build tools if there are not any that
suite your use case
Links
●Monitoring stack for distributed systems - SmartCat blog post
●Distributed logging - SmartCat blog post
●Metrics collection stack for distributed systems - SmartCat blog post
●Monitoring machine ansible project (Riemann, Influx, Grafana, ELK)
- SmartCat github project
Nenad Bozic, Co-founder and Technical Consultant @ SmartCat
Q&A