challenges of monitoring distributed systems - nenad bozic

Nenad Bozic, Co-founder and Technical Consultant @ SmartCat

Challenges of Monitoring Distributed Systems

Agenda

●Monitoring 101

●Metric data stream and tools

●Log data stream and tools

●Combine metrics and logs for full control

●Real world example

●SMART alerting

Monitoring 101

●Metrics data stream contains real-time data about application

(system components) performances

●Log data stream contains information about different activities and

events within the system

●Alerting is giving you the freedom not to watch graphs and logs all

the time

Metric data stream

●Metrics are indicators that everything is working within expected

boundaries

●Good dashboard has enough information (not too many, not to few)

●Metrics are indicator that something happened and logs provide

context what happened

Distributed system -> many graphs to watch -> information overload

Metric data stream - decision

●Cloud based solutions vs install yourself solutions

●Paying solutions vs free solutions

●Decision based on technical team skillset and security of your data

●Cloud-based: AppDynamics, AWS CloudWatch, DataDog

●Install yourself: Riemann, TICK stack

Metric data stream - stack

●Riemann as sinc that handles events and sends them to Riemann

server

●InfluxDB as NoSQL store which is build for measurements

●Grafana as visualization tool (flexible configurable graphs from

many data sources)

●Custom tools to use Riemann agent and send measurements to

Riemann server

Log data stream

●Log monitoring on single machine require skill and knowledge

●Same challenges as with metrics (not too many, not to few)

●Metrics are indicator that something happened and logs provide

context what happened

Distributed system -> many terminals open -> information overload

Log data stream - decision

●Cloud based solutions and install yourself solutions

●Paying solutions and free solutions

●Decision based on technical team skillset and security of your data

●Cloud-based: Loggly, Splunk and many more

●Install yourself: ELK, Splunk

Log data stream - ELK stack

●ELK (ElasticSearch, LogStash, Kibana) all open source

●ElasticSearch indexes log messages for easier searching

●Logstash can filter, manipulate and transform messages

●Kibana is visualization tool with filtering capabilities

●Filebeat is sending log mesages from instances

Combine logs and metrics

●It is much easier to look at graphs than logs

●Good metric coverage can pinpoint problems

●Usually we need log messages to bring context

●Grafana can combine InfluxDB (measurement data store) and

ElasticSearch (log index)

Real world example

●Provide SLA for 99.999% request that latency will be below some

threshold

●3 node application with 12 node Cassandra cluster

●Lot of metrics transferred to metrics machine

●Whole infrastructure deployed to AWS

●We needed fine grained diagnostics for queries to database both on

cluster and application level among other things

Real world example

Alerting

●Alerting is giving you freedom not to look graphs

●Someone else placed domain knowledge about alerts

●Alerting must not be frequent since you will end up ignoring alerts

Distributed system -> many alerts -> silently ignoring alerts

SMART Alerting

●Have more context when anomaly happens

●Have snapshot of the system at moment something happened

●Let system predict cause of malfunction

Conclusion

●Have right amount of information, not too many, not too few

●Having good selection of metrics and logs is iterative process

●Do not end up fixing monitoring machine instead of fixing application

code (especially in distributed world)

●Be proactive, not reactive

●Tailor metrics by your needs, build tools if there are not any that

suite your use case

Links

●Monitoring stack for distributed systems - SmartCat blog post

●Distributed logging - SmartCat blog post

●Metrics collection stack for distributed systems - SmartCat blog post

●Monitoring machine ansible project (Riemann, Influx, Grafana, ELK)

- SmartCat github project

https://www.smartcat.io/blog/2016/monitoring-stack-for-distributed-systems/

https://www.smartcat.io/blog/2016/distributed-logging/

https://www.smartcat.io/blog/2016/metric-collection-stack-for-distributed-systems/

https://github.com/smartcat-labs/smartcat-ops/tree/master/projects/monitor

Nenad Bozic, Co-founder and Technical Consultant @ SmartCat

Q&A

challenges of monitoring distributed systems - nenad bozic

Data & Analytics