challenges of monitoring distributed systems - nenad bozic

25
Nenad Bozic, Co-founder and Technical Consultant @ SmartCat Challenges of Monitoring Distributed Systems

Upload: institute-of-contemporary-sciences

Post on 26-Jan-2017

28 views

Category:

Data & Analytics


5 download

TRANSCRIPT

Page 1: Challenges of Monitoring Distributed Systems - Nenad Bozic

Nenad Bozic, Co-founder and Technical Consultant @ SmartCat

Challenges of Monitoring Distributed Systems

Page 2: Challenges of Monitoring Distributed Systems - Nenad Bozic
Page 3: Challenges of Monitoring Distributed Systems - Nenad Bozic

Agenda

●Monitoring 101

●Metric data stream and tools

●Log data stream and tools

●Combine metrics and logs for full control

●Real world example

●SMART alerting

Page 4: Challenges of Monitoring Distributed Systems - Nenad Bozic

Monitoring 101

●Metrics data stream contains real-time data about application

(system components) performances

●Log data stream contains information about different activities and

events within the system

●Alerting is giving you the freedom not to watch graphs and logs all

the time

Page 5: Challenges of Monitoring Distributed Systems - Nenad Bozic
Page 6: Challenges of Monitoring Distributed Systems - Nenad Bozic

Metric data stream

●Metrics are indicators that everything is working within expected

boundaries

●Good dashboard has enough information (not too many, not to few)

●Metrics are indicator that something happened and logs provide

context what happened

Distributed system -> many graphs to watch -> information overload

Page 7: Challenges of Monitoring Distributed Systems - Nenad Bozic

Metric data stream - decision

●Cloud based solutions vs install yourself solutions

●Paying solutions vs free solutions

●Decision based on technical team skillset and security of your data

●Cloud-based: AppDynamics, AWS CloudWatch, DataDog

●Install yourself: Riemann, TICK stack

Page 8: Challenges of Monitoring Distributed Systems - Nenad Bozic

Metric data stream - stack

●Riemann as sinc that handles events and sends them to Riemann

server

●InfluxDB as NoSQL store which is build for measurements

●Grafana as visualization tool (flexible configurable graphs from

many data sources)

●Custom tools to use Riemann agent and send measurements to

Riemann server

Page 9: Challenges of Monitoring Distributed Systems - Nenad Bozic
Page 10: Challenges of Monitoring Distributed Systems - Nenad Bozic

Log data stream

●Log monitoring on single machine require skill and knowledge

●Same challenges as with metrics (not too many, not to few)

●Metrics are indicator that something happened and logs provide

context what happened

Distributed system -> many terminals open -> information overload

Page 11: Challenges of Monitoring Distributed Systems - Nenad Bozic
Page 12: Challenges of Monitoring Distributed Systems - Nenad Bozic

Log data stream - decision

●Cloud based solutions and install yourself solutions

●Paying solutions and free solutions

●Decision based on technical team skillset and security of your data

●Cloud-based: Loggly, Splunk and many more

●Install yourself: ELK, Splunk

Page 13: Challenges of Monitoring Distributed Systems - Nenad Bozic

Log data stream - ELK stack

●ELK (ElasticSearch, LogStash, Kibana) all open source

●ElasticSearch indexes log messages for easier searching

●Logstash can filter, manipulate and transform messages

●Kibana is visualization tool with filtering capabilities

●Filebeat is sending log mesages from instances

Page 14: Challenges of Monitoring Distributed Systems - Nenad Bozic
Page 15: Challenges of Monitoring Distributed Systems - Nenad Bozic

Combine logs and metrics

●It is much easier to look at graphs than logs

●Good metric coverage can pinpoint problems

●Usually we need log messages to bring context

●Grafana can combine InfluxDB (measurement data store) and

ElasticSearch (log index)

Page 16: Challenges of Monitoring Distributed Systems - Nenad Bozic
Page 17: Challenges of Monitoring Distributed Systems - Nenad Bozic

Real world example

●Provide SLA for 99.999% request that latency will be below some

threshold

●3 node application with 12 node Cassandra cluster

●Lot of metrics transferred to metrics machine

●Whole infrastructure deployed to AWS

●We needed fine grained diagnostics for queries to database both on

cluster and application level among other things

Page 18: Challenges of Monitoring Distributed Systems - Nenad Bozic

Real world example

Page 19: Challenges of Monitoring Distributed Systems - Nenad Bozic
Page 20: Challenges of Monitoring Distributed Systems - Nenad Bozic

Alerting

●Alerting is giving you freedom not to look graphs

●Someone else placed domain knowledge about alerts

●Alerting must not be frequent since you will end up ignoring alerts

Distributed system -> many alerts -> silently ignoring alerts

Page 21: Challenges of Monitoring Distributed Systems - Nenad Bozic
Page 22: Challenges of Monitoring Distributed Systems - Nenad Bozic

SMART Alerting

●Have more context when anomaly happens

●Have snapshot of the system at moment something happened

●Let system predict cause of malfunction

Page 23: Challenges of Monitoring Distributed Systems - Nenad Bozic

Conclusion

●Have right amount of information, not too many, not too few

●Having good selection of metrics and logs is iterative process

●Do not end up fixing monitoring machine instead of fixing application

code (especially in distributed world)

●Be proactive, not reactive

●Tailor metrics by your needs, build tools if there are not any that

suite your use case

Page 24: Challenges of Monitoring Distributed Systems - Nenad Bozic

Links

●Monitoring stack for distributed systems - SmartCat blog post

●Distributed logging - SmartCat blog post

●Metrics collection stack for distributed systems - SmartCat blog post

●Monitoring machine ansible project (Riemann, Influx, Grafana, ELK)

- SmartCat github project

Page 25: Challenges of Monitoring Distributed Systems - Nenad Bozic

Nenad Bozic, Co-founder and Technical Consultant @ SmartCat

Q&A