© 2018 SPLUNK INC.
How to Move From Monitoring to Observability
Observability: the disingenuous rebranding of monitoring?
Dr. Siyka Andreeva | IT Operations Analytics Specialist
April 2019
Forward Looking Statements
During the course of this presentation, we may make forward-looking statements regarding future events or
the expected performance of the company. We caution you that such statements reflect our current
expectations and estimates based on factors currently known to us and that actual events or results could
differ materially. For important factors that may cause actual results to differ from those contained in our
forward-looking statements, please review our filings with the SEC.
The forward-looking statements made in this presentation are being made as of the time and date of its live
presentation. If reviewed after its live presentation, this presentation may not contain current or accurate
information. We do not assume any obligation to update any forward-looking statements we may make. In
addition, any information about our roadmap outlines our general product direction and is subject to change
at any time without notice. It is for informational purposes only and shall not be incorporated into any contract
or other commitment. Splunk undertakes no obligation either to develop the features or functionality
described or to include any such feature or functionality in a future release.
Splunk, Splunk>, Listen to Your Data, The Engine for Machine Data, Splunk Cloud, Splunk Light and SPL are trademarks and registered trademarks of Splunk Inc. in the United States and other countries. All other
brand names, product names, or trademarks belong to their respective owners. © 2017 Splunk Inc. All rights reserved.
Agenda
What is observability, and how does it differ from monitoring?
Why is observability an even bigger challenge in a multi-cloud and containerized world?
How can Splunk help?
What is Observability?
the disingenuous rebranding of monitoring?
monitoring on steroids?
DevOpsifying monitoring?
Observability… the word is spreading because failure is shifting to application code and to system behavior in production
Why is the word spreading?
IT Operations monitoring challenges are getting worse in a distributed world:
• IT teams know that something is not working, but not exactly why it’s not working
• Repetitive, manual processes for reactive troubleshooting
• Inability to get to root cause quickly
• Siloed analysis of logs, traces, and metrics
Management expectations:
• Avoid the financial impact of system outages by having fewer of them
• Accelerate investigation of application performance and system incidents with real-time log and metric analysis
• Consolidate operational tools and/or external services into one observability tool
• Improve collaboration across teams with targeted alerting and tailored visualization
Same for Dev teams:
• Gap between perception and reality
• Dev teams spend too much time observing the dev and pre-prod environments instead of prod
Why observability (in IT)?
Survivorship bias or survival bias is the logical error of concentrating on the people or things that made it past some selection process and overlooking those that did not, typically because of their lack of visibility. This can lead to false conclusions in several different ways.
Source: Wikipedia
Shot-down aircraft don’t externalize their state
What is observability?
Ask 5 IT experts and you might get 9 different answers
Observability background
“In control theory, observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs. The observability and controllability of a system are mathematical duals.”
Source: Wikipedia
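To make the control-theory definition concrete, here is a minimal sketch of the standard rank test for a linear system x[k+1] = A·x[k], y[k] = C·x[k]: the system is observable when the stacked matrix [C; CA; …; CA^(n-1)] has rank n. The function names and the position/velocity example are illustrative assumptions, not part of the original slides.

```python
from fractions import Fraction


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def rank(matrix):
    """Rank via Gaussian elimination over exact fractions."""
    rows = [[Fraction(v) for v in row] for row in matrix]
    r = 0
    for col in range(len(rows[0])):
        # Find a row at or below r with a non-zero entry in this column.
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            factor = rows[i][col] / rows[r][col]
            rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
        if r == len(rows):
            break
    return r


def is_observable(A, C):
    """True if the internal state can be inferred from the outputs (rank test)."""
    n = len(A)
    block, stacked = C, []
    for _ in range(n):
        stacked.extend(block)      # append C, then CA, then CA^2, ...
        block = matmul(block, A)
    return rank(stacked) == n


# Hypothetical system: state = (position, velocity), position += velocity each step.
A = [[1, 1], [0, 1]]
print(is_observable(A, [[1, 0]]))  # output is the position only
print(is_observable(A, [[0, 1]]))  # output is the velocity only
```

Measuring only the position still lets you infer the velocity over time, whereas measuring only the velocity never reveals the position: the second system does not externalize enough of its state.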
What is observability? (in IT, not only for serverless or DevOps)
“Focus on what you can’t see, the unknowns. If the root cause of a failure stays invisible (the bullet holes), your IT-plane will be shot down again”
From monitoring to observability: what’s the difference?
Mostly from
  MONITORING: Ops
  OBSERVABILITY: DevOps/SRE
It is
  MONITORING: a verb, something you do to determine the state of an application, a system, a service… detect problems and anomalies, find the root cause of problems & gain insights into performance
  OBSERVABILITY: a noun, a thing you have, a property of a system
Tells you
  MONITORING: IF the system works
  OBSERVABILITY: WHY the system is not working as expected
Use it when
  MONITORING: you are looking for the overall health of systems, when the systems can externalize their state
  OBSERVABILITY: monitoring falls short because a system or an app doesn’t adequately externalize its state (you need granular insights into the behaviour of systems along with rich context)
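The IF versus WHY distinction can be sketched in a few lines. This is a hedged illustration only: `health_check`, `build_event`, `error_breakdown` and every field name are hypothetical and do not come from the slides or from any Splunk API.

```python
import time


def health_check(error_rate, threshold=0.05):
    """Monitoring: answers IF the service is healthy against a predefined threshold."""
    return error_rate < threshold


def build_event(service, route, status, duration_ms, **context):
    """Observability: one context-rich event per request, kept for later questions."""
    return {"ts": time.time(), "service": service, "route": route,
            "status": status, "duration_ms": duration_ms, **context}


def error_breakdown(events, key):
    """Ask WHY after the fact: count failing requests grouped by any context field."""
    counts = {}
    for event in events:
        if event["status"] >= 500:
            value = event.get(key, "unknown")
            counts[value] = counts.get(value, 0) + 1
    return counts


events = [
    build_event("billing", "/pay", 200, 31, build="v41", region="eu"),
    build_event("billing", "/pay", 500, 912, build="v42", region="eu"),
    build_event("billing", "/pay", 500, 875, build="v42", region="us"),
    build_event("billing", "/pay", 200, 28, build="v41", region="us"),
]
failed = sum(1 for e in events if e["status"] >= 500)
print(health_check(failed / len(events)))   # the check only says the service is unhealthy
print(error_breakdown(events, "build"))     # the events point at the failing build
print(error_breakdown(events, "region"))    # and show the region is not the cause
```

The health check needs its question decided in advance; the events let you group by a field nobody planned to look at.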
From monitoring to the three (only three?) pillars of Observability
Inspired by @copyconstruct
Monitoring (all the time, passive, Ops): symptoms (what’s broken?)
• Alerting
• Service health overview
• Investigation
Observability (on the fly, reactive, Dev): causes (why?)
• Debugging
• Profiling (system behavior)
• Dependency analysis (distributed systems tracing infrastructure)
Pillars (labelled Pillar A to Pillar D on the slide): LOGS, METRICS, TRACES, plus Events and Profiles
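As a rough sketch of how the logs, metrics and traces pillars relate, the toy collector below records all three for one request and ties the log line and the span together with a shared trace id. `Telemetry` and `handle_payment` are hypothetical names invented for this illustration; this is not how any specific product stores these signals.

```python
import time
import uuid
from collections import Counter
from contextlib import contextmanager


class Telemetry:
    """Toy in-memory collector for the three signal types (illustrative only)."""

    def __init__(self):
        self.logs, self.metrics, self.spans = [], Counter(), []

    def log(self, trace_id, message, **fields):
        # Log: a discrete, detailed record of one thing that happened.
        self.logs.append({"trace_id": trace_id, "message": message, **fields})

    def count(self, name, value=1):
        # Metric: a cheap aggregate, good for alerting and trends.
        self.metrics[name] += value

    @contextmanager
    def span(self, trace_id, name):
        # Trace span: how long one step of a request took.
        start = time.perf_counter()
        try:
            yield
        finally:
            self.spans.append({"trace_id": trace_id, "name": name,
                               "duration_ms": (time.perf_counter() - start) * 1000})


def handle_payment(telemetry, amount):
    trace_id = uuid.uuid4().hex
    with telemetry.span(trace_id, "payment"):
        telemetry.count("payment.requests")
        if amount <= 0:
            telemetry.count("payment.errors")
            telemetry.log(trace_id, "rejected payment", amount=amount)
            return trace_id, False
        telemetry.log(trace_id, "payment accepted", amount=amount)
        return trace_id, True


telemetry = Telemetry()
trace_id, ok = handle_payment(telemetry, -5)
# The metric says something failed; the trace id leads to the log line and the span.
print(telemetry.metrics["payment.errors"])
print([entry for entry in telemetry.logs if entry["trace_id"] == trace_id])
```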
Why is that important in a multi-cloud environment? 2019 trends
[Figure: monolithic architecture (User, Driver, API; business logic: billing, driver mgmt, user mgmt, payment, notification, trip mgmt) versus microservices architecture (User and Driver behind an API gateway; one container each for user mgmt, billing, notification, payment, driver mgmt, trip mgmt)]
[Figure: multi-cloud: microservices, business intelligence, legacy systems, frontend, storage, compute, security, ?]
[Figure: bare metal (hardware, OS, libraries, app); virtual machines (hardware, hypervisor, then OS/lib/app per VM); containers (hardware, OS, container manager, then lib/app per container); serverless functions (hardware, OS, libraries, app manager, apps)]
Containers / Kubernetes / Serverless
Observability in the distributed (and ephemeral) systems/cloud space is non-negotiable
Distributed location / responsibilities
Distributed systems/code
Complexity Is Everywhere
ON PREMISES: legacy systems (mainframe…), facilities, dev/pre-prod, storage, backup, archive, DR, security, VMs, containers, microservices
AWS (Application 1): access / security, database, storage, dev, compute, containers
GCP (Big Data project 1): Dataflow, App Engine
AWS (Archive)
Azure (Application 1): VMs, database, VM sets, traffic manager
SAAS
EVENTS, LOGS & REPORTS: Elastic Load Balancing access logs, Amazon CloudFront access logs, Amazon CloudTrail logs, billing reports, application logs, S3 access logs, other service logs, AWS Config snapshots & history files
METRICS: EMR cluster, Auto Scaling
EVENTS, LOGS, RULES/EVENTS
Push path (via Splunk HEC)
Your IT team
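For the push path via Splunk HEC (HTTP Event Collector), a minimal sketch of building one request follows. The host name, token and event content are placeholders, port 8088 is the usual HEC default but may differ in your deployment, and the request is only built here, not sent.

```python
import json
import urllib.request


def build_hec_request(host, token, event, sourcetype="_json", port=8088):
    """Build (but do not send) an HTTP Event Collector request for one event."""
    payload = {"event": event, "sourcetype": sourcetype}
    return urllib.request.Request(
        url=f"https://{host}:{port}/services/collector/event",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Splunk {token}",   # HEC token, placeholder value below
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_hec_request(
    host="splunk.example.com",                       # placeholder host
    token="00000000-0000-0000-0000-000000000000",    # placeholder token
    event={"service": "billing", "status": 500, "message": "card declined"},
)
print(request.full_url)
print(request.data.decode("utf-8"))
# To actually send it: urllib.request.urlopen(request)
```

The event body is sent as-is, which matches the slide's point that the data can be pushed in its original form rather than being restructured first.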
What happens when we stack them? How does this apply to you and your Ops teams?
Customer experience???
[Figure: the same environments stacked on top of each other: on premises, AWS (Application 1), GCP (Big Data project 1), AWS (Archive), Azure (Application 1), SaaS]
The consequence: only green lights in the war room
Customer experience???
[Figure: the same stacked environments, with CxO, BLO, CISO, Dev, SysAdmin and MKT each left with question marks]
Splunk for IT Operations
How do we help with Observability everywhere?
A market leader
IT OPERATIONS
OBSERVE: ITOM (IT operations management). Tools to manage provisioning, capacity, performance and availability of IT
DECIDE: ITOA (IT operations analytics). Practice of monitoring systems, and gathering, processing, analyzing & interpreting data from ITOps sources to guide decisions & predict issues
ACCELERATE: AIOps. AIOps platforms enhance IT operations through greater insights by combining big data, machine learning and visualization.
SECURITY
PROTECT: SIEM (security information and event management)
Rankings shown on the slide: #1 and #2
Sources: IDC and/or Gartner
We reached the limits of the traditional approach
Traditional data types: not future proof, complex, never change!
Untapped IT-generated machine data (logs, metrics, wire data…)
• Machine data is messy and unpredictable
• Requires massive scale
• You don’t always know which questions to ask
80%
Industry Leading Platform For Machine Data
DATA (not consumable by humans): any amount, any location, any source
Sources: online services, networks, security, call detail records, web services, telecoms, web clickstreams, tracing, online shopping cart, smartphones and devices, custom applications, energy meters, storage, public cloud, private cloud, containers, on-premises servers, GPS location, RFID, packaged applications, databases, messaging, firewall
Data types: logs, wire data, DB, mobile, IoT, API, metrics
• No need to “adapt or structure” the data
• No database
• No need to filter data
Consumable by humans:
SPLUNKBASE: 1600+ free apps/add-ons
SPLUNK PLATFORM: custom dashboards, report & analyze, monitor and alert, developer platform, ad hoc search, on-prem or cloud
PREMIUM APPS (“data scientist in a box”): IT Ops, DevOps, Security, Business Analytics, IoT
Different people asking different questions on the same data, in real time
3rd party: Phantom (orchestration), VictorOps (collaboration), CMDB, SNOW…, data lake, APM, traces, APM tracing
Structuring machine data = fighting a losing battle
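One way to picture the alternative to structuring data up front is schema on read: keep the raw events untouched and derive fields only when a question is asked. The sketch below is a simplified illustration of that idea, not a description of how Splunk implements it; the sample events, the `key=value` convention and the function names are assumptions.

```python
import re

RAW_EVENTS = [
    '2019-04-02T10:15:01Z level=ERROR service=billing msg="card declined" latency_ms=412',
    '2019-04-02T10:15:02Z level=INFO service=trip user=42 latency_ms=35',
    'kernel: eth0 link down',  # no key=value pairs at all, still stored as-is
]

# key=value, where the value is either a quoted string or a run of non-spaces.
FIELD = re.compile(r'(\w+)=("[^"]*"|\S+)')


def extract_fields(raw):
    """Derive fields from the raw text at query time; nothing is fixed at ingest."""
    return {key: value.strip('"') for key, value in FIELD.findall(raw)}


def search(raw_events, **criteria):
    """Return the raw events whose extracted fields match every criterion."""
    return [raw for raw in raw_events
            if all(extract_fields(raw).get(k) == v for k, v in criteria.items())]


print(search(RAW_EVENTS, level="ERROR"))
print(extract_fields(RAW_EVENTS[1]))
```

Because nothing was discarded or reshaped at ingest time, a new question (for example, filtering on `user`) needs no re-ingestion, only a different query.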
How to find a needle in multiple haystacks? (choose your tool)
Network? Database? Middleware? Hardware? Wrong command? Connection? Apache? VM? Mainframe? Load balancer? Wrong code released?
Collect ALL data
• Collect from all silos
• Data in original raw format
• Add open source apps to ingest data on the fly
• Schema on the fly
• Dynamic thresholding
• Real-time correlation
Clustering & aggregation
• Real-time event clustering/correlation
• Reduce alert noise
• Behavioural analytics
• Deduplication
Add context (slide labels: Sales, SSO, Claims)
• Measure / report on indicators that matter
• Add service / business context
• Add actionable information to detection
Anomaly detection
• Catch issues that thresholds cannot
• Reduce event clutter
• Deviation from past behaviour
• Deviation from peers
• Unusual change in features
Assisted deep dive investigation
• Root cause analysis
• Powerful & easy-to-use search & investigate language
Predictive analytics
• Predict service health
• Predict events
• Trend forecasting
• Detect influencing entities
• Early warning of failure
70% to 90%: reduction in investigation time
15% to 45%: reduction in high priority incidents
67% to 82%: reduction in business impact
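The "dynamic thresholding" and "deviation from past behaviour" items above can be sketched with a trailing window: instead of one fixed limit, each point is compared with the mean and standard deviation of the points just before it. This is a deliberately simple stand-in chosen for illustration; the window size, the factor `k` and the sample values are assumptions, and real products use more elaborate models.

```python
from statistics import mean, stdev


def dynamic_threshold_anomalies(values, window=5, k=3.0):
    """Flag points more than k standard deviations away from the trailing window."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu, sigma = mean(history), stdev(history)
        # A perfectly flat history (sigma == 0) is skipped in this sketch.
        if sigma > 0 and abs(values[i] - mu) > k * sigma:
            anomalies.append((i, values[i]))
    return anomalies


# Hypothetical response times in ms; a static alert at 300 ms would never fire.
response_ms = [100, 102, 98, 101, 99, 100, 250, 101]
print(dynamic_threshold_anomalies(response_ms))
```

The spike to 250 ms is flagged because it is unusual for this series, even though it stays below a limit that would have been reasonable to configure by hand.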
Observability with Splunk
Axes: awareness / data available (known, unknown) versus understanding (knowns, unknowns)
Known Knowns (known problem & solution)
  Improve the Known-Knowns: dynamic thresholding, automation, schema on the fly, real-time dashboards…
Known Unknowns (we see the problem, not the solution)
  See the Known-Unknowns: provide auto-correlations, real-time searches, analytics, business process mining…
Unknown Knowns (didn’t realize but clear solution)
  Discover the Unknown-Knowns: anomaly detection, predictive IT…
Unknown Unknowns (no idea it’ll happen)
  Explore the Unknown-Unknowns: ingest any data, ask any question, get answers in real time…
Splunk Dashboards: Complexity made simple
• Easy to read
• Interactive, any question answered
• Real-time information
• Easily correlated with relational and reference data
Different people looking at different dashboards on the same data, in real time
It’s a journey
Search & Monitor
• (Any) data collection
• Real-time monitoring/observability
• Centralized machine data search
Business Insights
• Business KPIs
• Insights to drive experience
Operational visibility
• Service-oriented view
• Root cause analysis
• Stabilize IT
Predict & Improve
• Predict issues
• Recommend actions based on prior behaviors
• Increase MTBF
Answer new questions, find new unknowns
Observe | Monitor | Analyze | Act
What to Do Today!
1. Download Splunk on Splunk.com or Try Free for 15 Days on AWS Marketplace!
2. Download our Beginner’s guide to Observability on Splunk.com
Good sources to watch
• Philipp Krenn presentation at DevOps Barcelona 2018
• A great post by Cindy Sridharan on Monitoring & Observability
• How to build observability into Serverless (O’Reilly Velocity 2018)
• Observability for the real world (by Andi Mann / datamation.com)
• The present and future of Serverless Observability (SlideShare, Serverless Computing London, Yan Cui)
• Monitoring unknown unknowns by Guy Fighel
Thank you