anomaly detection and root cause analysis in distributed application transactions
TRANSCRIPT
![Page 1: Anomaly detection and root cause analysis in distributed application transactions](https://reader030.vdocuments.us/reader030/viewer/2022020203/58f9b9781a28ab71488b45c7/html5/thumbnails/1.jpg)
Anomaly Detection and Root Cause Analysis in Distributed Application Transactions
Yuchen Zhao @
![Page 2: Anomaly detection and root cause analysis in distributed application transactions](https://reader030.vdocuments.us/reader030/viewer/2022020203/58f9b9781a28ab71488b45c7/html5/thumbnails/2.jpg)
Software is Eating the World
![Page 3: Anomaly detection and root cause analysis in distributed application transactions](https://reader030.vdocuments.us/reader030/viewer/2022020203/58f9b9781a28ab71488b45c7/html5/thumbnails/3.jpg)
![Page 4: Anomaly detection and root cause analysis in distributed application transactions](https://reader030.vdocuments.us/reader030/viewer/2022020203/58f9b9781a28ab71488b45c7/html5/thumbnails/4.jpg)
![Page 5: Anomaly detection and root cause analysis in distributed application transactions](https://reader030.vdocuments.us/reader030/viewer/2022020203/58f9b9781a28ab71488b45c7/html5/thumbnails/5.jpg)
It’s critical to make sure the software
is running properly
![Page 6: Anomaly detection and root cause analysis in distributed application transactions](https://reader030.vdocuments.us/reader030/viewer/2022020203/58f9b9781a28ab71488b45c7/html5/thumbnails/6.jpg)
How? Through monitoring!
![Page 7: Anomaly detection and root cause analysis in distributed application transactions](https://reader030.vdocuments.us/reader030/viewer/2022020203/58f9b9781a28ab71488b45c7/html5/thumbnails/7.jpg)
Monitoring shouldn’t be very hard… right?
![Page 8: Anomaly detection and root cause analysis in distributed application transactions](https://reader030.vdocuments.us/reader030/viewer/2022020203/58f9b9781a28ab71488b45c7/html5/thumbnails/8.jpg)
Well, it can become a bit more complex...
![Page 9: Anomaly detection and root cause analysis in distributed application transactions](https://reader030.vdocuments.us/reader030/viewer/2022020203/58f9b9781a28ab71488b45c7/html5/thumbnails/9.jpg)
Or… really complex...
![Page 10: Anomaly detection and root cause analysis in distributed application transactions](https://reader030.vdocuments.us/reader030/viewer/2022020203/58f9b9781a28ab71488b45c7/html5/thumbnails/10.jpg)
![Page 11: Anomaly detection and root cause analysis in distributed application transactions](https://reader030.vdocuments.us/reader030/viewer/2022020203/58f9b9781a28ab71488b45c7/html5/thumbnails/11.jpg)
![Page 12: Anomaly detection and root cause analysis in distributed application transactions](https://reader030.vdocuments.us/reader030/viewer/2022020203/58f9b9781a28ab71488b45c7/html5/thumbnails/12.jpg)
Keeping applications running is hard.
![Page 13: Anomaly detection and root cause analysis in distributed application transactions](https://reader030.vdocuments.us/reader030/viewer/2022020203/58f9b9781a28ab71488b45c7/html5/thumbnails/13.jpg)
![Page 14: Anomaly detection and root cause analysis in distributed application transactions](https://reader030.vdocuments.us/reader030/viewer/2022020203/58f9b9781a28ab71488b45c7/html5/thumbnails/14.jpg)
Challenge 1: Enterprise applications are complex
![Page 15: Anomaly detection and root cause analysis in distributed application transactions](https://reader030.vdocuments.us/reader030/viewer/2022020203/58f9b9781a28ab71488b45c7/html5/thumbnails/15.jpg)
Challenge 2: Data is heterogeneous.
Its volume is massive and growing
![Page 16: Anomaly detection and root cause analysis in distributed application transactions](https://reader030.vdocuments.us/reader030/viewer/2022020203/58f9b9781a28ab71488b45c7/html5/thumbnails/16.jpg)
Challenge 3: Too many signals.
Finding anomalies & root causes is non-trivial.
![Page 17: Anomaly detection and root cause analysis in distributed application transactions](https://reader030.vdocuments.us/reader030/viewer/2022020203/58f9b9781a28ab71488b45c7/html5/thumbnails/17.jpg)
Our solution: Relevant Fields
Machine Learning + Engineering
![Page 18: Anomaly detection and root cause analysis in distributed application transactions](https://reader030.vdocuments.us/reader030/viewer/2022020203/58f9b9781a28ab71488b45c7/html5/thumbnails/18.jpg)
Q1: How to get & organize data?
![Page 19: Anomaly detection and root cause analysis in distributed application transactions](https://reader030.vdocuments.us/reader030/viewer/2022020203/58f9b9781a28ab71488b45c7/html5/thumbnails/19.jpg)
Collect data in the form of Business Transactions
![Page 20: Anomaly detection and root cause analysis in distributed application transactions](https://reader030.vdocuments.us/reader030/viewer/2022020203/58f9b9781a28ab71488b45c7/html5/thumbnails/20.jpg)
![Page 21: Anomaly detection and root cause analysis in distributed application transactions](https://reader030.vdocuments.us/reader030/viewer/2022020203/58f9b9781a28ab71488b45c7/html5/thumbnails/21.jpg)
Q2: Can you give a real use case?
![Page 22: Anomaly detection and root cause analysis in distributed application transactions](https://reader030.vdocuments.us/reader030/viewer/2022020203/58f9b9781a28ab71488b45c7/html5/thumbnails/22.jpg)
A hypothetical travel booking site with data organized as Business Transactions (BT)
![Page 23: Anomaly detection and root cause analysis in distributed application transactions](https://reader030.vdocuments.us/reader030/viewer/2022020203/58f9b9781a28ab71488b45c7/html5/thumbnails/23.jpg)
An unexpected incident:
![Page 24: Anomaly detection and root cause analysis in distributed application transactions](https://reader030.vdocuments.us/reader030/viewer/2022020203/58f9b9781a28ab71488b45c7/html5/thumbnails/24.jpg)
![Page 25: Anomaly detection and root cause analysis in distributed application transactions](https://reader030.vdocuments.us/reader030/viewer/2022020203/58f9b9781a28ab71488b45c7/html5/thumbnails/25.jpg)
Step 1: filtering
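A minimal sketch of what the filtering step could look like, assuming each transaction carries a response time and an error flag. The `Transaction` record, its field names, and the 2000 ms threshold are hypothetical and do not come from the slides.

```python
from dataclasses import dataclass


@dataclass
class Transaction:
    """Hypothetical record for one business transaction."""
    response_ms: float
    is_error: bool
    fields: dict


def select_query_set(transactions, slow_threshold_ms=2000.0):
    """Keep only the error or slow transactions (the set to investigate)."""
    return [
        t for t in transactions
        if t.is_error or t.response_ms > slow_threshold_ms
    ]


transactions = [
    Transaction(120.0, False, {"airline": "UA"}),   # healthy
    Transaction(5400.0, False, {"airline": "AA"}),  # slow
    Transaction(90.0, True, {"airline": "AA"}),     # fast but failed
]
query_set = select_query_set(transactions)
print(len(query_set))
```

A fixed threshold is only the simplest choice here; any rule that separates problem transactions from healthy ones would fit the same step.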
![Page 26: Anomaly detection and root cause analysis in distributed application transactions](https://reader030.vdocuments.us/reader030/viewer/2022020203/58f9b9781a28ab71488b45c7/html5/thumbnails/26.jpg)
Step 2: find relevant fields
![Page 27: Anomaly detection and root cause analysis in distributed application transactions](https://reader030.vdocuments.us/reader030/viewer/2022020203/58f9b9781a28ab71488b45c7/html5/thumbnails/27.jpg)
the relevancy score
“airline:AA” related transactions:
● 2% occurrence normally among all travel bookings
● 82% of the current slow transactions are from “AA”.
● 41 times more significant than normal.
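The three numbers on this slide are consistent with reading the score as a simple ratio of frequencies. The notation below is an assumption made for illustration, not the slide's own:

```latex
\text{relevancy}(\text{airline:AA})
  = \frac{P(\text{airline:AA} \mid \text{slow transactions})}
         {P(\text{airline:AA} \mid \text{all travel bookings})}
  = \frac{0.82}{0.02}
  = 41
```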
![Page 28: Anomaly detection and root cause analysis in distributed application transactions](https://reader030.vdocuments.us/reader030/viewer/2022020203/58f9b9781a28ab71488b45c7/html5/thumbnails/28.jpg)
What’s the root cause?
![Page 29: Anomaly detection and root cause analysis in distributed application transactions](https://reader030.vdocuments.us/reader030/viewer/2022020203/58f9b9781a28ab71488b45c7/html5/thumbnails/29.jpg)
Step 3: take actions!
![Page 30: Anomaly detection and root cause analysis in distributed application transactions](https://reader030.vdocuments.us/reader030/viewer/2022020203/58f9b9781a28ab71488b45c7/html5/thumbnails/30.jpg)
Q3: What’s the design of the system?
![Page 31: Anomaly detection and root cause analysis in distributed application transactions](https://reader030.vdocuments.us/reader030/viewer/2022020203/58f9b9781a28ab71488b45c7/html5/thumbnails/31.jpg)
Architecture Overview
![Page 32: Anomaly detection and root cause analysis in distributed application transactions](https://reader030.vdocuments.us/reader030/viewer/2022020203/58f9b9781a28ab71488b45c7/html5/thumbnails/32.jpg)
Data Collection
![Page 33: Anomaly detection and root cause analysis in distributed application transactions](https://reader030.vdocuments.us/reader030/viewer/2022020203/58f9b9781a28ab71488b45c7/html5/thumbnails/33.jpg)
Smart Code Instrumentation
watch every line of code, self-learning, automatic
![Page 34: Anomaly detection and root cause analysis in distributed application transactions](https://reader030.vdocuments.us/reader030/viewer/2022020203/58f9b9781a28ab71488b45c7/html5/thumbnails/34.jpg)
Stream Processing & Storage
![Page 35: Anomaly detection and root cause analysis in distributed application transactions](https://reader030.vdocuments.us/reader030/viewer/2022020203/58f9b9781a28ab71488b45c7/html5/thumbnails/35.jpg)
Relevant Fields Processing
![Page 36: Anomaly detection and root cause analysis in distributed application transactions](https://reader030.vdocuments.us/reader030/viewer/2022020203/58f9b9781a28ab71488b45c7/html5/thumbnails/36.jpg)
Baseline & Query Sets
all transactions (baseline)
error/slow transactions (query)
![Page 37: Anomaly detection and root cause analysis in distributed application transactions](https://reader030.vdocuments.us/reader030/viewer/2022020203/58f9b9781a28ab71488b45c7/html5/thumbnails/37.jpg)
Q4: How to score the field?
![Page 38: Anomaly detection and root cause analysis in distributed application transactions](https://reader030.vdocuments.us/reader030/viewer/2022020203/58f9b9781a28ab71488b45c7/html5/thumbnails/38.jpg)
all transactions (baseline)
error/slow transactions (query)
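One way to score fields from these two sets, sketched under the assumption that the score is the frequency of a field value in the query set divided by its frequency in the baseline set. The function names and the toy data are hypothetical; the numbers are chosen to mirror the airline example from the use case.

```python
from collections import Counter


def field_value_frequencies(transactions):
    """Fraction of transactions in which each (field, value) pair occurs."""
    counts = Counter()
    for fields in transactions:
        counts.update(fields.items())
    total = len(transactions)
    return {pair: n / total for pair, n in counts.items()}


def relevancy_scores(baseline, query):
    """Rank (field, value) pairs by query frequency over baseline frequency."""
    base_freq = field_value_frequencies(baseline)
    query_freq = field_value_frequencies(query)
    scores = {
        pair: freq / base_freq[pair]
        for pair, freq in query_freq.items()
        if pair in base_freq  # pairs never seen in the baseline are skipped
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy data: "AA" is 2% of the baseline but 82% of the query set.
baseline = [{"airline": "AA"}] * 2 + [{"airline": "UA"}] * 98
query = [{"airline": "AA"}] * 41 + [{"airline": "UA"}] * 9

top_pair, top_score = relevancy_scores(baseline, query)[0]
print(top_pair, f"{top_score:.1f}")  # ('airline', 'AA') 41.0
```

Pairs that appear only in the query set are dropped in this sketch to avoid dividing by zero; a real system would need a deliberate policy for them.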
![Page 39: Anomaly detection and root cause analysis in distributed application transactions](https://reader030.vdocuments.us/reader030/viewer/2022020203/58f9b9781a28ab71488b45c7/html5/thumbnails/39.jpg)
Optimization: Dynamic Baseline
![Page 40: Anomaly detection and root cause analysis in distributed application transactions](https://reader030.vdocuments.us/reader030/viewer/2022020203/58f9b9781a28ab71488b45c7/html5/thumbnails/40.jpg)
Infer baseline context from query automatically
[Figure labels: query transactions; transactions of Entity 1; transactions of Entity 2; transactions of Entity n]
![Page 41: Anomaly detection and root cause analysis in distributed application transactions](https://reader030.vdocuments.us/reader030/viewer/2022020203/58f9b9781a28ab71488b45c7/html5/thumbnails/41.jpg)
The baseline entity is learned automatically along two dimensions:
● physical (applications, tiers, nodes, etc.)
● temporal
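The slides do not say how the baseline entity is learned. As one possible reading, the sketch below keeps every physical level on which all query transactions agree and adds a lookback window before the query period. The dictionary keys, the `lookback_s` parameter, and the selection rule itself are assumptions for illustration.

```python
def infer_baseline(all_transactions, query, lookback_s=3600):
    """Pick baseline transactions that share the query's inferred context."""
    # Physical dimension: keep each level on which the whole query set agrees.
    shared = {}
    for level in ("application", "tier", "node"):
        values = {t[level] for t in query}
        if len(values) == 1:
            shared[level] = values.pop()

    # Temporal dimension: the query period plus a lookback window before it.
    start = min(t["ts"] for t in query) - lookback_s
    end = max(t["ts"] for t in query)

    return [
        t for t in all_transactions
        if start <= t["ts"] <= end
        and all(t[level] == value for level, value in shared.items())
    ]


def tx(application, tier, node, ts):
    """Build a hypothetical transaction record."""
    return {"application": application, "tier": tier, "node": node, "ts": ts}


query = [tx("booking", "web", "n1", 5000), tx("booking", "web", "n2", 5100)]
all_transactions = query + [
    tx("booking", "web", "n3", 2000),  # same app and tier, inside the window
    tx("booking", "db", "n4", 4000),   # different tier
    tx("booking", "web", "n1", 100),   # too old
    tx("search", "web", "n5", 4500),   # different application
]
baseline = infer_baseline(all_transactions, query)
print(len(baseline))
```

Here the query transactions agree on application and tier but not on node, so the inferred entity is the "web" tier of "booking" rather than a single node.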
![Page 42: Anomaly detection and root cause analysis in distributed application transactions](https://reader030.vdocuments.us/reader030/viewer/2022020203/58f9b9781a28ab71488b45c7/html5/thumbnails/42.jpg)
Score Normalization
Normalize the score using a function derived from
sigmoid:
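The exact function is not given in this transcript. Purely as an illustrative assumption, the sketch below applies the standard sigmoid to the logarithm of a raw ratio score, which maps scores in (0, ∞) into (0, 1) and sends a ratio of 1 (no change from baseline) to 0.5. The name `normalize_score` is hypothetical.

```python
import math


def normalize_score(raw_score):
    """Map a non-negative raw ratio score into [0, 1).

    Assumed form: sigmoid(ln(raw_score)), which simplifies algebraically
    to raw_score / (raw_score + 1). This is not the slide's formula.
    """
    if raw_score <= 0:
        return 0.0
    return 1.0 / (1.0 + math.exp(-math.log(raw_score)))


for raw in (0.5, 1.0, 41.0):
    print(raw, round(normalize_score(raw), 3))
```

With this choice, a field value that is no more common in the query set than in the baseline lands at the midpoint, and very large ratios saturate just below 1 instead of growing without bound.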
![Page 43: Anomaly detection and root cause analysis in distributed application transactions](https://reader030.vdocuments.us/reader030/viewer/2022020203/58f9b9781a28ab71488b45c7/html5/thumbnails/43.jpg)
Score Example
![Page 44: Anomaly detection and root cause analysis in distributed application transactions](https://reader030.vdocuments.us/reader030/viewer/2022020203/58f9b9781a28ab71488b45c7/html5/thumbnails/44.jpg)
For more details, please check out our demo paper in ICDM 2015:
Discovering Anomalies and Root Causes in Applications via Relevant Fields Analysis,
in Proceedings of the 15th IEEE International Conference on Data Mining
![Page 45: Anomaly detection and root cause analysis in distributed application transactions](https://reader030.vdocuments.us/reader030/viewer/2022020203/58f9b9781a28ab71488b45c7/html5/thumbnails/45.jpg)
Ongoing work...
Support rich data types
● time series
● text
● graphs
● ...
![Page 46: Anomaly detection and root cause analysis in distributed application transactions](https://reader030.vdocuments.us/reader030/viewer/2022020203/58f9b9781a28ab71488b45c7/html5/thumbnails/46.jpg)
We’re selling!
![Page 48: Anomaly detection and root cause analysis in distributed application transactions](https://reader030.vdocuments.us/reader030/viewer/2022020203/58f9b9781a28ab71488b45c7/html5/thumbnails/48.jpg)
Thank you!