![Page 1: Machine Learning-based Anomaly Detection of Ganglia](https://reader034.vdocuments.us/reader034/viewer/2022050500/62718c2af5f028595957db5e/html5/thumbnails/1.jpg)
Machine Learning-based Anomaly Detection of Ganglia Monitoring Data in HEP Data Center
Juan CHEN (Institute of High Energy Physics, CAS)
Lu WANG (Institute of High Energy Physics, CAS)
Qingbao HU (Institute of High Energy Physics, CAS)
CHEP2019, 4 November 2019 – Adelaide, Australia
2019/11/2 IHEP-CC 1
![Page 2: Machine Learning-based Anomaly Detection of Ganglia](https://reader034.vdocuments.us/reader034/viewer/2022050500/62718c2af5f028595957db5e/html5/thumbnails/2.jpg)
Outline
1. Background
2. Motivation
3. Anomaly Detection Framework
4. Instance Analysis
5. Conclusion
![Page 3: Machine Learning-based Anomaly Detection of Ganglia](https://reader034.vdocuments.us/reader034/viewer/2022050500/62718c2af5f028595957db5e/html5/thumbnails/3.jpg)
Background
• HEP Data Center
  – 20,000 CPU slots
  – 2000+ servers
  – 20 PB disk storage
  – 10 PB tape storage
• Growth of Data
  – JUNO experiment data + LHAASO experiment data > 10 PB/year
![Page 4: Machine Learning-based Anomaly Detection of Ganglia](https://reader034.vdocuments.us/reader034/viewer/2022050500/62718c2af5f028595957db5e/html5/thumbnails/4.jpg)
Motivation
• Larger cluster scale
• Challenges of anomaly detection in large-scale storage systems
• Various abnormal situations: traffic abnormality / disk failure / operation exception / …
• Shortcomings of traditional methods (static thresholding / keyword matching)
  – require expertise in specific software systems
  – are not easy to port to other systems
  – cannot easily adapt to changes in workloads
  – cannot easily adapt to changes in hardware configurations
• Different anomaly types
  – Temporal anomaly
  – Spatial anomaly
![Page 5: Machine Learning-based Anomaly Detection of Ganglia](https://reader034.vdocuments.us/reader034/viewer/2022050500/62718c2af5f028595957db5e/html5/thumbnails/5.jpg)
Objectives
• Data-driven anomaly detection based on machine learning
• Anomaly detection framework
  • Generic and scalable
    – common function modules for anomaly detection tasks
    – custom algorithms
  • Simple and convenient
    – good interactive interface
• Anomaly detection based on the framework
![Page 6: Machine Learning-based Anomaly Detection of Ganglia](https://reader034.vdocuments.us/reader034/viewer/2022050500/62718c2af5f028595957db5e/html5/thumbnails/6.jpg)
Anomaly Detection Framework Ⅰ
• Prediction + Detection
  1. Predict from historical metrics data
  2. Compare true data with predicted data
  3. Detect by deviation
• Feature Extraction + Detection
  1. Extract time series features
  2. Detect by extracted features
• Detect directly by metrics data
• Set time intervals and node names to ignore anomalies caused by particular circumstances
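The ignore rule above could be sketched as a small filter over detected anomalies. This is only an illustration: `IgnoreRule` and `filter_anomalies` are hypothetical names, and representing an anomaly as a `(node, timestamp)` pair is an assumption, not the framework's actual data model.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class IgnoreRule:
    """Hypothetical rule: ignore anomalies on `node` within [start, end)."""
    node: str
    start: datetime
    end: datetime

    def matches(self, node, timestamp):
        return node == self.node and self.start <= timestamp < self.end


def filter_anomalies(anomalies, rules):
    """Drop (node, timestamp) anomalies that fall under any ignore rule."""
    return [
        (node, ts)
        for node, ts in anomalies
        if not any(rule.matches(node, ts) for rule in rules)
    ]


# Example: a (made-up) maintenance interval on one node.
rules = [IgnoreRule("mds01", datetime(2019, 10, 5, 0, 0), datetime(2019, 10, 5, 6, 0))]
anomalies = [
    ("mds01", datetime(2019, 10, 5, 3, 0)),  # inside the ignored interval
    ("mds01", datetime(2019, 10, 5, 9, 0)),  # same node, outside the interval
    ("mds02", datetime(2019, 10, 5, 3, 0)),  # different node
]
kept = filter_anomalies(anomalies, rules)
```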
![Page 7: Machine Learning-based Anomaly Detection of Ganglia](https://reader034.vdocuments.us/reader034/viewer/2022050500/62718c2af5f028595957db5e/html5/thumbnails/7.jpg)
Anomaly Detection Framework Ⅱ
• Elasticsearch
  – 20+ generic metrics (hundreds of millions of historical data points)
  – about 100 special metrics (billions of historical data points)
• Elasticsearch-python: Query DSL, JSON -> pandas.DataFrame
• Based on Django: urls.py, views.py, model.py, algorithm
• User Interface (.html)
  – Sample Visualization
  – Model Training
  – Prediction
  – Detection
  – Sample Marking
  – Historical Data Comparison
  – Configuration Query
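The data path from Query DSL to tabular rows might look roughly like the sketch below. The field names (`host`, `metric`, `@timestamp`, `value`) and both helper functions are assumptions made for illustration; the real index mapping is not given in the slides. The sample response only mimics the standard `hits.hits[]._source` layout of an Elasticsearch search result, so no cluster is needed to run it.

```python
def build_metric_query(node, metric, start, end, size=1000):
    """Build a Query DSL body for one metric of one node in a time range.

    Field names ("host", "metric", "@timestamp") are hypothetical.
    """
    return {
        "size": size,
        "sort": [{"@timestamp": "asc"}],
        "query": {
            "bool": {
                "filter": [
                    {"term": {"host": node}},
                    {"term": {"metric": metric}},
                    {"range": {"@timestamp": {"gte": start, "lt": end}}},
                ]
            }
        },
    }


def hits_to_rows(response):
    """Flatten a search response into one flat dict per document."""
    return [hit["_source"] for hit in response["hits"]["hits"]]


query = build_metric_query(
    "mds01.ihep.ac.cn", "cpu_idle", "2019-10-05T00:00:00", "2019-10-05T03:00:00"
)

# Made-up response in the usual search-result shape.
sample_response = {
    "hits": {
        "hits": [
            {"_source": {"@timestamp": "2019-10-05T00:00:00", "value": 97.5}},
            {"_source": {"@timestamp": "2019-10-05T00:05:00", "value": 96.8}},
        ]
    }
}
rows = hits_to_rows(sample_response)
```

Passing `rows` to `pandas.DataFrame(rows)` would then give one row per document and one column per field.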
![Page 8: Machine Learning-based Anomaly Detection of Ganglia](https://reader034.vdocuments.us/reader034/viewer/2022050500/62718c2af5f028595957db5e/html5/thumbnails/8.jpg)
Algorithm Ⅰ— Prediction
• Statistical Machine Learning
  • SMA / LR (Linear Regression) / ARIMA / EWMA
    – Specific model for each node
    – Real-time training
    – Lower complexity
• Deep Learning
  • LSTM
    – Large-scale historical data
    – Training in advance
    – Generic for homogeneous nodes
    – Higher complexity
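As a rough illustration of the two simplest per-node predictors, SMA and EWMA can each produce a one-step-ahead forecast from recent history. This is a minimal sketch, not the framework's code: the function names, the `window` and `alpha` values, and the sample series are all assumptions.

```python
def sma_predict(history, window):
    """Predict the next value as the mean of the last `window` observations."""
    recent = history[-window:]
    return sum(recent) / len(recent)


def ewma_predict(history, alpha):
    """Predict the next value as the exponentially weighted moving average.

    s_0 = x_0 and s_t = alpha * x_t + (1 - alpha) * s_{t-1}.
    """
    smoothed = history[0]
    for value in history[1:]:
        smoothed = alpha * value + (1 - alpha) * smoothed
    return smoothed


# Made-up cpu_idle samples for one node.
cpu_idle = [90.0, 92.0, 91.0, 95.0]
sma_next = sma_predict(cpu_idle, window=3)     # mean of 92, 91, 95
ewma_next = ewma_predict(cpu_idle, alpha=0.5)  # 90 -> 91 -> 91 -> 93
```

Both models are fitted on the fly from the node's own recent values, which matches the "specific model for each node, real-time training" trade-off listed above.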
![Page 9: Machine Learning-based Anomaly Detection of Ganglia](https://reader034.vdocuments.us/reader034/viewer/2022050500/62718c2af5f028595957db5e/html5/thumbnails/9.jpg)
Algorithm Ⅱ— Feature Extraction
Extract time series features from time series
default: window = 36, delta = 5 min (3 h = 36 × 5 min)
• Statistical features
  – Highest value
  – Lowest value
  – Mean
  – Variance
  – Standard deviation
  – ……
  There are about 30 kinds of statistical features
• Fitting features
  – the difference between x and the smoothed value after MA
  – the difference between x and the smoothed value after EWMA
  – ……
  The number of fitting features depends on the length of the time series
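A minimal sketch of both feature families for a single window follows. It covers only a handful of the roughly 30 statistical features, and the function names and the `span` parameter are assumptions; with the default settings a window would hold 36 samples taken 5 minutes apart.

```python
import statistics


def statistical_features(values):
    """A few summary statistics of one window of samples."""
    return {
        "max": max(values),
        "min": min(values),
        "mean": statistics.mean(values),
        "variance": statistics.pvariance(values),
        "std": statistics.pstdev(values),
    }


def moving_average(values, span):
    """Trailing moving average; early points average over the values seen so far."""
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - span + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def fitting_features(values, span=3):
    """Difference between each x and its smoothed value after MA."""
    return [x - s for x, s in zip(values, moving_average(values, span))]


window = [1.0, 2.0, 3.0, 4.0]
stats = statistical_features(window)
residuals = fitting_features(window, span=2)
```

`fitting_features` returns one value per sample, which is why the number of fitting features grows with the length of the series.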
![Page 10: Machine Learning-based Anomaly Detection of Ganglia](https://reader034.vdocuments.us/reader034/viewer/2022050500/62718c2af5f028595957db5e/html5/thumbnails/10.jpg)
Algorithm Ⅲ— DetectionⅠ
True value: $x_t$; predicted value: $y_t$
Deviation: $e_t = (x_t - y_t) / y_t$
Assumption: the deviation within the sliding time window follows a normal distribution
Threshold selection according to probability:
1. N-sigma: set parameter $k$; if $e_t < \mu - k\sigma$ or $e_t > \mu + k\sigma$, the point is detected as an anomaly
2. Q-function: $\text{anomalyscore} = 1 - Q(e_t)$, with $0.5 \le \text{anomalyscore} < 1$; set parameter score; if $e_t > \text{score}$, the point is detected as an anomaly
[Figure: deviation $e_t$ within the sliding window, with points labeled normal or anomaly]
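The two thresholding rules can be sketched as follows. Several choices here are assumptions rather than details from the slides: $\mu$ and $\sigma$ are taken as the mean and population standard deviation of the deviations in the sliding window, the Q-function argument is standardized as $|e_t - \mu| / \sigma$ (which keeps the score at 0.5 or above, matching the stated range), and the score is what gets compared against the threshold parameter.

```python
import math
import statistics


def relative_deviation(x_t, y_t):
    """e_t = (x_t - y_t) / y_t for true value x_t and predicted value y_t."""
    return (x_t - y_t) / y_t


def n_sigma_is_anomaly(e_t, window_deviations, k=3.0):
    """Flag e_t if it lies outside mu +/- k * sigma of the window."""
    mu = statistics.mean(window_deviations)
    sigma = statistics.pstdev(window_deviations)
    return e_t < mu - k * sigma or e_t > mu + k * sigma


def q_function(z):
    """Tail probability of the standard normal distribution, Q(z) = P(Z > z)."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def anomaly_score(e_t, window_deviations):
    """1 - Q(z) with z = |e_t - mu| / sigma (assumed standardization)."""
    mu = statistics.mean(window_deviations)
    sigma = statistics.pstdev(window_deviations)
    if sigma == 0:
        raise ValueError("window deviations have zero spread")
    return 1.0 - q_function(abs(e_t - mu) / sigma)


# Made-up deviations from a quiet window, then one large spike.
window = [-0.02, 0.01, 0.0, 0.02, -0.01]
spike = relative_deviation(150.0, 100.0)
flag_n_sigma = n_sigma_is_anomaly(spike, window, k=3.0)
spike_score = anomaly_score(spike, window)
flag_q = spike_score > 0.99  # 0.99 is an arbitrary example threshold
```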
![Page 11: Machine Learning-based Anomaly Detection of Ganglia](https://reader034.vdocuments.us/reader034/viewer/2022050500/62718c2af5f028595957db5e/html5/thumbnails/11.jpg)
Algorithm Ⅲ— DetectionⅡ
Isolation Forest
• Subsample
• Build trees based on cutting
  – Split randomly on features
  – Random samples
  – Each data tuple forms a leaf node
  – Limited height
• Average path length will be shorter for anomalies
• Works well in high-dimensional problems
Anomaly Score
$S(x, n) = 2^{-E(h(x)) / c(n)}$, where $E(h(x))$ is the average path length
$c(n) = 2H(n - 1) - 2(n - 1)/n$
$H(i) = \ln(i) + 0.5772156649$
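The score formula can be evaluated directly once the average path length of a point over the trees is known. The sketch below implements only the scoring step, not tree construction; the function names are illustrative, and the sub-sample size 256 in the example is an assumed value.

```python
import math

EULER_GAMMA = 0.5772156649


def harmonic(i):
    """Approximate harmonic number H(i) = ln(i) + Euler's constant."""
    return math.log(i) + EULER_GAMMA


def c(n):
    """Normalizing term c(n) = 2 H(n - 1) - 2 (n - 1) / n, defined for n >= 2."""
    if n < 2:
        raise ValueError("c(n) needs at least two samples")
    return 2.0 * harmonic(n - 1) - 2.0 * (n - 1) / n


def anomaly_score(mean_path_length, n):
    """S(x, n) = 2 ** (-E(h(x)) / c(n))."""
    return 2.0 ** (-mean_path_length / c(n))


n = 256  # assumed sub-sample size
short_path = anomaly_score(2.0, n)     # isolated quickly, score closer to 1
average_path = anomaly_score(c(n), n)  # path length equal to c(n), score 0.5
long_path = anomaly_score(20.0, n)     # hard to isolate, score closer to 0
```

A shorter average path gives a score closer to 1, which is how the "shorter for anomalies" property turns into a ranking.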
![Page 12: Machine Learning-based Anomaly Detection of Ganglia](https://reader034.vdocuments.us/reader034/viewer/2022050500/62718c2af5f028595957db5e/html5/thumbnails/12.jpg)
Algorithm Ⅳ— Performance Evaluation
• For evaluating prediction performance
  True value list: $X$; predicted value list: $Y$
  – Mean Error (ME): $\frac{1}{n}\sum_{i=1}^{n}(x_i - y_i)$
  – Mean Absolute Error (MAE): $\frac{1}{n}\sum_{i=1}^{n}|x_i - y_i|$
  – Root Mean Squared Error (RMSE): $\sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - y_i)^2}$
  – Mean Percentage Error (MPE): $\frac{1}{n}\sum_{i=1}^{n}\frac{x_i - y_i}{x_i}$
  – Mean Absolute Percentage Error (MAPE): $\frac{1}{n}\sum_{i=1}^{n}\left|\frac{x_i - y_i}{x_i}\right|$
• Anomalies are rare, so there is currently no measure of detection accuracy. We are more concerned with anomalies that cannot be detected by traditional methods.
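The five prediction metrics can be computed with a few lines of plain Python. This sketch uses the standard definitions (absolute values in MAE and MAPE, a square root in RMSE); the sample lists are made up, and MPE and MAPE assume that no true value is zero.

```python
import math


def mean_error(x, y):
    return sum(a - b for a, b in zip(x, y)) / len(x)


def mean_absolute_error(x, y):
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


def root_mean_squared_error(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


def mean_percentage_error(x, y):
    # Assumes every true value in x is non-zero.
    return sum((a - b) / a for a, b in zip(x, y)) / len(x)


def mean_absolute_percentage_error(x, y):
    # Assumes every true value in x is non-zero.
    return sum(abs((a - b) / a) for a, b in zip(x, y)) / len(x)


true_values = [100.0, 200.0, 400.0]  # X
predicted = [110.0, 190.0, 400.0]    # Y

me = mean_error(true_values, predicted)
mae = mean_absolute_error(true_values, predicted)
rmse = root_mean_squared_error(true_values, predicted)
mpe = mean_percentage_error(true_values, predicted)
mape = mean_absolute_percentage_error(true_values, predicted)
```

In this example the signed errors cancel, so ME is zero while MAE and MAPE are not, which is why the signed and absolute variants are reported side by side.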
![Page 13: Machine Learning-based Anomaly Detection of Ganglia](https://reader034.vdocuments.us/reader034/viewer/2022050500/62718c2af5f028595957db5e/html5/thumbnails/13.jpg)
User Interface
• Sample Visualization
• Sample Marking
• Model Training
• Prediction
• Detection
• Comparison
• Configuration Query
![Page 14: Machine Learning-based Anomaly Detection of Ganglia](https://reader034.vdocuments.us/reader034/viewer/2022050500/62718c2af5f028595957db5e/html5/thumbnails/14.jpg)
Instance Analysis—Lustre Metadata Servers
• Metrics
  – Network: bytes_in / bytes_out / pkts_in / pkts_out
  – CPU: cpu_idle / load_one / load_five / load_fifteen
  – Memory: mem_free / mem_buffers / mem_cached / proc_run / proc_total / swap_free
• Samples
  – Training dataset: ≈140,000 samples (all metadata servers, 2019.08.01-2019.09.30)
  – Testing dataset: ≈6,000 samples (mds01.ihep.ac.cn, 2019.10.05-2019.10.25)
  – window = 3 hours
➢ Prediction
  – Predicted with MA / EWMA / LR / LSTM
  – Compare prediction results
➢ Detection
  – Detected with temporal anomaly detection algorithms
  – Compare two methods (1. Prediction + Detection; 2. Feature Extraction + Detection)
  – Detected with spatial anomaly detection algorithm
  – Compare temporal and spatial anomaly detection
![Page 15: Machine Learning-based Anomaly Detection of Ganglia](https://reader034.vdocuments.us/reader034/viewer/2022050500/62718c2af5f028595957db5e/html5/thumbnails/15.jpg)
Prediction Results
[Figure: six bar charts comparing the prediction error of MA, EWMA, LR and LSTM]
• cpu_idle_RMSE: MA 2.3369, EWMA 1.8801, LR 2.1803, LSTM 1.8037
• mem_free_RMSE: MA 270825, EWMA 105891, LR 182082, LSTM 135482
• bytes_in_RMSE: MA 2068628, EWMA 1593743, LR 1933940, LSTM 1493421
• cpu_idle_MAPE: MA 0.0414, EWMA 0.0112, LR 0.0136, LSTM 0.0106
• mem_cached_MAPE: MA 0.0042, EWMA 0.0011, LR 0.0025, LSTM 0.0016
• bytes_in_MAPE: MA 0.4452, EWMA 0.2592, LR 0.3586, LSTM 0.2278
According to the statistical results over all metrics (mds01_prediction_result.xlsx):
• Network: LSTM performs better
• CPU: LSTM performs better
• Memory: EWMA performs better
LSTM predicts all metrics with a single model, whereas EWMA predicts with a model trained in real time.
![Page 16: Machine Learning-based Anomaly Detection of Ganglia](https://reader034.vdocuments.us/reader034/viewer/2022050500/62718c2af5f028595957db5e/html5/thumbnails/16.jpg)
Temporal Anomaly Detection Comparison
• Method 1: Prediction + Detection
  – Prediction: LSTM
  – Detection: Q-function
• Method 2: Feature Extraction + Detection
  – Feature extraction: statistical and fitting features
  – Detection: Isolation Forest
• Red line shows true values
• Black points show detected anomalies
• The two pictures show the different results for the same data using method 1 and method 2
  – Both method 1 and method 2 can detect the extreme fluctuations
  – Method 1 can also detect the tiny fluctuations
![Page 17: Machine Learning-based Anomaly Detection of Ganglia](https://reader034.vdocuments.us/reader034/viewer/2022050500/62718c2af5f028595957db5e/html5/thumbnails/17.jpg)
Temporal/Spatial Anomaly Detection Comparison
• Temporal anomaly detection
  – Detection: Iforest
  – Features: extracted features
• Spatial anomaly detection
  – Detection: Iforest
  – Features: metrics data
• Red line shows true values
• Black points show detected anomalies
• Orange line shows anomaly score
• The two pictures show the temporal anomalies and spatial anomalies for the same data
  – Black points are not spatial anomalies
  – Black points are within normal limits when the temporal dimension is not considered
  – Black points are temporal anomalies within a certain period of time
  – Traffic on metadata servers is usually low, so the anomaly score is a little high when bytes_in_value is large
Compared to traditional methods, these methods can detect more kinds of anomalies
![Page 18: Machine Learning-based Anomaly Detection of Ganglia](https://reader034.vdocuments.us/reader034/viewer/2022050500/62718c2af5f028595957db5e/html5/thumbnails/18.jpg)
Conclusion
• Data-driven anomaly detection is the trend
• Developed an anomaly detection framework
  – consists of common function modules for anomaly detection tasks
  – generic and scalable
• Completed a proof-of-concept test on Lustre metadata servers
  – prediction and detection
  – detected more temporal anomalies than traditional methods
• Anomaly filter modules will be added to the framework
• In the future, it will be applied to the production system
https://github.com/summerchenjuan/anomalydetection
Thank you!