machine learning-based anomaly detection of ganglia

Machine Learning-based Anomaly Detection of Ganglia Monitoring data in HEP Data Center Juan CHEN (Institute of High Energy Physics, CAS) Lu WANG (Institute of High Energy Physics, CAS) Qingbao HU (Institute of High Energy Physics, CAS) CHEP2019, 4 November 2019 – Adelaide, Australia 2019/11/2 IHEP-CC 1


Page 1: Machine Learning-based Anomaly Detection of Ganglia

Machine Learning-based Anomaly Detection of Ganglia Monitoring data

in HEP Data Center

Juan CHEN (Institute of High Energy Physics, CAS)

Lu WANG (Institute of High Energy Physics, CAS)

Qingbao HU (Institute of High Energy Physics, CAS)

CHEP2019, 4 November 2019 – Adelaide, Australia

2019/11/2 IHEP-CC 1

Page 2: Machine Learning-based Anomaly Detection of Ganglia

Outline

1. Background

2. Motivation

3. Anomaly Detection Framework

4. Instance Analysis

5. Conclusion


Page 3: Machine Learning-based Anomaly Detection of Ganglia

Background

• HEP Data Center

– 20,000 CPU slots
– 2000+ servers
– 20 PB disk storage
– 10 PB tape storage

• Growth of Data

– JUNO experiment data + LHAASO experiment data > 10 PB/year


Page 4: Machine Learning-based Anomaly Detection of Ganglia

Motivation

• Larger Cluster Scale

• Challenges of Anomaly Detection of Large Scale Storage System

• Various Abnormal Situations
– traffic abnormality / disk failure / operation exception / …

• Shortcomings of Traditional Methods (static thresholding / keyword matching)
– require expertise in specific software systems
– are not easy to port to other systems
– cannot easily adapt to changes in workload
– cannot easily adapt to changes in hardware configuration

• Different Anomaly Types
– Temporal Anomaly
– Spatial Anomaly

Page 5: Machine Learning-based Anomaly Detection of Ganglia

Objective

• Data-driven Anomaly Detection based on Machine Learning

• Anomaly Detection Framework

• Generic and Scalable
– common function modules for anomaly detection tasks
– custom algorithms

• Simple and Convenient
– good interactive interface

• Anomaly Detection based on the Framework


Page 6: Machine Learning-based Anomaly Detection of Ganglia

Anomaly Detection Framework Ⅰ


Prediction + detection:
1. Predict from historical metrics data
2. Compare true data with predicted data
3. Detect by deviation

Feature extraction + detection:
1. Extract time series features
2. Detect by extracted features

Spatial detection:
Detect directly on metrics data

Anomaly filter:
Set time intervals and node names to ignore anomalies caused by particular circumstances
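The filter step described above can be sketched as follows. This is only an illustration: the `IgnoreRule` class, the `filter_anomalies` function, and the node names are hypothetical and are not taken from the framework's code.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class IgnoreRule:
    """Hypothetical rule: suppress anomalies on `node` within [start, end]."""
    node: str
    start: datetime
    end: datetime


def filter_anomalies(anomalies, rules):
    """Return the (node, timestamp) anomalies that no ignore rule covers."""
    kept = []
    for node, ts in anomalies:
        ignored = any(r.node == node and r.start <= ts <= r.end for r in rules)
        if not ignored:
            kept.append((node, ts))
    return kept


# Example: a planned maintenance window on one made-up node.
rules = [IgnoreRule("node-a", datetime(2019, 10, 5, 2, 0), datetime(2019, 10, 5, 4, 0))]
anomalies = [
    ("node-a", datetime(2019, 10, 5, 3, 0)),  # inside the window: dropped
    ("node-a", datetime(2019, 10, 5, 9, 0)),  # outside the window: kept
    ("node-b", datetime(2019, 10, 5, 3, 0)),  # different node: kept
]
remaining = filter_anomalies(anomalies, rules)
```

Keeping the rules as data rather than code means an operator could add a maintenance window without touching the detection algorithms.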

Page 7: Machine Learning-based Anomaly Detection of Ganglia

Anomaly Detection Framework Ⅱ


Data source: Elasticsearch
– 20+ generic metrics (hundreds of millions of historical data points)
– about 100 special metrics (billions of historical data points)

Data access: Elasticsearch-python
– Query DSL
– JSON -> pandas.DataFrame

Web application: based on Django
– urls.py, Views.py
– model.py
– algorithm

User interface (.html pages)
– Sample Visualization
– Model Training
– Prediction
– Detection
– Sample Marking
– Historical Data Comparison
– Configuration Query
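A minimal sketch of the data access step, assuming a simple document layout. The field names `node`, `metric`, `timestamp`, and `value` are assumptions for illustration, not the real index schema, and the response below is a hand-written stand-in rather than real monitoring data. Only the standard `hits.hits[]._source` layout of an Elasticsearch search response is relied on.

```python
def build_query(node, metric, start, end):
    """Build a Query DSL body; the field names are assumed, not the real schema."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"node": node}},
                    {"term": {"metric": metric}},
                    {"range": {"timestamp": {"gte": start, "lt": end}}},
                ]
            }
        },
        "sort": [{"timestamp": "asc"}],
    }


def hits_to_rows(response):
    """Flatten a search response into one flat dict per hit."""
    return [hit["_source"] for hit in response["hits"]["hits"]]


# Hand-written stand-in for a search response.
fake_response = {
    "hits": {
        "hits": [
            {"_source": {"node": "node-a", "metric": "cpu_idle",
                         "timestamp": "2019-10-05T00:00:00", "value": 97.5}},
            {"_source": {"node": "node-a", "metric": "cpu_idle",
                         "timestamp": "2019-10-05T00:05:00", "value": 96.0}},
        ]
    }
}

body = build_query("node-a", "cpu_idle", "2019-10-05T00:00:00", "2019-10-06T00:00:00")
rows = hits_to_rows(fake_response)
# With pandas installed, pandas.DataFrame(rows) gives one column per field.
```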

Page 8: Machine Learning-based Anomaly Detection of Ganglia

Algorithm Ⅰ— Prediction

• Statistical Machine Learning

• SMA / LR (Linear Regression) / ARIMA / EWMA

– Specific model for each node

– Real-time training

– Lower complexity


• Deep Learning

• LSTM

– Large-scale historical data

– Training in advance

– Generic for homogeneous nodes

– Higher complexity
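Two of the statistical predictors listed above, SMA and EWMA, can be sketched as one-step-ahead forecasts. This is a textbook formulation under assumed parameters (`window`, `alpha`), not necessarily the variant used in the framework.

```python
def sma_predict(history, window):
    """Predict the next value as the mean of the last `window` observations."""
    recent = history[-window:]
    return sum(recent) / len(recent)


def ewma_predict(history, alpha):
    """Predict the next value as an exponentially weighted mean of the history.

    `alpha` in (0, 1] is the weight given to the newest observation.
    """
    level = history[0]
    for x in history[1:]:
        level = alpha * x + (1 - alpha) * level
    return level


series = [10.0, 12.0, 11.0, 13.0]
sma_next = sma_predict(series, window=2)      # mean of 11.0 and 13.0
ewma_next = ewma_predict(series, alpha=0.5)   # levels: 10 -> 11 -> 11 -> 12
```

Both models are refit from the recent history of a single node on every call, which matches the "specific model for each node" and "real-time training" properties noted for this family.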

Page 9: Machine Learning-based Anomaly Detection of Ganglia

Algorithm Ⅱ— Feature Extraction

Features are extracted from each time series window

• Statistical Features
– Highest Value
– Lowest Value
– Mean
– Variance
– Standard Deviation
– ……
About 30 feature types in total

• Fitting Features
– the difference between x and the smoothed value after MA
– the difference between x and the smoothed value after EWMA
– ……
The number of features depends on the length of the time series

Default: window = 36, delta = 5 min (3 h = 36 × 5 min)
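A minimal sketch of the two feature families, covering only a handful of the roughly 30 statistical features. The smoothing parameters `k` and `alpha` are assumed values, and the function names are illustrative.

```python
import statistics


def moving_average(xs, k):
    """Trailing moving average; early points average over what is available."""
    out = []
    for i in range(len(xs)):
        chunk = xs[max(0, i - k + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def ewma(xs, alpha):
    """Exponentially weighted moving average, seeded with the first value."""
    out = [xs[0]]
    for x in xs[1:]:
        out.append(alpha * x + (1 - alpha) * out[-1])
    return out


def extract_features(window, k=3, alpha=0.5):
    """Return (statistical features, fitting features) for one window."""
    stats = {
        "max": max(window),
        "min": min(window),
        "mean": statistics.fmean(window),
        "variance": statistics.pvariance(window),
        "std": statistics.pstdev(window),
    }
    # Fitting features: one residual per point, so their count grows
    # with the length of the window.
    fitting = {
        "ma_residual": [x - m for x, m in zip(window, moving_average(window, k))],
        "ewma_residual": [x - e for x, e in zip(window, ewma(window, alpha))],
    }
    return stats, fitting


# One default window: 36 points at 5-minute spacing (synthetic values).
window = [float(v) for v in range(36)]
stats, fitting = extract_features(window)
```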

Page 10: Machine Learning-based Anomaly Detection of Ganglia

Algorithm Ⅲ— DetectionⅠ

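True value: $x_t$
Predicted value: $y_t$
Deviation: $e_t = (x_t - y_t) / y_t$

[Figure: sliding time window over the deviation series $e_t$]

Assumption: the deviations within the sliding time window follow a normal distribution with mean $\mu$ and standard deviation $\sigma$

Threshold selection according to probability:

1. N-sigma: set parameter $k$; if $e_t < \mu - k\sigma$ or $e_t > \mu + k\sigma$, $e_t$ is detected as an anomaly

2. Q-function: $\text{anomaly score} = 1 - Q(e_t)$, with $0.5 \le \text{anomaly score} < 1$; set parameter score; if the anomaly score exceeds score, $e_t$ is detected as an anomaly

[Figure: deviation distribution with regions labelled normal and anomaly]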
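Both thresholding rules can be sketched with the standard library alone. Two points here are assumptions rather than statements from the slides: the Q-function is applied to the absolute standardized deviation $|e_t - \mu| / \sigma$ (which yields the stated lower bound of 0.5), and the threshold is compared against the anomaly score.

```python
import math
import statistics


def q_function(z):
    """Tail probability of the standard normal: Q(z) = P(Z > z)."""
    return 0.5 * math.erfc(z / math.sqrt(2))


def window_stats(window):
    """Mean and (population) standard deviation of the deviations in the window."""
    return statistics.fmean(window), statistics.pstdev(window)


def n_sigma_is_anomaly(e_t, window, k=3.0):
    """Flag e_t if it lies outside [mu - k*sigma, mu + k*sigma]."""
    mu, sigma = window_stats(window)
    return e_t < mu - k * sigma or e_t > mu + k * sigma


def q_anomaly_score(e_t, window):
    """Score 1 - Q(|z|), where z is e_t standardized (assumed form)."""
    mu, sigma = window_stats(window)
    if sigma == 0:
        raise ValueError("window has zero variance")
    return 1.0 - q_function(abs(e_t - mu) / sigma)


# Synthetic deviations for one sliding window.
window = [-0.02, 0.01, 0.0, 0.02, -0.01]
small, large = 0.01, 0.2
flags = (n_sigma_is_anomaly(small, window), n_sigma_is_anomaly(large, window))
scores = (q_anomaly_score(small, window), q_anomaly_score(large, window))
```

The score is 0.5 when the deviation equals the window mean and approaches 1 as the deviation moves into the tails, so a higher score threshold makes the detector more conservative.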

Page 11: Machine Learning-based Anomaly Detection of Ganglia

Algorithm Ⅲ— DetectionⅡ


Isolation Forest
• Subsample
• Build trees based on cutting
– Split randomly on features
– Random samples
– Each data tuple forms a leaf node
– Limited height
• Average path length will be shorter for anomalies
• Works well in high dimensional problems

Anomaly Score

$$S(x, n) = 2^{-E(h(x)) / c(n)}$$

$$c(n) = 2H(n - 1) - \frac{2(n - 1)}{n}$$

$$H(i) = \ln i + 0.5772156649$$

where $E(h(x))$ is the average path length of $x$.
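The anomaly score formula can be evaluated directly. The sketch below computes only the score from a given average path length; it does not build the trees, and the subsample size 256 is an assumed example value.

```python
import math

EULER_GAMMA = 0.5772156649


def harmonic(i):
    """Approximate harmonic number H(i) = ln(i) + Euler's constant."""
    return math.log(i) + EULER_GAMMA


def c(n):
    """Normalizing term c(n) = 2H(n - 1) - 2(n - 1)/n, for n >= 2."""
    if n < 2:
        raise ValueError("c(n) is evaluated here only for n >= 2")
    return 2.0 * harmonic(n - 1) - 2.0 * (n - 1) / n


def anomaly_score(mean_path_length, n):
    """S(x, n) = 2 ** (-E(h(x)) / c(n))."""
    return 2.0 ** (-mean_path_length / c(n))


n = 256                                   # assumed subsample size
short_path = anomaly_score(4.0, n)        # isolated quickly: higher score
long_path = anomaly_score(12.0, n)        # hard to isolate: lower score
at_average = anomaly_score(c(n), n)       # path length equal to c(n)
```

A point whose average path length equals $c(n)$ scores exactly 0.5, and the score rises toward 1 as the path gets shorter, which is why anomalies stand out.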

Page 12: Machine Learning-based Anomaly Detection of Ganglia

Algorithm Ⅳ— Performance Evaluation

• Evaluating prediction performance

True value list: $X$, predicted value list: $Y$

– Mean Error (ME): $\frac{1}{n}\sum_{i=1}^{n}(x_i - y_i)$

– Mean Absolute Error (MAE): $\frac{1}{n}\sum_{i=1}^{n}|x_i - y_i|$

– Root Mean Squared Error (RMSE): $\sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - y_i)^2}$

– Mean Percentage Error (MPE): $\frac{1}{n}\sum_{i=1}^{n}\frac{x_i - y_i}{x_i}$

– Mean Absolute Percentage Error (MAPE): $\frac{1}{n}\sum_{i=1}^{n}\left|\frac{x_i - y_i}{x_i}\right|$

• Anomalies are rare, so detection accuracy is not measured at present. We are more concerned with anomalies that traditional methods cannot detect.
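The five prediction error metrics listed above are short enough to write out in full. This is a plain-Python sketch with made-up numbers; note that MPE and MAPE divide by the true value and are therefore undefined when any $x_i$ is zero.

```python
import math


def mean_error(x, y):
    return sum(xi - yi for xi, yi in zip(x, y)) / len(x)


def mean_absolute_error(x, y):
    return sum(abs(xi - yi) for xi, yi in zip(x, y)) / len(x)


def root_mean_squared_error(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)) / len(x))


def mean_percentage_error(x, y):
    # Undefined if any true value is zero.
    return sum((xi - yi) / xi for xi, yi in zip(x, y)) / len(x)


def mean_absolute_percentage_error(x, y):
    # Undefined if any true value is zero.
    return sum(abs((xi - yi) / xi) for xi, yi in zip(x, y)) / len(x)


true_values = [2.0, 4.0]       # X (made-up)
predicted = [1.0, 5.0]         # Y (made-up)
me = mean_error(true_values, predicted)
mae = mean_absolute_error(true_values, predicted)
rmse = root_mean_squared_error(true_values, predicted)
mpe = mean_percentage_error(true_values, predicted)
mape = mean_absolute_percentage_error(true_values, predicted)
```

In this example the signed errors cancel, so ME is zero while MAE and RMSE are not, which is why the signed and absolute variants are reported together.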

Page 13: Machine Learning-based Anomaly Detection of Ganglia

User Interface


[Screenshots of the interface pages: Sample Visualization, Sample Marking, Model Training, Prediction, Detection, Comparison, Configuration Query]

Page 14: Machine Learning-based Anomaly Detection of Ganglia

Instance Analysis—Lustre Metadata Servers

• Metrics
– Network: bytes_in / bytes_out / pkts_in / pkts_out
– CPU: cpu_idle / load_one / load_five / load_fifteen
– Memory: mem_free / mem_buffers / mem_cached / proc_run / proc_total / swap_free

➢ Prediction
– Predicted with MA / EWMA / LR / LSTM
– Compare prediction results

➢ Detection
– Detected with temporal anomaly detection algorithms
– Compare two methods (1. Prediction + Detection; 2. Feature Extraction + Detection)
– Detected with spatial anomaly detection algorithm
– Compare temporal / spatial anomaly detection

• Samples
– Training dataset: ≈140,000 samples (all metadata servers, 2019.08.01-2019.09.30)
– Testing dataset: ≈6000 samples (mds01.ihep.ac.cn, 2019.10.05-2019.10.25)
– window = 3 hours

Page 15: Machine Learning-based Anomaly Detection of Ganglia

Prediction Results

Bar chart values (prediction error per algorithm):

Chart              MA        EWMA      LR        LSTM
cpu_idle_RMSE      2.3369    1.8801    2.1803    1.8037
mem_free_RMSE      270825    105891    182082    135482
bytes_in_RMSE      2068628   1593743   1933940   1493421
cpu_idle_MAPE      0.0414    0.0112    0.0136    0.0106
mem_cached_MAPE    0.0042    0.0011    0.0025    0.0016
bytes_in_MAPE      0.4452    0.2592    0.3586    0.2278

According to the statistical results over all metrics (mds01_prediction_result.xlsx):
– Network: LSTM performs better
– CPU: LSTM performs better
– Memory: EWMA performs better

LSTM predicts all metrics with one model, while EWMA predicts with a model trained in real time

Page 16: Machine Learning-based Anomaly Detection of Ganglia

Temporal Anomaly Detection Comparison


Method 1: Prediction + Detection
– Prediction: LSTM
– Detection: Q-function

Method 2: Feature Extraction + Detection
– Feature Extraction: Statistical and Fitting Features
– Detection: Isolation Forest

[Figures: detection results of Method 1 and Method 2 on the same data]

• Red line shows true values
• Black points show detected anomalies
• The two pictures show the different results for the same data using Method 1 and Method 2
– Both Method 1 and Method 2 can detect the extreme fluctuations
– Method 1 can also detect the tiny fluctuations

Page 17: Machine Learning-based Anomaly Detection of Ganglia

Temporal/Spatial Anomaly Detection Comparison


Temporal anomaly detection
– Detection: Isolation Forest (Iforest)
– Features: extracted features

Spatial anomaly detection
– Detection: Isolation Forest (Iforest)
– Features: metrics data

• Red line shows true values
• Black points show detected anomalies
• Orange line shows anomaly score
• The two pictures show the temporal anomalies and spatial anomalies for the same data
– Black points are not spatial anomalies
– Black points are within normal limits when the temporal dimension is not considered
– Black points are temporal anomalies within a certain period of time
– Traffic on metadata servers is usually low, so the anomaly score is slightly higher when the bytes_in value is large

Compared to traditional methods, these methods can detect more kinds of anomalies

Page 18: Machine Learning-based Anomaly Detection of Ganglia

Conclusion

• Data-driven anomaly detection is the trend

• Developed an anomaly detection framework
– consists of common function modules for anomaly detection tasks
– generic and scalable

• Completed a proof-of-concept test on Lustre metadata servers
– prediction and detection
– detected more temporal anomalies than traditional methods

• Anomaly filter modules will be added to the framework

• In the future, it will be applied to the production system

https://github.com/summerchenjuan/anomalydetection

Thank you!