atwo-layerarchitectureforfailurepredictionbasedon high

Research ArticleA Two-Layer Architecture for Failure Prediction Based onHigh-Dimension Monitoring Sequences

Xue Wang1 Fan Liu 1 Yixin Feng2 and Jiabao Zhao3

1School of Management amp Engineering Nanjing University Nanjing China2Guotai Junan Securities Shanghai China3Department of Control and Systems Engineering Nanjing University Nanjing China

Correspondence should be addressed to Fan Liu liufannjueducn

Received 15 December 2020 Revised 17 February 2021 Accepted 17 March 2021 Published 29 March 2021

Academic Editor Zhen Zhang

Copyright copy 2021 Xue Wang et al +is is an open access article distributed under the Creative Commons Attribution Licensewhich permits unrestricted use distribution and reproduction in any medium provided the original work is properly cited

In recent years the distributed architecture has been widely adopted by security companies with the rapid expansion of theirbusiness A distributed system is comprised of many computing nodes of different components which are connected by high-speed communication networks With the increasing functionality and complexity of the systems failures of nodes are inevitablewhich may result in considerable loss In order to identify anomalies of the possible failures and enable DevOps engineers tooperate in advance this paper proposes a two-layer prediction architecture based on the monitoring sequences of nodes statusGenerally speaking in the first layer we make use of EXPoSE anomaly detection technique to derive anomaly scores in constanttime which are then used as input data for ensemble learning in the second layer Experiments are conducted on the data providedby one of the largest security companies and the results demonstrate the predictability of the proposed approach

1 Introduction

Due to the complexity of financial services the financialindustry demands applicable reliable and robust businesssystems provided by security companies For this purposethe distributed architecture with its strength in dealing withlarge-scale data and high elasticity has been widely applied[1] Generally a distributed system is comprised of com-puting nodes of different components Each node isarranged to perform a specific computation task and isconnected to others through high-speed communicationnetwork

Along with the rapid growth of financial market thefunctionality and complexity of the distributed businesssystems are increasing As a result the failures of nodes areinevitable which may lead to system crash Loss from systemcrash could be very large For example Amazon lost nearly 2million dollars in about 30minutes of service downtimecaused by network problem As is reported by Silva andAlonso [2] the cost during downtime can be as high as 6million dollars per hour for online brokerage services

+erefore the task of predicting potential failures of nodes isof great importance to guarantee the sustainability ofsystems

Motivated by this some companies construct elaboratemonitoring infrastructures to collect real-time attributes ofeach node which reflect systemsrsquo health in time +e at-tributes of interest include the load of CPUs the usage ofmemory and the storage forming high-dimensional dataAt the same time cloud computing techniques along withmachine learning methods can help analyse the monitoringdata [3] Specifically detecting anomalies assists companiesto predict failures and prepare for system recovery Anomalydetection refers to the problem of identifying anomalouspatterns in data and the deployment of detectors to KPIs(key performance indicators) metrics of a distributed systemhas been widely adopted by many Internet companies [4ndash6]For interesting researchers we recommend review papers byChandola et al [7] Agrawal [8] and Xu et al [9]

Our paper proposes a two-layer potential failure pre-diction framework Specifically the first layer generatesanomaly scores based on high-dimensional monitoring data

HindawiComplexityVolume 2021 Article ID 6623666 9 pageshttpsdoiorg10115520216623666

of a given component by using a real-time unsupervisedanomaly detector A higher score indicates a higheranomalous probability of the current state of the sequenceBased on anomaly scores the second layer employs randomforest one of the successful ensemble classification methodsto predict whether nodes will fail within a given timeinterval

+e main contributions of this study are as follows Firstit describes a real-time efficient and applicable method toidentify whether the components of distributed systemswould fail or not in the future based on the self-monitoringdata +e online method is essential for industrial big datascenario such as the Internet of +ings (IoT) in the future[10] We apply the methodology on the data provided by oneof the largest security companies in China According to atest on 6-day data the precision is over 088 and the recallcould be as high as 061 with proper forecasting interval time+e results demonstrate the predictability of our method Inthat this paper enriches the study on online failure pre-diction for distributed systems Second we take advantage ofensemble learning to leverage the performance of anomalydetector and give some IT operation suggestions from ourexplainable model for DevOps engineers Compared withresults using only one detector in the first layer ensemblelearning provides higher prediction accuracy and moreknowledge discovery

+e reminder of paper is organized as follows Section 2gives a brief review on anomaly detection and fault pre-diction A detailed description of the proposed predictionframework is presented in Section 3 followed by a case studywith discussion in section 4 Finally Section 5 includesseveral general conclusions and directions for future work

2 Literature Review

Many previous works focus on failure prediction and di-agnose based on text data such as logs reported issues andwarnings [11ndash14] However due to the irregularities of textdata these methods build complex models for betterperformance

Nowadays more researchers directly analyse temporalmonitoring data Nosayba El-Sayed et al identified relativefactors of unsuccessful jobs from log and then designed arandom forests framework learning form configurationmessage and monitoring metrics to predict the unsuc-cessful termination of a job [15] He et al proposed Log3C[11] a novel cascading clustering algorithm for log se-quences and then identified the impactful problems bycorrelating the clusters of log sequences with system KPIsLin et al introduced a failure prediction technique MING[16] which combines a LSTM model incorporating thetemporal data with a random forest model consumingspatial data Haowen Xu et al proposed Donut [17] anunsupervised anomaly detection algorithm based on VAEfor Seasonal KPIs Zhao et al proposed eWarn [18] whichextracts a set of features (including textual features andstatistical features) to train classification models (XGBoost)with alert However there are tens of TBs of logs andthousands of metrics per day for a large-scale distributed

system in reality meaning it is a great challenge to selectuseful features train and update a prediction model [19]Besides some black-box ML and DL models like LSTMandMING-based models is hard to explain the results +elack of decision interpretability is not convenient formaintenance work of DevOps engineers [20] +ereforethis paper calculate the monitoring time series anomalyscore in real-time which is automatic feature engineeringto build a random forest-based prediction model in thesecond layer

+ere is extensive work on anomaly detection methodsthat can be mainly categorized into classification-basedstatistics-based like distanced-based [21ndash24] density-based[25] nonlinear model based [26] and so on But most ofthese approaches are required the full dataset and not un-suitable for real-time streaming sequences For sequencedata the length increases as time goes Some methods are tomake one-step ahead forecast and defines the observationswith relative large prediction errors as anomalies +eprediction methods that have been adopted include auto-regressive integrated moving average (ARIMA) [27] ex-ponential smoothing [28] long short-term memory (LSTM)[29] and so on +e s-based methods like extreme stu-dentized deviate k-sigma median absolute deviation [30]and Bayesian online detection [31] are also applied for se-quence data with proper selection of time window Ouranomaly detection method is also this genre But whatdistinguishes is that we chose a kernel-based method whichmake no assumption about the data distribution and score sofast only in constant time

3 Prediction Framework

Let Xit (xi1 xi2 xit) i 1 2 p t 1 2 denotethe monitoring data for ith attribute of one certain nodeuntil time t+en themonitoring data for p attributes of onenode forms multidimensional sequences denoted by Xt and

Xt

x11 x12 middot middot middot x1t


⋮ ⋮ ⋱ ⋮

xp1 xp2 middot middot middot xpt

⎡⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎣

⎤⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎦

(1)

In this section we construct a two-layer predictionframework which transforms Xt into the classification ofwhether the node will fail within time interval (t t + H]where H is the prespecified length of time as is shown inFigure 1 +e tow-layer architecture consists of one anomalydetection layer and one classification layer

+e anomaly detection layer contains a real-timeanomaly detector EXPoSE [32] For each monitoring se-quence the detector generates anomaly score vectorSit (si1 si2 sit) i 1 2 p where sit quantifies theanomalous level of the ith sequence at time t +e higher theanomaly scores the higher the probability that the sequencesare anomalous +ese scores are then used as input for thesecond layer where a classificationmodel is built+e detailsof both layers are presented below

2 Complexity

31AnomalyDetectionLayer In this paper we choose a real-time anomaly detector for sequence EXPected SimilarityEstimation (EXPoSE) +ey are chosen because (1) the real-time scoring is fast only in constant time and in a fixedamount of memory (2) EXPoSE the kernel-based detectormakes no prior assumption about the data distribution whichcould score different types sequences (3) +e anomaly scorequantifying the anomalous level is the base of interpretabilityin the second layer

+e EXPected Similarity Estimation (EXPoSE) computesthe similarity between new data and the distribution ofhistorical normal data [32] Anomaly scores are representedby the opposite side of similarity Specifically EXPoSE usesthe inner product of feature map ϕ(xt) and kernel meanmap μ[Pt] to represent the normal likelihood η(xt) of thecurrent observation (the likelihood of xt is normal) We have

η xt( 1113857 ϕ xt( 1113857 middot μ Pt1113858 1113859 (2)

where xt is a random variable whose values is in a mea-surable spaceX Let P be a Borel probability measure onXwhich represents the distribution of normal data μ[Pt] is thekernel embedding of X defined as

μ Pt1113858 1113859 1113946X

k xt middot( 1113857dP xt( 1113857 (3)

+e kernel mean map μ[Pt] is approximated by em-pirical embedding of samples from P

1113954μ Pt1113858 1113859 1t

1113944

t

i1ϕ xt( 1113857 (4)

As for the feature map ϕ(xt) it is approximated based onrandom kitchen sinks (RBS) [33] with Gaussian Radial BasisFunction (RBF) kernel We first draw N samplesG (G1 G2 GN)T from univariate Gaussian distribu-tion then approximate ϕ(xt) by

1113954ϕ xt( 1113857 1N

radic exp iGxt( 1113857 (5)

At last the anomaly score denoted by s4t could bederived by

s4t 1 minus 1113954ϕ xt( 1113857 middot 1113954μ Pt1113858 1113859 (6)

Furthermore sliding windowing and decay mechanismhelp to adapt the concept drift [34]+e sliding windowing is

Output failure prediction result

Layer 2 the classification layer

Layer 1 the anomaly detection layer

Scores S1t Scores SptScores S2t Scores Scores

Detector

X1t X2t X3t Xpthellip

Input sequences

Figure 1 Two-layer prediction framework

Complexity 3

requiring keep the past part of observations in memory(window size l) and updating incrementally

wt 1l

1113944

t

itminusl+1ϕ xi( 1113857 (7)

Due to time series dependence the decay mechanismpays more attention on recent data

wt cϕ xt( 1113857 +(1 minus c)wtminus1 for tgt 1 (8)

With the above anomaly detector for one node with p

attributes there are correspondingly p scores at time t+erefore we adopt ensemble learning approach in thesecond layer

32ClassificationLayer Random forest is a strong ensemblelearning method to uncover the law between features andlabels It consists of a series of decision trees which greedilyand recursively chooses an available feature to split the datainto purer subsets measured by the Gini index

321 Decision Tree If the data D contain K classes (in ourcase K 2) and for each class there is a proportion of pk inthe data then the Gini index for data D is

Ent(D) minus 1113944K

k1pklog2 pk( 1113857 (9)

If D is split into (D1 D2 Da) subsets according tofeature a and each subset Da is with size |Da| then theinformation gain of D after this split is

Gain(D a) Ent(D) minus 1113944V

v1

Dv

DEnt Dv( 1113857 (10)

+e bigger the Gain(D a) the best it splits the dataSuppose A is the set of features then the best feature alowast ischosen by

alowast argmaxaisinA

Gain(D a) (11)

As the features are all continuous we build binary trees+at is for each subset of the original data we split it intotwo parts as long as there is any available feature and theincrease of information gain is significant

322 Random Forest Decision tree is easy-to-implementbut may not perform well in classification especially whenthe dimension of features is high Random forest classifiesobservations based on the majority votes across all the treeswhich is proved to be more precise and robust

As is illustrated in Figure 1 each decision tree is built ona subset of features which are randomly selected from the p

features and a dataset sampled from the original data bybootstrap methods+en the final failure prediction is madeby voting by all decision trees

4 Case Study

41DataDescription We collect data from one of the largestsecurity companies in China using its self monitoring sys-tem +e system consists of Logstach which parses all kindsof logs Kafka which transmits high volumes of data toJstorm with low latency Jstorm which offers on-the-fly dataaggregation and calculation and Elasticsearch which storessearches and analyzes large amounts of data +is moni-toring system generates alerts when the system breaks somerules set by the DevOps engineers or softwares +ese alertscontain information about potential failures [35 36] Spe-cifically we list all kinds of alerts and relative potentialfailures in self monitoring system (Table 1) For examplewhen thread fails the monitoring system will label this timewith one failure event and record it in the logs With such asystem one can track the status of the nodes as they evolvewith time

Finally the data used contain 30 nodes from Elasticsearchcomponent which save the companyrsquos centralized tradingsystem data each is observed on five attributes namelyavailable storage of the node (ASN) usage of Java VirtualMachine (JVM) memory of heap (UJM) recent historicalCPU usage for the whole system (HLC) one-minute currentload average of CPU (CLU) and disk IO time (DIO) +eattributes are recorded every minute from November 8 2018to November 22 2018 Since some sequences basically do notchange there are finally 139 time series for modeling +enthere are more than 20000 observations for each sequence ofwhich less than 05 are labeled with alerts by the selfmonitoring system (Table 2)

42 Data Preprocessing We group and average the moni-toring sequences every half hour since the data granularity ofper minute is too small with more noisy Note that at anytime t our aim is to predict the potential failures of eachnode with [t t + H] where H is the time interval defined insection 3 +erefore we label the observation as 1 if there isany failure within [t t + H] or 0 otherwise After labelingthere are 13 of the data with label 1 which does not requireother complicated oversampling techniques [16]

We use the first 10-day data as training sample based onwhich random forest is trained +e remaining 4-day dataare used for testing

+e procedures for classifying xt for one node are asfollows For the above 5th (i 1 2 3 4 5) attribute wegenerate 5 anomaly scores (si1t si5t) according to(xi1 xit) +en the 5 scores are used as input features inthe trained random forest and generate the label of xt

43 Results and Discussion

431 First Layer In the first layer we apply EXPoSE withwindow size 2 days and decay rate 001 on each of the fiveattributes of each node Because there are over 750 scores foreach series we only illustrate the anomaly scores fortraining from November 10 2018 to November 17 2018

4 Complexity

Since our sliding window size is two days the first two daysare not shown in the next figures

In the two figures the X-axis denotes the calendar timeand the Y-axis denotes the scores +e shape of plus denotesthat there is a failure alert and one can find the actual timevia the vertical dashed line +e failure alert in Figure 2 isfrom the same node of five attributes Additionally Figure 3shows all alerts of Elasticsearch component with five attri-butes +rough figures we find some interesting knowledge

(1) +e anomaly score of UJM sequence is high beforethe failure alert which indicates the anomaly de-tector is efficient

(2) Synchronization effect of sequence anomaly scoresattributes like CLU HLC and DIO definitely changesynchronously indicating a certain correlation be-tween these metrics

(3) Failure alerts of other nodes will also be reflected inthe metrics of the current node reference to Figure 3

(4) Also some sequences are not related to alert likeASN in Figures 2 and 3 +ese sequences coulddistract DevOps engineers in reality

+en we calculate correlation matrix between the fiveseries and their anomaly scores in Figures 2 and 3 and il-lustrate them in Tables 3 to 4

It is found that the patterns of score series related toattribute ASN are almost independent from others but the

series related to HLC highly correlate with DIO Suchfindings may indicate a potential topological relationshipbetween the HLC and DIO

432 Second Layer Predicting the failures is actually aclassification problem (failure or nonfailure) thus we em-ploy precision and recall to evaluate the predictabilityWithout loss of generality we treat the observations labeledwith failures as positive class and the remaining as negativeclass +en the random forest produces four types of out-come Two of them refer to correct (or true) classificationincluding true positives (TP) and true negatives (TN)meaning the number of the failure which is correctly clas-sified as positive or negative True negative (TN) and falsenegatives (FP) mean the number of the normal observationis correctly classified as negative or positive in the nextformulas +en the precision and recall are calculated by

precision TP

TP + FP

recall TP

TP + FN

F1 score 2 times precision times recallprecision + recall

(12)

respectively+e final random forest contains 400 decision trees max

depth of the tree is 6 We tested 5 cases as the time interval H

increases from 30 minutes to 4 hours and the results areshown in Table 5

We find the following

(1) +e precisions are all higher than 85 when H is lessthan 35 hour meaning that our model rarely makeswrong prediction Precision perform best (88) whenH equal 3 meaning our method give accurately threehoursrsquo early warning It is significant for DevOpsengineers to focus on the true failure in advance

(2) +e recalls are fairly acceptable Since our labelingcontains more probable failure observations as longas time is within [t t + H]

(3) AS the time interval increases both measures be-come higher which means the model performs

Table 1 Total types of alerts

Alerts type Numbers ofalerts Component Potential failure

ISR has increased in last 15m 37 Kafka IO parallelism is highISR has decreased in last 15m 17 Kafka Data replica is limited which could cause data inconsistencyLeader election has occurred in last15m 8 Kafka New kind of data or maybe kafka leader fails

Kafka health check failed 17 Kafka Kafka healthy problemIP port detection abnormal 8 All Newwork problemEs cluster health unassigned shardsiquest0 18 Elasticsearch Slave shards that are not assigned indicating that the health of the

cluster has deterioratedEs cluster health number of pendingtasksiquest100 9 Elasticsearch Task stacking problem

Disk IO response time is long 26 Elasticsearch System latency is large

Table 2 Total number of failures

Node Number of failures Failure ratio ()1 17 00852 17 00853 13 00654 13 00655 8 0046 6 0037 4 0028 4 0029 4 00210 2 00111 2 001Total 90 045

Complexity 5

10

08

06

04

02

00

Ano

mal

y sc

ore

2018

-11-

10

2018

-11-

11

2018

-11-

12

2018

-11-

13

2018

-11-

14

2018

-11-

15

2018

-11-

16

2018

-11-

17

2018

-11-

18

2018

-11-

19

Date

ASNUJMHLC

CLUDIOFailure alert

Figure 3 Results from EXPoSE and all alerts of Elasticsearch

Table 3 Correlations between the five attributes of one node

ASN UJM CLU HLC DIOASN 1000 minus0174 minus0007 minus001 minus001UJM mdash 1000 0065 0076 0076CLU mdash mdash 1000 0987 098HLC mdash mdash mdash 1000 0980DIO mdash mdash mdash mdash 1000

10

08

06

04

02

00

2018

-11-

10

2018

-11-

11

2018

-11-

12

2018

-11-

13

2018

-11-

14

2018

-11-

15

2018

-11-

16

2018

-11-

17

2018

-11-

18

2018

-11-

19

Date

ASNUJMHLC

CLUDIOFailure alert

Ano

mal

y sc

ore

Figure 2 Results from EXPoSE and alerts of one node

Table 4 Correlations between the anomaly scores generated byEXPoSE

ASN UJM CLU HLC DIOASN 1000 0009 016 020 0307UJM mdash 1000 minus011 minus0074 minus018CLU mdash mdash 1000 092 0903HLC mdash mdash mdash 1000 0853DIO mdash mdash mdash mdash 1000

6 Complexity

better However when the time interval is too largeit is quite difficult to make preparation for failurerecovery since the actual failure time is hard to belocated

(4) Since the random forest-based model is interpret-able we could derive the feature importance forDevOps engineers to pay more attention on crucialsequences and better improve the monitoring system(Table 6)+e feature importance is Gini importancethe average of impurity decreases over all trees in theforest which reflect how each feature plays an im-portant role in the split

433 Comparisons Although the proposed method per-forms well two questions still remain

(1) Whether or not should we apply anomaly detectionbefore the classification method is applied

(2) Whether or not other anomaly method performsbetter

To answer the first question we remove the anomalydetection layer and directly use the attributes as input fordeveloping a random forest Besides we compare with otherclassical machine learning classification methods (supportvector classification (SVM)) +e results are shown in thefirst three rows of Table 7

Obviously the results in both precision and recall aremuch worse than the two-layer prediction framework+eremay be two reasons First the original data may contain toomuch noise which harms the prediction accuracy Secondthe anomaly detection technique uses not only the currentvalue of the sequences but also the previous values soanomaly scores to some extent summarize the overallperformance of the attributes and mine key point of change+is experiment demonstrates that ensemble learning isnecessary to improve the forecasting accuracy

On the other hand we compare with the LSTM anomalydetector [37] We train LSTM model from normal data ofone week and then predict xt

prime based on the previous threehours to derive anomaly score pt

pt xtprime minus xt

11138681113868111386811138681113868111386811138681113868

xt

(13)

Besides we compare with another real-time detectorNumenta [38] and KNNCAD [39] with the same windowparameters Finally random forest learns all 138 anomalyscore and alerts label (H equals 3 hour) +e results areshowed in the last two rows of Table 7

We find that LSTM performs fairly good from Table 7but the precision is still too low +ough recall is high it is

not important in reality for DevOps engineers due to H

labeling However for KNNCAD the recall is too low toapply Furthermore computation efficiency of expose is 70times faster than LSTM Expose does not even need to trainthe model offline like LSTM and is better for parallelcomputation From another angle our architecture is alsouniversal Different anomaly detectors mean more flexibilityto detect diverse anomalies in reality

5 Conclusion

Predicting the failures of components is of great impor-tance for sustaining business systems of security com-panies and the development of financial market In thispaper we present a new architecture for transformingsystem monitoring data into the predictions of potentialcomponent failures Our method consists of two layers ananomaly detection layer for generating anomaly scoreseries and a classification layer for identifying whether thenode is suspicious to fail Since there are a variety ofeffective anomaly detectors random forest an ensemblelearning approach is employed to utilize and gives play tothe comparative advantages of different detectors +ecase study shows that the method performs very well interms of precision and recall relative to alternativemethods

+ere are a number of promising directions for futureresearch First our method focuses on predicting whetherthe component fail within a time interval but ignores theimportance of lead time of prediction For example cor-rectly identifying failures occurring over (6 00 6 10] at time5 00 is better than the time 5 30 because the former oneprovides more time to prepare for potential problems In-corporating the lead time in loss function of the classificationproblem is a valuable next step In a different direction onemay consider how to extend our method by considering thetopological relationship between different nodes of the samecomponent to find the root cause of failure +ere may be apass-through effect that one failure may cause another

Table 6 Top five important features

Nodes Feature name Feature importance ()1 CLU 2272 ASN 2243 ASN 2184 CLU 2185 UJM 191

Table 7 Anomaly detector comparation

Model Precision Recall f1_scoreExpose RF 088 061 073RF 020 001 011SVM 001 004 002LSTM 060 091 065Numenta 064 039 048KNNCAD 095 004 008

Table 5 Failure prediction results with interval H

MeasuresPrespecified H(h)

1 2 25 3 35 40Recall 040 048 054 061 063 064Precision 090 087 089 088 080 067f1_score 060 062 067 073 071 065

Complexity 7

Besides we also expect researchers to use more data evenfrom other industries to validate our method

Data Availability

+e data used to support the findings of this study areavailable from the corresponding the author upon request

Conflicts of Interest

+e authors declare that there are no conflicts of interest

Authorsrsquo Contributions

F L Y F and J Z conceptualized the study X W wasinvolved in the methodology X W was responsible for thesoftware XW and J Z validated the study Y F curated thedata F L wrote reviewed and edited the original draft J Zwas involved in the funding acquisition

Acknowledgments

+is work was supported by the National Natural ScienceFoundation of China (Grant no 71732003) and the Fun-damental Research Funds for the Central Universities(Grant no 14380041)

References

[1] A S Tanenbaum and M Van Steen Distributed SystemsPrinciples and Paradigms Prentice-Hall Upper Saddle RiverNJ USA 2007

[2] L M Silva J Alonso and J Torres ldquoUsing virtualization toimprove software rejuvenationrdquo IEEE Transactions onComputers vol 58 no 11 pp 1525ndash1538 2009

[3] R Atat L Liu J Wu et al ldquoBig data meet cyber-physicalsystems a panoramic surveyrdquo 2018 httpsarxivorgabs181012399

[4] A Telesca F Carena W Carena et al ldquoSystem performancemonitoring of the alice data acquisition system with zabbixrdquoJournal Of Physics Conference Series vol 513 no 6 Article ID62046 2014

[5] D Liu Y Zhao H Xu et al ldquoOpprentice towards practicaland automatic anomaly detection through machine learningrdquoin Proceedings Of the Internet Measurement ConferenceTokyo Japan October 2015

[6] Y Chen R Mahajan B Sridharan and Z-L Zhang ldquoAprovider-side view of web search response timerdquo ACMSIGCOMM Computer Communication Review vol 43 no 4pp 243ndash254 2013

[7] V Chandola A Banerjee and V Kumar ldquoAnomaly detec-tion a surveyrdquo ACM Computing Surveys (CSUR) vol 41no 3 p 15 2009

[8] S Agrawal and J Agrawal ldquoSurvey on anomaly detectionusing data mining techniquesrdquo Procedia Computer Sciencevol 60 pp 708ndash713 2015

[9] X Xu H Liu and M Yao ldquoRecent progress of anomalydetectionrdquo Complexity vol 2019 Article ID 268637811 pages 2019

[10] J Wu S Guo J Li and D Zeng ldquoBig data meet greenchallenges big data toward green applicationsrdquo IEEE SystemsJournal vol 10 no 3 pp 888ndash900 2016

[11] S He Q Lin J Lou et al ldquoIdentifying impactful servicesystem problems via log analysisrdquo in Proceedings of the ACMJoint Meeting on European Software Engineering Conferenceand Symposium on the Foundations of Software EngineeringESECSIGSOFT Lake Buena Vista FL USA November 2018

[12] M Du F Li G Zheng et al ldquoDeeplog anomaly detection anddiagnosis from system logs through deep learningrdquo in Pro-ceedings of the ACM SIGSAC Conference on Computer andCommunications Security Dallas TX USA October 2017

[13] Q Lin J Lou H Zhang et al ldquoidice problem identificationfor emerging issuesrdquo in Proceedings of the 38th InternationalConference on Software Engineering ICSE Austin TX USAMay 2016

[14] A Brown A Tuor B Hutchinson et al ldquoRecurrent neuralnetwork attention mechanisms for interpretable system loganomaly detectionrdquo 2018 httpsarxivorgpdf180304967pdf

[15] N El-Sayed H Zhu and B Schroeder ldquoLearning from failureacross multiple clusters a trace-driven approach to under-standing predicting and mitigating job terminationsrdquo inProceedings of the 37th IEEE International Conference onDistributed Computing Systems ICDCS Atlanta GA USAJune 2017

[16] Q Lin K Hsieh Y Dang et al ldquoPredicting node failure incloud service systemsrdquo in Proceedings of the ACM JointMeeting on European Software Engineering Conference andSymposium on the Foundations of Software EngineeringESECSIGSOFT FSE Lake Buena Vista FL USA November2018

[17] H Xu W Chen N Zhao et al ldquoUnsupervised anomalydetection via variational auto-encoder for seasonal kpis in webapplicationsrdquo 2018 httpsarxivorgabs180203903

[18] N Zhao J Chen Z Wang et al ldquoReal-time incident pre-diction for online service systemsrdquo in Proceedings of the ESECFSE rsquo20 28th ACM Joint European Software EngineeringConference and Symposium on the Foundations of SoftwareEngineering Virtual Event USA November 2020

[19] J Liu J Zhu S He et al ldquoLogzip extracting hidden structuresvia iterative clustering for log compressionrdquo in Proceedings ofthe 34th IEEEACM International Conference on AutomatedSoftware Engineering San Diego CA USA November 2019

[20] Y Li Z M J Jiang H Li et al ldquoPredicting node failures in anultra-large-scale cloud computing platform an aiops solu-tionrdquo ACM Transactions on Software Engineering andMethodology vol 29 no 2 p 13 2020

[21] M Hejazi and Y P Singh ldquoOne-class support vector ma-chines approach to anomaly detectionrdquo Applied ArtificialIntelligence vol 27 no 5 pp 351ndash366 2013

[22] J Wang P Zhao and S C Hoi ldquoCost-sensitive onlineclassificationrdquo IEEE Transactions on Knowledge and DataEngineering vol 26 no 10 pp 2425ndash2438 2013

[23] S Ramaswamy R Rastogi and K Shim ldquoEfficient algorithmsfor mining outliers from large data setsrdquo ACM SIGMODRecord vol 29 no 2 pp 427ndash438 2000

[24] H Du S Zhao D Zhang et al ldquoNovel clustering-basedapproach for local outlier detectionrdquo in Proceedings of theIEEE Conference on Computer Communications WorkshopsSan Francisco CA USA April 2016

[25] M M Breunig H-P Kriegel R T Ng and J Sander ldquoLOFrdquoACM SIGMOD Record vol 29 no 2 pp 93ndash104 2000

[26] R Chalapathy and S Chawla ldquoDeep learning for anomalydetection a surveyrdquo 2019 httpsarxivorgabs190103407

[27] Q Yu L Jibin and L Jiang ldquoAn improved arima-based trafficanomaly detection algorithm for wireless sensor networksrdquo

8 Complexity

International Journal of Distributed Sensor Networks vol 12no 1 Article ID 9653230 2016

[28] S Arora and J W Taylor ldquoShort-term forecasting ofanomalous load using rule-based triple seasonal methodsrdquoIEEE Transactions on Power Systems vol 28 no 3pp 3235ndash3242 2013

[29] S Aditham N Ranganathan and S Katkoori ldquoLstm-basedmemory profiling for predicting data attacks in distributed bigdata systemsrdquo in Proceedings of the IEEE International Par-allel and Distributed Processing Symposium Workshops(IPDPSW) Orlando FL USA May 2017

[30] A Bernieri G Betta and C Liguori ldquoOn-line fault detectionand diagnosis obtained by implementing neural algorithmson a digital signal processorrdquo IEEE Transactions on Instru-mentation andMeasurement vol 45 no 5 pp 894ndash899 1996

[31] R P Adams and D J C Mackay ldquoBayesian onlinechangepoint detectionrdquo 2007 httpsarxivorgabs07103742

[32] M Schneider W Ertel and G Palm ldquoExpected similarityestimation for large scale anomaly detectionrdquo in Proceedingsof the International Joint Conference on Neural NetworksKillarney Ireland July 2015

[33] A Rahimi and B Recht ldquoWeighted sums of random kitchensinks replacing minimization with randomization in learn-ingrdquo in In Advances in Neural Information Processing Systems21 D Koller D Schuurmans Y Bengio et al Eds CurranAssociates Inc Red Hook JY USA 2009

[34] J Gama I Zliobaite A Bifet et al ldquoA survey on concept driftadaptationrdquo ACM Computing Surveys vol 46 no 4 pp 1ndash442014

[35] R Ding Q Fu J Lou et al ldquoMining historical issue repos-itories to heal large-scale online service systemsrdquo in Pro-ceedings of the 44th Annual IEEEIFIP InternationalConference on Dependable Systems and Networks AtlantaGA USA June 2014

[36] J-G Lou Q Lin R Ding Q Fu D Zhang and T XieldquoExperience report on applying software analytics in incidentmanagement of online servicerdquo Automated Software Engi-neering vol 24 no 4 pp 905ndash941 2017

[37] K Greff R K Srivastava J Koutnık B R Steunebrink andJ Schmidhuber ldquoLSTM a search space odysseyrdquo IEEETransactions on Neural Networks and Learning Systemsvol 28 no 10 pp 2222ndash2232 2017

[38] A Lavin and S Ahmad ldquoEvaluating real-time anomaly de-tection algorithmsndashthe numenta anomaly benchmarkrdquo inProceedings of the IEEE 14th International Conference onMachine Learning and Applications (ICMLA) Miami FLUSA December 2015

[39] V Ishimtsev A Bernstein E Burnaev et al ldquoConformal k-nnanomaly detector for univariate data streamsrdquo Conformal andProbabilistic Prediction and Applications vol 213ndash227 2017

Complexity 9

of a given component by using a real-time unsupervisedanomaly detector A higher score indicates a higheranomalous probability of the current state of the sequenceBased on anomaly scores the second layer employs randomforest one of the successful ensemble classification methodsto predict whether nodes will fail within a given timeinterval

+e main contributions of this study are as follows Firstit describes a real-time efficient and applicable method toidentify whether the components of distributed systemswould fail or not in the future based on the self-monitoringdata +e online method is essential for industrial big datascenario such as the Internet of +ings (IoT) in the future[10] We apply the methodology on the data provided by oneof the largest security companies in China According to atest on 6-day data the precision is over 088 and the recallcould be as high as 061 with proper forecasting interval time+e results demonstrate the predictability of our method Inthat this paper enriches the study on online failure pre-diction for distributed systems Second we take advantage ofensemble learning to leverage the performance of anomalydetector and give some IT operation suggestions from ourexplainable model for DevOps engineers Compared withresults using only one detector in the first layer ensemblelearning provides higher prediction accuracy and moreknowledge discovery

+e reminder of paper is organized as follows Section 2gives a brief review on anomaly detection and fault pre-diction A detailed description of the proposed predictionframework is presented in Section 3 followed by a case studywith discussion in section 4 Finally Section 5 includesseveral general conclusions and directions for future work

2 Literature Review

Many previous works focus on failure prediction and di-agnose based on text data such as logs reported issues andwarnings [11ndash14] However due to the irregularities of textdata these methods build complex models for betterperformance

Nowadays more researchers directly analyse temporalmonitoring data Nosayba El-Sayed et al identified relativefactors of unsuccessful jobs from log and then designed arandom forests framework learning form configurationmessage and monitoring metrics to predict the unsuc-cessful termination of a job [15] He et al proposed Log3C[11] a novel cascading clustering algorithm for log se-quences and then identified the impactful problems bycorrelating the clusters of log sequences with system KPIsLin et al introduced a failure prediction technique MING[16] which combines a LSTM model incorporating thetemporal data with a random forest model consumingspatial data Haowen Xu et al proposed Donut [17] anunsupervised anomaly detection algorithm based on VAEfor Seasonal KPIs Zhao et al proposed eWarn [18] whichextracts a set of features (including textual features andstatistical features) to train classification models (XGBoost)with alert However there are tens of TBs of logs andthousands of metrics per day for a large-scale distributed

system in reality meaning it is a great challenge to selectuseful features train and update a prediction model [19]Besides some black-box ML and DL models like LSTMandMING-based models is hard to explain the results +elack of decision interpretability is not convenient formaintenance work of DevOps engineers [20] +ereforethis paper calculate the monitoring time series anomalyscore in real-time which is automatic feature engineeringto build a random forest-based prediction model in thesecond layer

+ere is extensive work on anomaly detection methodsthat can be mainly categorized into classification-basedstatistics-based like distanced-based [21ndash24] density-based[25] nonlinear model based [26] and so on But most ofthese approaches are required the full dataset and not un-suitable for real-time streaming sequences For sequencedata the length increases as time goes Some methods are tomake one-step ahead forecast and defines the observationswith relative large prediction errors as anomalies +eprediction methods that have been adopted include auto-regressive integrated moving average (ARIMA) [27] ex-ponential smoothing [28] long short-term memory (LSTM)[29] and so on +e s-based methods like extreme stu-dentized deviate k-sigma median absolute deviation [30]and Bayesian online detection [31] are also applied for se-quence data with proper selection of time window Ouranomaly detection method is also this genre But whatdistinguishes is that we chose a kernel-based method whichmake no assumption about the data distribution and score sofast only in constant time

3 Prediction Framework

Let Xit (xi1 xi2 xit) i 1 2 p t 1 2 denotethe monitoring data for ith attribute of one certain nodeuntil time t+en themonitoring data for p attributes of onenode forms multidimensional sequences denoted by Xt and

Xt



⋮ ⋮ ⋱ ⋮

xp1 xp2 middot middot middot xpt

⎡⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎣

⎤⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎦

(1)

In this section we construct a two-layer predictionframework which transforms Xt into the classification ofwhether the node will fail within time interval (t t + H]where H is the prespecified length of time as is shown inFigure 1 +e tow-layer architecture consists of one anomalydetection layer and one classification layer

+e anomaly detection layer contains a real-timeanomaly detector EXPoSE [32] For each monitoring se-quence the detector generates anomaly score vectorSit (si1 si2 sit) i 1 2 p where sit quantifies theanomalous level of the ith sequence at time t +e higher theanomaly scores the higher the probability that the sequencesare anomalous +ese scores are then used as input for thesecond layer where a classificationmodel is built+e detailsof both layers are presented below

2 Complexity





μ Pt1113858 1113859 1113946X

k xt middot( 1113857dP xt( 1113857 (3)


1113954μ Pt1113858 1113859 1t

1113944

t

i1ϕ xt( 1113857 (4)


1113954ϕ xt( 1113857 1N









Detector


Input sequences


Complexity 3


wt 1l

1113944

t









k1pklog2 pk( 1113857 (9)



v1

Dv

DEnt Dv( 1113857 (10)



Gain(D a) (11)





4 Case Study








4 Complexity











precision TP

TP + FP

recall TP

TP + FN


(12)
















Complexity 5

10

08

06

04

02

00

Ano

mal

y sc

ore

2018

-11-

10

2018

-11-

11

2018

-11-

12

2018

-11-

13

2018

-11-

14

2018

-11-

15

2018

-11-

16

2018

-11-

17

2018

-11-

18

2018

-11-

19

Date

ASNUJMHLC

CLUDIOFailure alert




10

08

06

04

02

00

2018

-11-

10

2018

-11-

11

2018

-11-

12

2018

-11-

13

2018

-11-

14

2018

-11-

15

2018

-11-

16

2018

-11-

17

2018

-11-

18

2018

-11-

19

Date

ASNUJMHLC

CLUDIOFailure alert

Ano

mal

y sc

ore




6 Complexity










pt xtprime minus xt

11138681113868111386811138681113868111386811138681113868

xt

(13)





5 Conclusion










Complexity 7


Data Availability






Acknowledgments


References




























8 Complexity














Complexity 9





μ Pt1113858 1113859 1113946X

k xt middot( 1113857dP xt( 1113857 (3)


1113954μ Pt1113858 1113859 1t

1113944

t

i1ϕ xt( 1113857 (4)


1113954ϕ xt( 1113857 1N









Detector


Input sequences


Complexity 3


wt 1l

1113944

t









k1pklog2 pk( 1113857 (9)



v1

Dv

DEnt Dv( 1113857 (10)



Gain(D a) (11)





4 Case Study








4 Complexity











precision TP

TP + FP

recall TP

TP + FN


(12)
















Complexity 5

10

08

06

04

02

00

Ano

mal

y sc

ore

2018

-11-

10

2018

-11-

11

2018

-11-

12

2018

-11-

13

2018

-11-

14

2018

-11-

15

2018

-11-

16

2018

-11-

17

2018

-11-

18

2018

-11-

19

Date

ASNUJMHLC

CLUDIOFailure alert




10

08

06

04

02

00

2018

-11-

10

2018

-11-

11

2018

-11-

12

2018

-11-

13

2018

-11-

14

2018

-11-

15

2018

-11-

16

2018

-11-

17

2018

-11-

18

2018

-11-

19

Date

ASNUJMHLC

CLUDIOFailure alert

Ano

mal

y sc

ore




6 Complexity










pt xtprime minus xt

11138681113868111386811138681113868111386811138681113868

xt

(13)





5 Conclusion










Complexity 7


Data Availability






Acknowledgments


References




























8 Complexity














Complexity 9


wt 1l

1113944

t









k1pklog2 pk( 1113857 (9)



v1

Dv

DEnt Dv( 1113857 (10)



Gain(D a) (11)





4 Case Study








4 Complexity











precision TP

TP + FP

recall TP

TP + FN


(12)
















Complexity 5

10

08

06

04

02

00

Ano

mal

y sc

ore

2018

-11-

10

2018

-11-

11

2018

-11-

12

2018

-11-

13

2018

-11-

14

2018

-11-

15

2018

-11-

16

2018

-11-

17

2018

-11-

18

2018

-11-

19

Date

ASNUJMHLC

CLUDIOFailure alert




10

08

06

04

02

00

2018

-11-

10

2018

-11-

11

2018

-11-

12

2018

-11-

13

2018

-11-

14

2018

-11-

15

2018

-11-

16

2018

-11-

17

2018

-11-

18

2018

-11-

19

Date

ASNUJMHLC

CLUDIOFailure alert

Ano

mal

y sc

ore




6 Complexity










pt xtprime minus xt

11138681113868111386811138681113868111386811138681113868

xt

(13)





5 Conclusion










Complexity 7


Data Availability






Acknowledgments


References




























8 Complexity














Complexity 9











precision TP

TP + FP

recall TP

TP + FN


(12)
















Complexity 5

10

08

06

04

02

00

Ano

mal

y sc

ore

2018

-11-

10

2018

-11-

11

2018

-11-

12

2018

-11-

13

2018

-11-

14

2018

-11-

15

2018

-11-

16

2018

-11-

17

2018

-11-

18

2018

-11-

19

Date

ASNUJMHLC

CLUDIOFailure alert




10

08

06

04

02

00

2018

-11-

10

2018

-11-

11

2018

-11-

12

2018

-11-

13

2018

-11-

14

2018

-11-

15

2018

-11-

16

2018

-11-

17

2018

-11-

18

2018

-11-

19

Date

ASNUJMHLC

CLUDIOFailure alert

Ano

mal

y sc

ore




6 Complexity










pt xtprime minus xt

11138681113868111386811138681113868111386811138681113868

xt

(13)





5 Conclusion










Complexity 7


Data Availability






Acknowledgments


References




























8 Complexity














Complexity 9

10

08

06

04

02

00

Ano

mal

y sc

ore

2018

-11-

10

2018

-11-

11

2018

-11-

12

2018

-11-

13

2018

-11-

14

2018

-11-

15

2018

-11-

16

2018

-11-

17

2018

-11-

18

2018

-11-

19

Date

ASNUJMHLC

CLUDIOFailure alert




10

08

06

04

02

00

2018

-11-

10

2018

-11-

11

2018

-11-

12

2018

-11-

13

2018

-11-

14

2018

-11-

15

2018

-11-

16

2018

-11-

17

2018

-11-

18

2018

-11-

19

Date

ASNUJMHLC

CLUDIOFailure alert

Ano

mal

y sc

ore




6 Complexity










pt xtprime minus xt

11138681113868111386811138681113868111386811138681113868

xt

(13)





5 Conclusion










Complexity 7


Data Availability






Acknowledgments


References




























8 Complexity














Complexity 9










pt xtprime minus xt

11138681113868111386811138681113868111386811138681113868

xt

(13)





5 Conclusion










Complexity 7


Data Availability






Acknowledgments


References




























8 Complexity














Complexity 9


Data Availability






Acknowledgments


References




























8 Complexity














Complexity 9














Complexity 9

atwo-layerarchitectureforfailurepredictionbasedon high

Documents