

IEEE TRANSACTIONS ON JOURNAL NAME, MANUSCRIPT ID

Scalable Contextual Anomalous Event Stream Detection Using Direct Acyclic Graph Model

Bakhtiar Amen, Grigoris Antoniou, and Violeta Holmes

Abstract— Processing and mining high-volume, high-velocity data streams has become a challenging research task in the fields of big data and stream mining. In particular, detecting anomalies in high volumes of streaming data is a critical concern in most real-time applications, because an important anomalous event must be caught before it is dismissed or disregarded. In this paper, a novel distributed anomaly detection method and algorithm are proposed to detect contextual behaviours from sequences of events in real time. The proposed solution consists of three parts. First, event streams are captured and partitioned over several windows to control the high rate of arriving events. Second, a distributed detection algorithm is proposed to detect contextual anomalous events. Third, the experimental results are evaluated in terms of the algorithm's performance, high-throughput processing with low-latency response, and the accuracy of contextual anomaly detection. Appropriate computational metrics are proposed to measure and evaluate the processing latency of the distributed method. The empirical results demonstrate the effectiveness of the distributed contextual event stream detection, its high throughput, and its computational accuracy.

Index Terms—Contextual Detection, Anomaly Detection, Event Detection, Big Data.

—————————— ——————————

1 INTRODUCTION

The age of big digital data has emerged, and high volumes of data are being generated by many Internet of Things (IoT) and Internet of Everything (IoE) objects at a high velocity. Importantly, learning from and detecting events in these high volumes of continuously generated data plays an important role in many dynamic applications. In recent years, alongside the scalability and processing of infinite streams in real time, discovering hidden knowledge or predicting anomalous events has attracted the attention of researchers in the fields of machine learning, data stream mining and big data analytics (Duarte et al., 2016; De Francisci Morales, 2016; Bifet et al., 2016) (Grosse & Turin, 2012) (Zhang, 2013) (Amen & Lu, 2015; Candela et al., 2009; Grosse & Turin, 2012). Importantly, the meaning of the term “anomaly” differs from one discipline to another because of the nature of the problem. Anomaly, outlier and novelty are related terms, and their meaning depends on the application domain and on several factors: anomaly type, data object, and output result (Candela et al., 2009). For example, in network traffic monitoring, an anomaly is considered an intruder. Similarly, in the financial and banking industry, fraud and suspicious activity are considered an anomaly or event. Detecting a user's opinion or behaviour from inappropriate comments (e.g., racial and sexual abuse, arranging riot activities, online activities including terror and criminal threats) in social media data is associated with topic detection (Yu & Lan, 2016). Similarly, in other applications such as highway road traffic monitoring systems, airport surveillance, medical diagnosis, civil security, engineering and weather broadcasting, detecting and predicting unusual activities is treated as outlier or anomaly detection, since in each of these applications an unusual event is a critical concern from a safety perspective (Gupta et al., 2014) (Grosse & Turin, 2012).

Anomaly detection in big data stream analytics is challenging and plays an important role in many real-time applications. Traditional anomaly detection methods are mainly capable of detecting anomalies over:

xxxx-xxxx/0x/$xx.00 © 200x IEEE Published by the IEEE Computer Society

————————————————

Bakhtiar Amen is with the Department of Computer Science at the University of Liverpool, Liverpool, L69 3BX. E-mail: [email protected].

Grigoris Antoniou is with the Department of Computing and Informatics, University of Huddersfield, Huddersfield, HD1 3BZ. E-mail: [email protected].

Violeta Holmes is with the Department of Engineering and Technology, Huddersfield, HD1 3BZ. E-mail: [email protected]


a) a limited amount of static (batch) data, based on offline learning rather than online learning from streaming data;

b) centralised detection with limited computing resources rather than distributed detection, even though distributed data processing offers higher throughput, shorter processing time and lower-latency response;

c) either point or collective anomaly types rather than contextual ones (João Duarte, 2014) (Grosse & Turin, 2012; Gupta et al., 2014; Ma et al., 2016).

Contextual event detection is significantly important in real-world applications to define an event condition caused by either a system or people (see Section 2 for a description of contextual anomalies) (ref). Learning and detecting from streaming data in real time therefore requires a very low-latency response and an adaptive algorithm, so that an event is handled before it is disregarded or becomes irrelevant. For example, an anomalous event that occurred at 6:00 am is irrelevant if it is only reported at 7:00 am. This paper focuses on detecting contextual anomalous events from high volumes of streams through a parallel and distributed paradigm. We propose a novel distributed contextual anomalous event stream detection algorithm that adopts online learning and controls the velocity of the streams through window partitions. The experimental results are evaluated in terms of the algorithm's performance, low-latency processing response, and the accuracy of detecting contextual anomalous behaviour in high volumes of event streams.

The remainder of this paper is organised as follows. Section 2 describes anomaly detection characteristics and methods. Research related to distributed anomaly detection is presented in Section 3. Section 4 describes the research problem, with contextual event stream notations and definitions. The contextual anomalous event stream DAG model is described in Section 5. Section 6 presents the empirical results, discussion and evaluation of the proposed method and algorithm, followed by the conclusion and future work in Section 7.

2 ANOMALY DETECTION

Anomaly detection is categorised into three types, point (A), collective (B) and contextual (C), as depicted in Fig. 1 for the highway road traffic stream domain. Scenario A is a point anomaly, since a single event in the stream sequence occurs at t5. Point anomaly detection is one of the most common approaches in many application domains. In Scenario B, several events occur in the stream sequence, caused by a group of vehicles exceeding a speed of 140 mph from t4 to t6; these events are considered a collective anomaly. Lastly, a contextual anomaly is defined as the behaviour of the data in a specific context. In other words, [1] and [2] argued that, in streaming applications, a contextual anomaly refers to the attributes which determine the context of the stream. Similarly, [26] argued that a contextual anomaly is associated with an event whose occurrence time falls in a different context. Scenario C can be considered a contextual anomaly, since a similar event occurs in a different context or time. For example, a speed of 140 mph observed at 8:00 am in t3 occurs again in a different context at 11:00 pm. In streaming applications, a contextual anomaly can be defined as an event value together with its occurrence time as attributes in the stream sequence (Candela et al., 2009), (Duarte et al., 2016).

The context of an event plays an important role in many real-time applications in defining the cause and occurrence time of the event. In other words, people or systems can behave normally in most situations but differently in another context. Consider shopping behaviour in December: during the Christmas period, people are likely to spend more money on shopping than in January. In the financial and banking sector, spending $1000 in December is considered normal, while spending a similar amount in January is considered an anomaly.
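The December and January example can be made concrete with a small sketch. It assumes a hypothetical per-context baseline (the mean and standard deviation of past spending in each month) and a hypothetical cut-off of k standard deviations; neither the numbers nor the decision rule are taken from the paper.

```python
from statistics import mean, pstdev

def is_contextual_anomaly(value, context, history, k=3.0):
    """Flag a value as anomalous only relative to past values from the same context."""
    past = history[context]
    mu, sigma = mean(past), pstdev(past)
    return abs(value - mu) > k * sigma

# Hypothetical monthly spending history, for illustration only.
history = {
    "December": [900, 1000, 1100, 950, 1050],
    "January": [200, 250, 300, 220, 280],
}

print(is_contextual_anomaly(1000, "December", history))  # same amount, normal context
print(is_contextual_anomaly(1000, "January", history))   # same amount, unusual context
```

The same value is judged twice, and only the context changes the outcome, which is what separates a contextual anomaly from a point anomaly.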

Fig. 1. Anomaly detection types for temporal data.

In other situations such as online shopping, contextual anomaly detection can be used to detect online customers' behaviour, since customers' shopping behaviour (interest) changes from one season to another (Jiang et al., 2014).

Contextual anomaly detection methods have been proposed to predict stock market shares (Golmohammadi & Zaiane, 2015), behaviours between different groups of users in social networks (Akcora et al., 2014), sensor network patterns (Hayes & Capretz, 2015), and text data and semantic analysis (Mahapatra et al., 2012). Several studies, including [3] and [4], stated that contextual anomaly detection is the most appropriate method for detecting contextual behaviour from streaming data. However, research on contextual anomaly detection from streaming data is limited, since the majority of the aforementioned studies focus on either point (ref) or collective (ref) anomalies rather than contextual anomaly detection.

3 RELATED WORK

Many research studies have been proposed to find or predict anomalies from static data in an offline fashion. Data-based and task-based methods are the most widely proposed methods in the fields of machine learning and data stream mining [5]. Data-based methods rely primarily on learning from a subset of the data for further analysis. Sketching [6], load shedding [7], sampling [7], synopsis [8] and aggregation [9] are examples of this approach. In contrast, task-based methods either develop new methods or enhance existing ones; window algorithms and approximation algorithms are the most common examples. The disadvantage of data-based techniques is that when data arrives continuously at a very high rate, as streams do, intelligent actionable decisions are required before the event stream is discarded or neglected. These techniques are therefore more appropriate for static learning than for streaming data [10]. In recent years, distributed data mining has emerged as an alternative solution to address some of the drawbacks of data-based and task-based methods [11]. Task-based methods are thus highlighted as appropriate for many machine learning and data mining tasks.

Classification methods based on either one-class or multi-class data labels have been proposed in (Perkins 2003), OReilly et al. (2013), (Schneider, Ertel et al., 2016), (Hoens, Polikar et al., 2012) to train a model to classify anomalies or outliers in a dataset. These methods mainly train the model on either normal or abnormal class labels (Aggarwal 2007). A major limitation of the supervised learning approach is its dependence on class labels, which is a major drawback for streaming data. To train any classifier model, prior knowledge of the data label is required (x, y) (Faria et al., 2016). Chandola, Banerjee et al. (2009) argued that manual labelling is a very time-consuming, complex and expensive process. In recent years, an alternative approach of scoring data instances or window partitions based on the data records has been proposed. Scoring techniques over sequential data are widely studied in (Chandola, Banerjee et al., 2009; Chandola, Banerjee et al. 2012; Zhang, 2013). Several research studies, including Aggarwal (2016), Zhang (2013), and Faria et al. (2013), argued that accurate anomaly scoring results are more achievable than outlier labelling results in many applications. This is due to the clear understanding of the anomaly objectives in each of these application domains.

The most common classification methods are tree-based decision methods (e.g., bagging and boosting decision trees, random forest, C4.5 decision tree and boosted stumps), rule-based methods, Support Vector Machines (SVM), and Neural Networks (NN) (Chandola, Banerjee et al., 2009). A tree-based algorithm interprets data through a learning process of hierarchical partitioning, and each partition within the tree acts as an independent node. The tree procedure is based on the common assumption of top-down learning, where the tree develops from the root at the top down to the leaves.

Unsupervised learning algorithms are argued in (ref) to be another appropriate method for detecting event streams. Clustering is one of the powerful alternative meta-learning techniques for analysing high volumes of data created by advanced applications. Clustering methods such as partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods (Amini et al., 2014; Yang & Fong, 2015) and (Fahad, Alshatri et al., 2014) have been extensively studied and applied to many kinds of stream mining data, such as micro-blogging (Lee & Chien, 2013) and web analytics (Facca & Lanzi, 2005).

(Zhang et al., 1997) proposed a distributed clustering algorithm called Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH). The main data structure of this algorithm is based on the Cluster Feature (CF) concept and the CF-tree method, which summarise the data streams into a CF data structure. BIRCH splits the leaf nodes of the CF-tree, and any CF vector with low density is considered to be an outlier or anomaly. A CF is constructed from the d-dimensional data points in the cluster. For a cluster of data points {X_i}, where i = 1, 2, 3, ..., N, splitting is based on the CF vector of the cluster, and the splitting criterion depends mainly on the CF triple according to the cluster measurements: centroid, radius, and diameter.
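As an illustration of the CF summary, the sketch below uses the usual BIRCH triple (N, LS, SS), where LS is the per-dimension linear sum and SS the sum of squared norms of the points. The triple is not spelled out in the text above, so it is stated here as an assumption, and the class and method names are hypothetical.

```python
import math

class ClusterFeature:
    """CF triple (N, LS, SS) for d-dimensional points."""

    def __init__(self, dim):
        self.n = 0
        self.ls = [0.0] * dim   # linear sum per dimension
        self.ss = 0.0           # sum of squared norms

    def add(self, point):
        # The summary is updated incrementally; raw points are not stored.
        self.n += 1
        self.ls = [a + b for a, b in zip(self.ls, point)]
        self.ss += sum(x * x for x in point)

    def centroid(self):
        return [a / self.n for a in self.ls]

    def radius(self):
        # Root-mean-square distance of the points from the centroid.
        c2 = sum(c * c for c in self.centroid())
        return math.sqrt(max(self.ss / self.n - c2, 0.0))

cf = ClusterFeature(dim=2)
for p in [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]:
    cf.add(p)
print(cf.centroid(), cf.radius())
```

Because the triple is additive, centroid and radius can be recomputed after every arriving point without revisiting earlier data, which is what makes the summary suitable for streams.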


The CF concept is also used in another distributed clustering algorithm called DenStream (Charu C. Aggarwal 2003). DenStream is a density-based algorithm. Similar to BIRCH, DenStream uses the CF data structure, with two additional parameters: p-microclusters and o-microclusters. Based on Tp, DenStream checks the p-microclusters to find possible outliers among the o-microclusters. A detailed extension of the DenStream algorithm is described in (Feng Cao 2006).

The CluStream algorithm is an extension of the microcluster concept, and its data structure is based on two learning phases, online and offline. First, the statistical summary of the data stream is stored in memory and maintained by microclusters; then the summary captured in the online phase can be trained and tested in an offline phase. The algorithm computes the maximum boundary of a microcluster from the standard deviation of the distances of its points from the cluster centroid, scaled by a factor f. Any stream instance outside the microcluster boundary is considered an outlier. This is determined by comparing a newly arrived data stream instance with its two nearest microclusters based on their measured distance.
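The boundary test can be sketched as follows. This is a simplified illustration rather than the CluStream implementation: it checks a new instance against a single microcluster instead of the two nearest ones, it keeps the raw points instead of a statistical summary, and the value f = 2.0 is a hypothetical choice.

```python
import math

def distance(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rms_deviation(points, centroid):
    """Root-mean-square distance of the points from the centroid."""
    return math.sqrt(sum(distance(p, centroid) ** 2 for p in points) / len(points))

def is_outlier(x, points, f=2.0):
    """True if x lies beyond f times the RMS deviation from the centroid."""
    dim = len(points[0])
    centroid = [sum(p[i] for p in points) / len(points) for i in range(dim)]
    boundary = f * rms_deviation(points, centroid)
    return distance(x, centroid) > boundary

cluster = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
print(is_outlier((1.5, 1.5), cluster))  # inside the boundary
print(is_outlier((6.0, 6.0), cluster))  # outside the boundary
```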

Chandola, Banerjee et al. (2009) argued that clustering-based methods are mainly appropriate for organising data into groups of data instances rather than for finding or detecting anomalies. For example, in dynamic application scenarios, it is impractical to group k samples of streaming data (Erfani et al., 2016). In distributed clustering, including DenStream and CluStream, the learning process is mainly based on the CF data structure with online and offline phases. (Schneider, Ertel et al., 2016) argued that the disadvantages of clustering are the computation of the outlier score and the complexity of computing the distances between the k nearest neighbours.

4 PROBLEM NOTATIONS

4.1 Data Stream Model

A stream is a sequence of elements, items, instances, or objects. In this section, we describe the basic stream notation and stream model, the event stream definitions, and the contextual notations.

Definition 1 (Tuple). The data structure of a stream is called a tuple: a list of data with attribute-value pairs in the form of a ⟨x, t⟩ schema with an explicit timestamp t.

Definition 2 (Time). Time is associated with the occurrence of the events in the stream. In streaming data, time is very important, because an anomalous event must be handled before it becomes irrelevant or is disregarded.

Definition 3 (Stream). A stream S is an unbounded sequence of tuples ordered by timestamp t, where the timestamps can be denoted as {t1, t2, …, tn}, as shown in Equation 1.

$$S = \{s_1, s_2, \ldots, s_n\} \tag{1}$$

In some real-world applications, a stream is represented by only a single symbol without any data attribute or value, such as “s1” and “s2”, or sensor signals from s1 → s2. In the data-driven paradigm, an event must comprise a data type and a value from which the event can be constructed. Thus, in this research, we designed the event model based on the key-value behaviour of the sequence of stream tuples.

4.2 Event Stream Model

Definition 4 (Event Tuple). An anomaly in this research is denoted as an event e based on the tuple (ei, ti), where ti is the occurrence time of the event, t ∈ T.

Definition 5 (Event Stream). An event stream S can be denoted as a sequence of events {e1, e2, .., ei}, each with a tuple record, as denoted in Equation 2.

$$S = \{(e_1, t_1), (e_2, t_2), \ldots, (e_n, t_n)\} \tag{2}$$

Here Si consists of an unbounded number of event tuples. To process these event streams in a parallel paradigm, the key-value pair is one of the most appropriate ways to partition data over different nodes based on their key and value attributes. In this context, each event can be constructed from a <key, value> pair as a ⟨k, [v, t]⟩ tuple in the sequence. The key (k) is the event order number, (v) is the value of the event, and (t) is the timestamp at which the event occurred. Consider a road traffic stream scenario in which a tuple can be defined as <e1, [v1, t1]>, as in Fig. 2.
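A minimal sketch of the ⟨k, [v, t]⟩ tuple and of key-based partitioning is given below. The `Event` type, the `partition_by_key` helper and the modulo assignment of keys to nodes are assumptions made for illustration, and the speed values are hypothetical.

```python
from collections import defaultdict, namedtuple

# <k, [v, t]>: k is the event order number, v the value, t the timestamp.
Event = namedtuple("Event", ["k", "v", "t"])

def partition_by_key(events, num_nodes):
    """Assign each event to a node index derived from its key."""
    nodes = defaultdict(list)
    for e in events:
        nodes[e.k % num_nodes].append(e)
    return dict(nodes)

events = [Event(1, 115, "07:00"), Event(2, 120, "07:05"),
          Event(3, 140, "21:00"), Event(4, 138, "21:10")]
parts = partition_by_key(events, num_nodes=2)
print(parts)
```

Events with the same key always land on the same node, so each node can process its share of the stream independently of the others.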

Definition 6 (Event Time Order). Time ordering in stream processing plays a significant role in distinguishing between the implicit and explicit timestamps of events. In many real-world applications, several events can occur together; thus, the composition ∪ of two events can be constructed from the time-based sequence of tuples in event stream processing, as denoted in Equation 3.

$$S_i(e_1, e_2) \rightarrow (e_1, t_1) \wedge (e_2, t_2) \wedge t_1 \le t_2 \wedge e_1, e_2 \in w \tag{3}$$

Here event e1 at t1 arrives before event e2 at t2, according to the order of their tuple timestamps. This ordering can be achieved during the window partition process. In this context, event streams are represented as a list of finite sequences of events with timestamps, where each event e consists of a value and a timestamp, as depicted in Fig. 2. The main benefit of event time order is that it identifies the time at which an event occurred and gives the computational results a semantic meaning. It also protects events from being dismissed or disregarded during the processing and mining phases.
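The interplay of window partitioning and event time order can be sketched as follows. The example assumes fixed-width, non-overlapping (tumbling) windows of 60 minutes, which is a simplification chosen for illustration; the function name and the sample values are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

def to_windows(events, width_minutes=60):
    """Group (value, 'HH:MM') events into tumbling windows, ordered by time."""
    windows = defaultdict(list)
    for value, stamp in events:
        t = datetime.strptime(stamp, "%H:%M")
        minutes = t.hour * 60 + t.minute
        windows[minutes // width_minutes].append((minutes, value))
    # Inside each window, enforce t1 <= t2 as in Eq. (3).
    return {w: [v for _, v in sorted(evts)] for w, evts in sorted(windows.items())}

# Events arrive out of order; the window restores their time order.
stream = [(120, "07:05"), (115, "07:00"), (138, "21:10"), (140, "21:00")]
print(to_windows(stream))
```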

4.3 Contextual Event Stream Model

A contextual event can be defined as an anomalous event in the temporal sequence of the stream for each window partition. Such behaviour is referred to as sequential analysis in statistics. Since event streams are sequentially ordered, a contextual event stream can be denoted as a tuple with a different value in a different context or time.

Definition 8 (Contextual Event Stream). A contextual anomaly, denoted CA, is the set of anomalous event stream tuples per window partition w.

Example 4. Consider events from Si based on Definition 2, where each event is constructed in the tuple format ⟨k, [v, t]⟩. The first element of the tuple is the event number in S1 (e.g., e1), followed by the normal value of the event (a speed event) and the timestamp at which the event arrived.

Fig. 2. Sequence of event stream tuples.

Fig. 3. Contextual event stream tuples per window in different contexts, where w1 represents the partition of event tuples at t = am and a similar w1 represents the partition at t = pm.

The window partition can be used to collect the events from the sensor streams within a specified time interval T. As illustrated in Fig. 3, four events have occurred in S1, where each event record consists of the event number, the event value and the time of occurrence. Suppose e1 is an event which occurred at 7:00 am with a recorded value of 115 mph. Similar events e1 and e2 from 21:00 to 21:10 are considered contextual anomalous events based on CA, as denoted in Equation 4.

$$CA = \{e_i, v_i, |w|\} \tag{4}$$

Here ei is an event in CA per window partition (wt+1, wt+2, ...), and vi is the score value of the event stream in Si, based on the Anomaly Score (AScore) described in the next section.

Definition 9 (Contextual Event Score). The output for an anomalous event is the result computed over the event streams by the proposed contextual algorithm. This is achieved by using a regression rule model to score the probability of a contextual event in the temporal sequence.

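$$\mathrm{AScore} = \frac{1}{n}\sum_{i=1}^{n} \log\left(\frac{P(e_i = v \mid r)}{1 - P(e_i = v \mid r)}\right) \tag{5}$$

$$= \frac{1}{n}\sum_{i=1}^{n} \Big[\log P(e_i = v \mid r) - \log\big(1 - P(e_i = v \mid r)\big)\Big] \tag{6}$$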

Each term of the score is expected to be negative when the probability of the given event stream value v for event attribute ei in Si satisfies P(ei = v|r) < 0.5; in contrast, when P(ei = v|r) > 0.5, the term is expected to be positive, because the logarithm of the odds P/(1 − P) changes sign at P = 0.5. The rule r is computed from the event sequence for the purpose of capturing contextual behaviour. Algorithm 3 describes the Contextual Event Stream Anomaly (CESA) process; the idea of the rule set structure is used in many data stream research studies, including [12] (ref). The AScore computation is based on computing the probability of the tuple value.
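The AScore of Eqs. (5) and (6) can be sketched as a mean log-odds over the n events of a window. The probabilities P(ei = v|r) are taken as given inputs here, since how the regression rule model estimates them is not specified at this point; the function name is hypothetical, and every probability is assumed to lie strictly between 0 and 1.

```python
import math

def ascore(probs):
    """Mean log-odds of P(e_i = v | r) over the n events of a window."""
    n = len(probs)
    # Eq. (6): log P - log(1 - P), averaged over the events.
    return sum(math.log(p) - math.log(1.0 - p) for p in probs) / n

print(ascore([0.5, 0.5]))      # log-odds of 0.5 is zero
print(ascore([0.9, 0.8]) > 0)  # probabilities above 0.5 give a positive score
print(ascore([0.1, 0.2]) < 0)  # probabilities below 0.5 give a negative score
```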

5 CONTEXTUAL ANOMALY DETECTION USING DAG MODEL

Understanding the concept of parallelism is necessary as the basis of the solution proposed in this section. This section describes the method designed for high-throughput parallel processing, partitioning and detection of contextual event streams.

Data subset division and task parallelism are the core methods of distributed data processing [11]. To address the scalability of data processing and mining, several distributed big data stream processing models have emerged. The Directed Acyclic Graph (DAG) data structure model (ref) is one of these models. A DAG is a finite directed graph with no cycles; the concept, treated with regard to equivalence, was first formalised by Pearl in 1988 (ref). This approach is very beneficial for detecting anomalous events from high volumes of streams in parallel.

DAG is similar to the MapReduce (MR) data structure model for mapping, shuffling and reducing batch data. In a DAG, streaming data can be mapped by shuffle grouping, field grouping or all grouping functions, while the reducing phase is based on several operation functions such as sum, average, filter, min and max. The benefit of the DAG model is its capability to handle both task and data parallelism. In this context, we designed Distributed Contextual Event Detection (DCED) based on the DAG data structure model. DCED comprises three modules: pre-processing, event matching and contextual detection, as depicted in Fig. 4.
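The three grouping functions and a reduce step can be imitated in plain Python. This is only a sketch of the routing idea, not the API of any stream processing framework: round-robin stands in for random shuffling, a modulo on an integer field stands in for hashing, and the function names and (sensor_id, speed) tuples are hypothetical.

```python
from itertools import cycle

def shuffle_grouping(tuples, num_tasks):
    """Spread tuples evenly over tasks (round-robin stands in for random shuffle)."""
    tasks = [[] for _ in range(num_tasks)]
    for task, tup in zip(cycle(range(num_tasks)), tuples):
        tasks[task].append(tup)
    return tasks

def fields_grouping(tuples, num_tasks, field):
    """Send tuples with the same field value to the same task."""
    tasks = [[] for _ in range(num_tasks)]
    for tup in tuples:
        tasks[tup[field] % num_tasks].append(tup)
    return tasks

def all_grouping(tuples, num_tasks):
    """Replicate every tuple to every task."""
    return [list(tuples) for _ in range(num_tasks)]

# Hypothetical (sensor_id, speed) tuples.
data = [(1, 115), (2, 120), (1, 140), (2, 138)]
by_sensor = fields_grouping(data, num_tasks=2, field=0)
# Reduce step: one aggregate per task, here the maximum speed.
print([max(v for _, v in task) for task in by_sensor])
```

Field grouping keeps all tuples of one sensor on one task, so per-sensor aggregates such as min, max or average can be computed locally; shuffle grouping balances load when no such locality is needed.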

Fig. 4. Contextual event stream anomaly using DAG model.

5.1 Pre-Processing Node

Event stream tuples usually arrive at a very high rate, and in most conditions it is difficult to process or store the complete data stream. An alternative solution is to use a window partition w to capture and organise events {e1, e2, .., ei} in window slides. In this module, the window partition must be defined before the final aggregation. The task of a Pre-processing (P) node is to read event streams and convert them into key-value event tuples (e.g., event number, value and time). Tuple records with missing event values are removed to reduce the number of event streams with null records; this guards against changes in the event stream data distribution (concept drift).

The other task of the P nodes is to scan and determine the high and low tuple values, compute the probability rates of the anomalous event score, p(w = i|e < 1) or p(w = i|e > 0), and emit the event to the next node. The result can then be tested by comparing arrived events with the events captured in each window partition.

5.2 Event Matching Node

Matching (M) event streams is an important step of the DAG in DCED. The task of each matching node is to split and shuffle event streams according to their tuple values in each window partition, and to allocate event streams across the window partitions using one of the grouping functions. For example, event streams are matched and filtered against other window partitions based on the tuple values v of ei (high, low, max or min) using the rule r threshold, P(ei = v|r) > 0.5 or P(ei = v|r) < 0.5. In this node, new event stream tuple values are filtered using two aggregation functions, (newmin) for event tuple values < threshold and (newmax) for event tuple values > threshold, as follows:

$$e_i = \frac{e - \min_e}{\max_e - \min_e}\,(\mathrm{newmax}_e - \mathrm{newmin}_e) \tag{7}$$
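The rescaling of Eq. (7), followed by a threshold split, can be sketched as below. The sketch follows the equation as printed, which has no additive newmin offset, so it coincides with ordinary min-max normalisation only when newmin is 0. The function names, the sample speeds and the threshold of 0.5 are hypothetical, and the window is assumed to contain at least two distinct values.

```python
def rescale(values, new_min, new_max):
    """Apply Eq. (7) to every value of a window."""
    lo, hi = min(values), max(values)
    return [(e - lo) / (hi - lo) * (new_max - new_min) for e in values]

def split_by_threshold(values, threshold):
    """Separate the values below the threshold from those above it."""
    low = [v for v in values if v < threshold]
    high = [v for v in values if v > threshold]
    return low, high

speeds = [100, 115, 130, 140]  # hypothetical window of speed values
scaled = rescale(speeds, new_min=0.0, new_max=1.0)
print(scaled)
print(split_by_threshold(scaled, threshold=0.5))
```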

5.3 Contextual Detection Node

The main task of the CA node is to detect the number of contextual anomalous events from each of the n nodes. The contextual event is based on the AScore model, computed according to the sequence number, value and timestamp of the defined event stream.

6. foreach r do
7.   compute r using Eq. (6)
8.   if ei > r ∈ RS
9.   then
10.    CA ← ei validate r
11.    RS update AScore {0,1}
12.  else
13.    if ei < r ∈ RS then
14.      CA ← ei validate r

For example, a candidate contextual event is defined based on the context of the event in each window. This is achieved by grouping events and detecting such events based on the DAG topology in Distributed Stream Processing (DSP). The DAG model supports rule-based learning based on IF-THEN rules, as depicted in Fig. 5. This is based on defining a set of rules = {r1,…, rn} that satisfy the rule condition r ∈ {0,1}. The rule-based assumption for e1 can be expressed according to the following conditions.

Fig. 5. Rule-based CA condition graph in DAG.

Consider events arriving from a number of streams, where each stream S consists of events {e1, e2, .., ei}.

Algorithm 3 computes the probability of the event score in the CA DAG node. The algorithm starts with an empty rule set r = { } and then identifies whether the event stream is covered by the CA model based on its AScore. For every event stream ei in S, each rule set must be checked and computed based on Equation (5). If the probability of any event stream value changes according to the context attribute ei, the rule set can be removed and the value of the event stream within the CA can be updated.

On the other hand, if the AScore {0,1} rule set condition is not satisfied for an event stream tuple in S, RS updates the AScore based on the event context value in the event sequence. Conversely, if ei is covered by the rule sets but is not considered a CA, the PH test computes the error e based on the magnitude of change α and updates r (see section for error computation). Thus, if event stream ei is covered by RS according to the CA, the algorithm returns CA values based on the rule set condition threshold. For example, the output of the CESA is either 0 or 1: 0 is associated with an event at t1 = 10:00 am, and 1 (a CA) with an event at t2 = 23:00.
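The binary rule check described in this section can be sketched as follows. This is only one possible reading: the Event and Rule types, the two thresholds, and the use of the hour of day as the time context are hypothetical choices for illustration and are not taken from Algorithm 3.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    seq: int       # sequence number in the stream
    value: float   # tuple value, e.g. speed or temperature
    hour: int      # hour of day (0-23) taken from the timestamp


@dataclass(frozen=True)
class Rule:
    value_threshold: float  # hypothetical threshold on the tuple value
    hour_threshold: int     # hypothetical threshold on the hour of day


def ascore(event, rule):
    """Return 1 if both the value and the hour exceed the rule, else 0."""
    return int(event.value > rule.value_threshold
               and event.hour > rule.hour_threshold)


def detect(events, rules):
    """Return the events that at least one rule scores as 1."""
    return [e for e in events if any(ascore(e, r) for r in rules)]
```

With a rule such as Rule(50.0, 22), two events carrying the same value at 10:00 and at 23:00 are scored 0 and 1 respectively, which mirrors the example given above.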

6 EMPIRICAL EVALUATION AND DISCUSSION

This section describes the experimental environment, results, and performance evaluation for the proposed algorithms and method.

6.1 Experimental Environment

The proposed algorithm is implemented in Java on the distributed stream processing framework Apache Storm (ref). Storm is a real-time stream processing framework capable of processing one million tuples per second per computer node. We therefore designed the DAG topology to run in parallel. The main task of the topology in Storm is to assign tasks to the operators and to manage the distribution of data over the nodes. The experiments were deployed on the University of Huddersfield’s commodity cluster of eight computer nodes: one computer node acted as the master node and seven as slave nodes. Each computer node is equipped with 8 GB of RAM and an Intel(R) Core(TM) 2 Quad CPU Q8400. All nodes run Ubuntu Xenial (v16.04.1 LTS), Java(TM) SE Runtime Environment (build 1.8.0_10), and Java HotSpot(TM) 64-Bit Server VM. Each computer node is divided into 4 workers.

The rule conditions for e1 in the CA node (Section 5.3) are:

IF e1 AND t1 > r THEN r ∈ {0, 1} = 1

IF e1 AND t1 < r THEN r ∈ {0, 1} = 0

6.2 Data Sources

Two IoT sensor streams were used, from road traffic and building temperature applications. The data streams were filtered according to the rule set for anomalous event stream tuple records; for example, only event stream tuples with (high, low, max, min) record values were emitted to the DAG model to compute the contextual behaviour. The main advantage of this approach is that it removes irrelevant values and so reduces the size of the event streams. Sensor streams (S1, S2, and S3) were mapped and filtered by fieldGrouping (e.g., speed, temperature) in the DAG model. The size of the event streams was reduced from 210 million sensor readings to approximately 1,0335,000 tuples.
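The filtering and field-based routing described above can be sketched as follows. This is an illustrative stand-in written in plain Python and not the Storm API: the record layout, the label names, and the CRC32-based task assignment are assumptions for this example.

```python
import zlib

# Record labels that are emitted to the DAG; everything else is dropped.
KEPT_LABELS = {"high", "low", "max", "min"}


def keep(record):
    """Keep only tuples whose label marks a candidate anomalous reading."""
    return record["label"] in KEPT_LABELS


def task_for(field_value, num_tasks):
    """Map a field value to a task index so equal values share a task."""
    return zlib.crc32(field_value.encode("utf-8")) % num_tasks


def route(records, field, num_tasks):
    """Filter the records, then group them by the chosen field."""
    buckets = {i: [] for i in range(num_tasks)}
    for rec in records:
        if keep(rec):
            buckets[task_for(rec[field], num_tasks)].append(rec)
    return buckets
```

The design point is that all tuples sharing the same field value reach the same task, so per-sensor state can be kept locally on one worker.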

6.3 Results and Discussion

To evaluate the contextual anomalous event scoring results, the CESA algorithm is evaluated on the probability of each event over 10,000 training tuples. This is approximately 10% of the size of each window partition, while the remaining 90% of the event stream tuples are used to test the CA model. The algorithm's learning procedure depends on the DAG model shuffling event streams over each window partition to measure the estimated error.
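The partition-then-split procedure described above can be sketched as below. The helper names are hypothetical, and taking the first 10% of each window as the training portion is an assumption, since the text does not say how the training tuples are selected.

```python
def windows(stream, size):
    """Yield consecutive, non-overlapping window partitions of `size` tuples.

    The final partition may be shorter when the stream length is not a
    multiple of `size`.
    """
    buf = []
    for item in stream:
        buf.append(item)
        if len(buf) == size:
            yield buf
            buf = []
    if buf:
        yield buf


def split_window(window, train_fraction=0.1):
    """Split one window into a leading training part and a testing part."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    cut = round(len(window) * train_fraction)
    return window[:cut], window[cut:]
```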

6.4 CESA Algorithm Performance

The evaluation of the CESA algorithm's performance is based on the following factors:

Size of event streams.

Detection accuracy.

One of the most prominent aspects of the proposed CESA algorithm is its capability to learn to detect contextual anomalous events in real time. The experimental evaluation of the CESA algorithm depends mainly on the size of the event streams under test. The processing time of CESA over n event streams, which is O(n), is depicted in Fig. 6. On the other hand, processing a smaller number of event stream values per window partition affects the computational complexity, which is O (k log n). Thus, a high throughput of event streams is not guaranteed to lower the processing time.

Fig. 6. CESA detection performance per node on the cluster

As shown in Fig. 6, increasing the number of nodes from 1 to 8 reduces the processing time by approximately 50% through parallel processing. Over 8 nodes, the computational processing time improves significantly, from 65 to 45 milliseconds, when processing and detecting high volumes of event streams. This is an ideal approach for reducing the overhead on each computer node as the size of the event streams is scaled up.

6.5 Scalability Evaluation Result

The scalability of processing event streams can have a major impact on the computational performance of the proposed algorithm and DAG model.

In this context, the performance is measured based on:

Event stream size: the event stream size threshold is set to t = 100,000 stream tuples per node to evaluate the effectiveness of the computational performance.

Number of window partitions: used to assess the impact on the computational results and on the processing runtime performance.

Scalability: increasing the number of computer nodes is expected to reduce the processing runtime. This assumption is expressed as follows:

$$p = \frac{\mathrm{processingTime}_{1\,\mathrm{node}}}{\mathrm{processingTime}_{N\,\mathrm{nodes}} \times N} \tag{8}$$

Here p refers to the processing runtime performance of each of the N nodes in the cluster. This is a common metric for measuring the parallel performance of the runtime detection process. The impact of scaling the event streams on the processing runtime, for both the centralised and the distributed approaches, is depicted in Fig. 7. The results indicate that the detection runtime increases linearly as the event streams are scaled up.
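A short worked example may help in reading Eq. (8). The sketch below assumes that p is the usual parallel efficiency, i.e. the single-node runtime divided by N times the N-node runtime; this interpretation and the function names are assumptions, and the 65 ms and 45 ms inputs mentioned afterwards are taken from the Fig. 6 discussion purely as illustrative values.

```python
def speedup(t_single_ms, t_parallel_ms):
    """Ratio of the single-node runtime to the N-node runtime."""
    if t_parallel_ms <= 0:
        raise ValueError("t_parallel_ms must be positive")
    return t_single_ms / t_parallel_ms


def parallel_efficiency(t_single_ms, t_parallel_ms, n_nodes):
    """Speedup divided by the number of nodes; 1.0 means ideal scaling."""
    if n_nodes < 1:
        raise ValueError("n_nodes must be at least 1")
    return speedup(t_single_ms, t_parallel_ms) / n_nodes
```

Under this reading, parallel_efficiency(65, 45, 8) evaluates 65 / (45 × 8), so a value well below 1 indicates that adding nodes reduces the runtime less than proportionally.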

Running the CESA algorithm on a standalone computer requires more processing runtime to compute event matching and to detect anomalous events than the distributed method. In this context, we tested the algorithm with an event stream size threshold of 100,000 tuples per computer node to evaluate the detection performance.

Fig. 7. Comparison of DAG runtime detection performance on standalone and distributed nodes.

Detection performance is assessed primarily through the CESA algorithm's runtime under the distributed and centralised methods. The processing runtime of the CESA algorithm is recorded in milliseconds (ms), as depicted on the y-axis. It is evident that, to process and compute contextual event streams over 800,000 tuples, the CESA algorithm requires less than 400 milliseconds (0.4 s). In contrast, for a similar event stream size with threshold e > 1000, a higher processing runtime of 1120 milliseconds (1.12 s) is expected when the detection is computed centrally.

Importantly, as the event stream size threshold e > t is scaled up to 800,000, the detection runtime also increases linearly and doubles. The results demonstrate that the proposed CESA algorithm over the DCED framework performs effectively with regard to detection runtime in real time.

6.6 Contextual Anomaly Node in DAG Model

A higher AScore accuracy indicates the stability of CESA in detecting contextual anomalous events. The CA node computes event streams according to the event occurrence time and their temporal order in each window partition. The algorithm in the DAG is learnt from 100,000 event stream tuples, approximately 10% of the expected size of the data streams, and the remaining 90% of the event stream tuples are used to test the model.

The results of the CESA algorithm over sequences of event streams in parallel are presented in Fig. 8, where the x-axis represents the event streams and the y-axis the resulting contextual anomalous events. The results demonstrate that S1 contains a higher number of contextual anomalous events than S2 and S3 for every window partition of w = 200,000 tuples. This is due to the high number of contextual events in each window partition according to the contextual AScore. On average, 50% of the contextual events occurred in S1 rather than in the other two sensor streams, S2 and S3. However, the number of contextual events per window increases linearly as the number of event streams is scaled up. In this context, the increase in the number of events depends not only on scaling up the size but also on the number of event streams per window partition.


Fig. 8. Contextual event stream result from event stream

As shown in Fig. 9, the top axis represents the size of the high-throughput event streams, while the x-axis represents the time-based contextual anomalous events in the intervals between t = 0:00, t = 12:00, and t = 24:00 for a sliding factor δ = 3. The model is then trained across every event partition of w = 200,000 tuples. The y-axis represents the number of, for example, high event stream scores (red symbol) and low or normal event scores (green symbol).

6.7 Learning Algorithm and Prediction Error Rate

The accuracy with which the model learns from the event stream is acceptable, judged by the positive prediction error rate. Importantly, the scoring output over the total size of the event stream depends mainly on how the algorithm learns from the change when ei ∉ RS or ei ∉ CA.

To evaluate the performance of the proposed model and algorithm and to estimate the computational error rate, both Holdout and Prequential evaluation have been suggested (Mouss et al., 2004). The former is complex and computationally expensive for testing the algorithm over high volumes of event streams. Prequential evaluation, on the other hand, is based on a test-then-train process and is therefore more reliable than Holdout for estimating the error rate of the algorithm and model (ref). The Prequential Page-Hinckley (PH) test is an appropriate method to measure the accuracy of the model’s computational score and to monitor change in the event stream values.

For the Contextual Anomaly (CA) performance evaluation, the threshold is set according to AScore < r. If the probability of value v of ei is below 0.5, the computation ratio is considered a positive result. Conversely, if the probability of value v of ei is above 0.5, the computation ratio is assumed to be negative. A negative ratio is due to the number of changes in the event stream tuples during data distribution while the algorithm is predicting in the learning process. The first assumption depends on the dissimilarity of the event stream tuples [13]. Importantly, PH relies on the accumulated sum of a loss function error to detect sudden change. This validation is a realistic way to measure change in event stream values during stream processing and partitioning.

Fig.9. Contextual anomalous event stream time-based detection.

Change in an event stream is common, and detection validation is expected whenever a new event e t+1 is added to the window partition. Thus, any change in the event tuple value at t+1 produces a prediction error e. This is based on two loss function error metrics, Mean Absolute Error (MAE) and Root Mean Square Error (RMSE). The prediction error rate is defined when the rule set is not covered by the model. Similar metrics were used in [14] to monitor change in streaming data.

$$\bar{e}_t = \frac{1}{t}\sum_{l=1}^{t} e_l \tag{9}$$


$$m_T = \frac{1}{T}\sum_{t=1}^{T}\left(e_t - \bar{e}_t - \alpha\right) \tag{10}$$

This concept is based on testing the cumulative sum mT, under the assumption $\sum_{i=1}^{n} e_i$. The differences between the observed ei and their mean are accumulated over the time interval [1, t], and change is expected according to Equations (9) and (10). MT maintains the minimum of the test statistic over the event stream, MT = min(mt, t = 1, . . ., T), and α refers to the change in every ei at t. The threshold parameter λ can be set between 0.0 and 1.0 to observe mT over the sequence (e1, e2, . . ., en). The α is associated with the magnitude of changes when PH computes PHT = mT − MT.
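The following sketch shows the Page-Hinckley test in the cumulative-sum form in which it is commonly presented, so its normalisation may differ from Eq. (10). The class name, the helper first_alarm, and the default values of alpha and lam are assumptions chosen for illustration; a change is signalled when PHT = mT − MT exceeds λ.

```python
class PageHinckley:
    """Page-Hinckley test for an upward shift in a stream of errors."""

    def __init__(self, alpha=0.005, lam=0.5):
        self.alpha = alpha           # tolerated magnitude of change
        self.lam = lam               # detection threshold (lambda)
        self.n = 0                   # number of errors seen so far
        self.mean = 0.0              # running mean of the errors
        self.m = 0.0                 # cumulative deviation m_T
        self.m_min = float("inf")    # running minimum M_T

    def update(self, error):
        """Consume one error value; return True when a change is signalled."""
        self.n += 1
        self.mean += (error - self.mean) / self.n
        self.m += error - self.mean - self.alpha
        self.m_min = min(self.m_min, self.m)
        return (self.m - self.m_min) > self.lam


def first_alarm(errors, alpha=0.005, lam=0.5):
    """Return the index of the first signalled change, or None."""
    ph = PageHinckley(alpha, lam)
    for i, err in enumerate(errors):
        if ph.update(err):
            return i
    return None
```

A larger lam makes the test slower to react but less prone to false alarms, while alpha sets how much drift is tolerated before deviations start to accumulate.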

The learning procedure of CESA is defined by how well the algorithm adapts to changes in the event stream distribution. This is based on the accumulated sum of the loss function error, denoted L ( f ), over n event streams. The fading factor is set to α = 0.5 for both MAE and RMSE and for the average Prequential error when testing every window partition of w = 200,000 event stream tuples; this value of the decay factor is considered ideal for measuring the error rates. The CESA algorithm’s parameters are summarised in Table 1.
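A faded prequential error can be sketched as below, using the common recursive form S_i = L_i + α S_{i-1} and N_i = 1 + α N_{i-1} with error S_i / N_i. The class and function names are hypothetical, and the last-value predictor only stands in for the CA model to show the test-then-train order.

```python
import math


class FadedError:
    """Prequential MAE and RMSE with an exponential fading factor."""

    def __init__(self, alpha=0.5):
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha
        self.abs_sum = 0.0   # faded sum of absolute errors
        self.sq_sum = 0.0    # faded sum of squared errors
        self.weight = 0.0    # faded count of observations

    def update(self, predicted, observed):
        err = observed - predicted
        self.abs_sum = abs(err) + self.alpha * self.abs_sum
        self.sq_sum = err * err + self.alpha * self.sq_sum
        self.weight = 1.0 + self.alpha * self.weight

    @property
    def mae(self):
        return self.abs_sum / self.weight if self.weight else 0.0

    @property
    def rmse(self):
        return math.sqrt(self.sq_sum / self.weight) if self.weight else 0.0


def prequential_last_value(stream, alpha=0.5):
    """Test-then-train loop with a naive last-value predictor."""
    tracker = FadedError(alpha)
    previous = None
    for observed in stream:
        if previous is not None:
            tracker.update(previous, observed)  # test on the new value first
        previous = observed                     # then update the predictor
    return tracker
```

With alpha = 0.5, each older error counts half as much as the one after it, so the reported error tracks recent behaviour of the stream rather than its whole history.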

Fig. 10 depicts the CESA algorithm's learning prediction error rate. RMSE is represented by ∆ and shows the root mean square error for the predefined fading factor range; the prediction error decreases slightly from 0.4 to 0.3. This is a positive result, since the size of the event stream is being scaled up.

Fig. 10. CESA algorithm's learning performance and prediction error rate.

Table 1. CESA algorithm experimental evaluation parameters.

This evidences the 95% accuracy obtained when testing event streams with the CA model to predict the error. MAE, on the other hand, is represented by the blue (×) symbol and consists of several points that indicate the change in the stream sequence over time; however, as the size of the event stream is scaled up, the change rate decreases, because fewer changes occur in the event stream behaviour and because of the window partitioning mechanism. This demonstrates how accurately the model predicts the error rates and that it can adapt rapidly to change when high volumes of event streams are partitioned in parallel. Importantly, these results indicate the stability of the CESA algorithm in a time-evolving setting and the stability of the model in relation to the detection accuracy scoring rate.

As depicted in Fig. 11, both the MAE and RMSE results demonstrate the accuracy of the CA model's cost function error in measuring the AScore accuracy rate. A low error score corresponds to high accuracy of the algorithm in detecting change and recovering from the learning prediction. In contrast, a high prediction error indicates instability of the algorithm, in the form of a delay in recovering after a change is detected. With respect to the stability of the CESA algorithm, the middle dashed line represents how adaptive the algorithm is in learning and predicting from the event streams.

The results validate the AScore probability computed by CESA for measuring and testing the event stream score range: < 0.5 positive or > 0.5 negative. The dashed (-) line represents the predefined threshold t < 0.25, and the threshold line is considered the prediction score error rate.


Fig.11. CA Model AScore error prediction rate learning.

As each new event stream ei is trained by the CA model, the AScore is updated effectively. This is achieved by training on event streams per window partition of w = 200,000 over L = 1,000,000 tuples. The experimental results indicate that the rule set computational error of the CA model for each event stream is reasonable.

7 CONCLUSION

Detecting high volumes of anomalous events from streams requires a robust method and a novel algorithm to handle the high rate of the streams. In this paper, we have achieved this by designing a novel contextual event stream detection model through:

1. defining an event stream processing data structure model.

2. designing a CA model to define the context of the event.

3. proposing a robust DAG model to handle the scalability of high volumes of streams.

First, the high volumes of event streams are divided into several window partitions to handle the high rate of the streams. The main benefit of this approach is that it copes with the data distribution of the event streams and with changes in the data streams. Second, the CA model is designed on the key-value data structure of the event stream. The CESA algorithm first checks the event stream status against the rule set of the AScore. If the AScore probability of an event stream is high, the event stream is considered a contextual anomalous event. Third, to address the scalability concern of processing high volumes of streams, we designed a novel distributed method based on the DAG distributed and parallel processing model. The DAG is made of three distributed computational nodes: Pre-Process, Match, and Contextual Anomaly. The CESA algorithm is implemented in the CA module phase of the DAG model.

The experimental results demonstrate the effectiveness of each factor that we measured and tested: the event stream size, the number of window partitions, and the processing time under scale-up. The results show that the distributed and parallel DAG is more efficient than the standalone approach for computing and detecting a high number of contextual event behaviours in real time. The main drawback of centralised computation is the number of designed computational functions, which in the topology are assigned to each node and performed by the workers in parallel. The experimental results satisfy the assumption of detecting high volumes of event streams, for example, one million event streams in less than one second.

As future work, contextual anomalous event detection can be extended to further domains, such as social media stream data, by assigning the AScore to the kth nearest window partition. A contextual snapshot model that matches dissimilar collections of data according to their context and time-series behaviour is an alternative solution to be proposed. This can be achieved by dividing arriving event streams into different time-based snapshot interval window partitions. The anomaly snapshot model can be based on groups of dissimilar contextual events. This problem can be further investigated with unsupervised learning; for example, by proposing micro-clusters that group similar event streams according to their tuple values (ce1, ct1), (ce2, ct2). The event streams can be split into three time intervals of t = 8, t = 16, and t = 24 per window partition. The partitioned event values can then be classified by context into segments of events per window partition.

Combined offline and online distributed anomaly detection is another future research direction. This approach has already been proposed in some research disciplines of data mining and machine learning; we therefore believe that distributed hybrid contextual anomaly detection is a challenging task for future work. This can be achieved by, first, building a contextual anomaly model offline from historical event data behaviour and, second, training on newly arriving event streams online over each window partition.

REFERENCES

[1] Angiulli, F. and F. Fassetti, Distance-based outlier queries in data streams: the novel task and algorithms. Data Mining and Knowledge Discovery, 2010. 20(2): p. 290-324.

[2] Chandola, V., A. Banerjee, and V. Kumar, Anomaly Detection for Discrete Sequences: A Survey. IEEE Transactions on Knowledge and Data Engineering, 2012. 24(5): p. 823-839.

[3] Yexi Jiang, C.Z., Jian Xu, Tao Li, Real time contextual collective anomaly detection over multiple data streams. 2014.

[4] Saleh, O., S. Hagedorn, and K.-U. Sattler, Complex Event Processing on Linked Stream Data. Datenbank-Spektrum, 2015. 15(2): p. 119-129.

[5] Mohamed Medhat Gaber, A.Z.a.S.K., Mining Data Streams: A Review in SIGMOD Record. 2005.

[6] Hao Huang, S.P.K., Streaming Anomaly Detection Using Randomized Matrix Sketching. 2015.

[7] Aggarwal, C.C., Data streams: models and algorithms. Vol. 31. 2007: Springer Science & Business Media.

[8] Brian Babcock, S.B., Mayur Datar, Rajeev Motwani, Jennifer Widom, Models and Issues in Data Stream Systems. 2002.

[9] Charu C. Aggarwal, J.H., Jianyong Wang, Philip S. Yu, A Framework for Clustering Evolving Data Streams. 2003.

[10] Pham, D.-S., et al., Anomaly detection in large-scale data stream networks. Data Mining and Knowledge Discovery, 2012. 28(1): p. 145-189.

[11] Park, B.-H. and H. Kargupta, Distributed data mining: Algorithms, systems, and applications. 2002.

[12] Duarte, J., J. Gama, and A. Bifet, Adaptive Model Rules From High-Speed Data Streams. ACM Transactions on Knowledge Discovery from Data, 2016. 10(3): p. 1-22.

[13] Daniel Kifer, S.B.-D., Johannes Gehrke, Detecting Change in Data Streams. 2004.

[14] H. Mouss, D.M., N Mouss and L Sefouhi, Test of Page-Hinckley, an approach for fault detection in an agro-alimentary production system. 2004.

[15] Abadi, D. J., Carney, D., Çetintemel, U., Cherniack, M., Convey, C., Lee, S., … Zdonik, S. (2003). Aurora: a new model and architecture for data stream management. The VLDB Journal The International Journal on Very Large Data Bases, 12(2), 120–139. https://doi.org/10.1007/s00778-003-0095-z

[16] Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., … Brain, G. (2016). TensorFlow: A System for Large-Scale Machine Learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’16) (pp. 265–284). https://doi.org/10.1038/nn.3331

[17] Aggarwal, C. C., & Wang, J. (2007). Data Streams: Models and Algorithms. Data Streams, 31, 9–38. https://doi.org/10.1007/978-0-387-47534-9

[18] Aggarwal, C. (2016). Outlier Analysis. New York: Yorktown Heights.

[19] Aggarwal, C. C. (2012). A segment-based framework for modelling and mining data streams. Knowledge and Information Systems, 30(1), 1–29. https://doi.org/10.1007/s10115-010-0366-0

[20] Aggarwal, C. C., Watson, T. J., Ctr, R., Han, J., Wang, J., & Yu, P. S. (2003). A Framework for Clustering Evolving Data Streams. Proc. of the 29th Int. Conf. on Very Large Data Bases, 81–92. https://doi.org/10.1.1.13.8650

[21] Agneeswaran, V. S. (2014). Big Data Analytics Beyond Hadoop. Big Data Analytics Beyond Hadoop. Retrieved from http://ptgmedia.pearsoncmg.com/images/9780133837940/samplepages/0133837947.pdf

[22] Akcora, C. G., Carminati, B., Ferrari, E., & Kantarcioglu, M. (2014). Detecting anomalies in social network data consumption. Social Network Analysis and Mining, 4(1), 1–16. https://doi.org/10.1007/s13278-014-0231-3

[23] Akter, S., & Wamba, S. F. (2016). Big data analytics in E-commerce: a systematic review and agenda for future research. Electronic Markets, 26(2), 173–194. https://doi.org/10.1007/s12525-016-0219-0

[24] Alam, A., & Ahmed, J. (2014). Hadoop architecture and its issues. In Proceedings - 2014 International Conference on Computational Science and Computational Intelligence, CSCI 2014 (Vol. 2, pp. 288–291). https://doi.org/10.1109/CSCI.2014.140

[25] Almeida, E., Ferreira, C., & Gama, J. (2013). Adaptive model rules from data streams. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 8188 LNAI, pp. 480–492). https://doi.org/10.1007/978-3-642-40988-2_31

[26] Amini, A., Saboohi, H., & Wah, T. Y. (2013). A multi density-based clustering algorithm for data stream with noise. In Proceedings - IEEE 13th International Conference on Data Mining Workshops, ICDMW 2013 (pp. 1105–1112). https://doi.org/10.1109/ICDMW.2013.170

[27] Amini, A., Wah, T. Y., & Saboohi, H. (2014). On density-based data streams clustering algorithms: A survey. Journal of Computer Science and Technology. https://doi.org/10.1007/s11390-014-1416-y


[28] Andrade, H., Gedik, B., Wu, K.-L., & Yu, P. S. (2011). Processing high data rate streams in System S. Journal of Parallel and Distributed Computing, 71(2), 145–156. https://doi.org/10.1016/j.jpdc.2010.08.007

[29] Angiulli, F., & Fassetti, F. (2010). Distance-based outlier queries in data streams: The novel task and algorithms. Data Mining and Knowledge Discovery, 20(2), 290–324. https://doi.org/10.1007/s10618-009-0159-9

[30] Arasu, A., Babcock, B., Babu, S., Cieslewicz, J., Ito, K., Motwani, R., … Widom, J. (2004). STREAM: The Stanford Data Stream Management System. Concrete, (2004–20), 1–21. https://doi.org/http://ilpubs.stanford.edu:8090/641/1/2004-20.pdf

[31] Amen, B., & Lu, J., (2015). Sketch of Big Data Real time Analytics Model. The Fifth International Conference on Advances in Information Mining and Management (IMMM), 21st - 26th June 2015, Brussels, Belgium.

[32] Assunção, M. D., Calheiros, R. N., Bianchi, S., Netto, M. A. S., & Buyya, R. (2015). Big Data computing and clouds: Trends and future directions. Journal of Parallel and Distributed Computing, 79–80, 3–15. https://doi.org/10.1016/j.jpdc.2014.08.003

[33] Atzori, L., Iera, A., & Morabito, G. (2010). The Internet of Things: A survey. Computer Networks, 54(15), 2787–2805. https://doi.org/10.1016/j.comnet.2010.05.010

[34] Babcock, B., Babu, S., Datar, M., Motwani, R., & Widom, J. (2002). Models and issues in data stream systems. In Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems - PODS ’02 (p. 1). https://doi.org/10.1145/543613.543615

[35] Badarna, M., & Wolff, R. (2014). Fast and accurate detection of changes in data streams. Statistical Analysis and Data Mining, 7(2), 125–139. https://doi.org/10.1002/sam.11216

[36] Bai, M., Wang, X., Xin, J., & Wang, G. (2015). An efficient algorithm for distributed density-based outlier detection on big data. Neurocomputing, 181, 19–28. https://doi.org/10.1016/j.neucom.2015.05.135

[37] Baldoni, R., Querzoni, L., Tarkoma, S., & Virgillito, A. (2009). Distributed event routing in publish/subscribe systems. In Middleware for Network Eccentric and Mobile Applications (pp. 219–244). https://doi.org/10.1007/978-3-540-89707-1_10

[38] Beigi, M. S., Chang, S.-F., Ebadollahi, S., & Verma, D. C. (2011). Anomaly detection in information streams without prior domain knowledge. IBM Journal of Research and Development, 55(5), 11:1-11:11. https://doi.org/10.1147/JRD.2011.2163280

[39] Bhatnagar, V., Kaur, S., & Chakravarthy, S. (2014). Clustering data streams using grid-based synopsis. Knowledge and Information Systems, 41(1), 127–152. https://doi.org/10.1007/s10115-013-0659-1

[40] Bifet, A., Morales-bueno, R., Baena-Garcia, M., Campo-Avila, J. Del, Fidalgo, R., Bifet, A., … Morales-bueno, R. (2006). Early Drift Detection Method. In 4th ECML PKDD International Workshop on Knowledge Discovery from Data Streams (Vol. 6, pp. 77–86). https://doi.org/10.1.1.61.6101

[41] Bifet, A., de Francisci Morales, G., Read, J., Holmes, G., & Pfahringer, B. (2015). Efficient Online Evaluation of Big Data Stream Classifiers. Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, (April 2016), 59–68. https://doi.org/10.1145/2783258.2783372

[42] Bifet, A. (2009). Adaptive Learning and Mining for Data Streams and Frequent Patterns. Dissertation Universitat Po-litecnica de Catalunya, 11(1), 55–56. https://doi.org/10.1145/1656274.1656287

[43] Bondu, A., & Boullé, M. (2011). A supervised approach for change detection in data streams. International Joint Conference on Neural Networks (IJCNN), 8. Retrieved from http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6033265%0Ahttp://www.marc-boulle.fr/publications/BonduEtAlIJCNN11.pdf

[44] Brzezinski, D., & Stefanowski, J. (2014). Reacting to different types of concept drift: The accuracy updated ensemble algorithm. IEEE Transactions on Neural Networks and Learning Systems, 25(1), 81–94. https://doi.org/10.1109/TNNLS.2013.2251352

[45] Cao, F., Ester, M., Qian, W., & Zhou, A. (2006). Density-based clustering over an evolving data stream with noise. Proceedings of the Sixth SIAM International Conference on Data Mining, 2006, 328–339. https://doi.org/10.1145/1552303.1552307

[46] Caron, E., & De Assuncao, M. D. (2017). Multi-criteria malleable task management for hybrid-cloud platforms. In Proceedings of 2016 International Conference on Cloud Computing Technologies and Applications, CloudTech 2016 (pp. 326–333). https://doi.org/10.1109/CloudTech.2016.7847717

[47] Chakrabarti, K., Keogh, E., Mehrotra, S., & Pazzani, M. (2002). Locally adaptive dimensionality reduction for indexing large time series databases. ACM Transactions on Database Systems, 27(2), 188–228. https://doi.org/10.1145/568518.568520

[48] Candela, V., Banerjee, A., & Kumar, V. (2012). Anomaly detection for discrete sequences: A survey. IEEE Transactions on Knowledge and Data Engineering. https://doi.org/10.1109/TKDE.2010.235

[49] Candela, V., Banerjee, A., & Kumar, V. (2009). Anomaly detection: A survey. ACM Computing Surveys (CSUR), 41(September), 1–58. https://doi.org/10.1145/1541880.1541882

[50] Chandrasekaran, S., Cooper, O., Deshpande, A., Franklin, M. J., Hellerstein, J. M., Hong, W., … Shah, M. (2003). TelegraphCQ: Continuous Dataflow Processing for an Uncertain World. Cidr, 20(March), 668. https://doi.org/10.1145/872757.872857

[51] Chaudhry, N., Shaw, K., & Abdelguerfi, M. (2005). Stream Data Management (1st ed.). US: Springer US

[52] Chen, M., Mao, S., & Liu, Y. (2014). Big data: A survey. In Mobile Networks and Applications (Vol. 19, pp. 171–209). https://doi.org/10.1007/s11036-013-0489-0

[53] Cugola, G., & Margara, A. (2012). Processing flows of information. ACM Computing Surveys, 44(3), 1–62. https://doi.org/10.1145/2187671.2187677

[54] De Matteis, T., & Mencagli, G. (2016). Keep calm and react with foresight: Strategies for Low-Latency and Energy-Efficient Elastic Data Stream Processing. Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming - PPoPP ’16, 1–12. https://doi.org/10.1145/2851141.2851148

[55] Demšar, J., & Bosnić, Z. (2018). Detecting concept drift in data streams using model explanation. Expert Systems with Applications, 92, 546–559. https://doi.org/10.1016/j.eswa.2017.10.003

[56] Ding, S., Wu, F., Qian, J., Jia, H., & Jin, F. (2015). Research on data stream clustering algorithms. Artificial Intelligence Review, 43(4), 593–600. https://doi.org/10.1007/s10462-013-9398-7

[57] Dobre, C., & Xhafa, F. (2014). Parallel programming paradigms and frameworks in Big Data Era. International Journal of Parallel Programming, 42(5), 710–738. https://doi.org/10.1007/s10766-013-0272-7

[58] Doulkeridis, C., & Nørvåg, K. (2014). A survey of large-scale analytical query processing in MapReduce. VLDB Journal. https://doi.org/10.1007/s00778-013-0319-9

[59] Department for Business, Energy & Industrial Strategy. (2016). Smart Meters Quarterly Report to End September: final report. Retrieved from https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/579197/2016_Q3_Smart_Meters_Report_Final.pdf

[61] Duarte, J., Gama, J., & Bifet, A. (2016). Adaptive Model Rules From High-Speed Data Streams. ACM Transactions on Knowledge Discovery from Data, 10(3), 1–22. https://doi.org/10.1145/2829955

[62] Erfani, S. M., Rajasegarar, S., Karunasekera, S., & Leckie, C. (2016). High-dimensional and large-scale anomaly detection using a linear one-class SVM with deep learning. Pattern Recognition, 58, 121–134. https://doi.org/10.1016/j.patcog.2016.03.028

[63] Esposito, C., Ficco, M., Palmieri, F., & Castiglione, A. (2015). A knowledge-based platform for big data analytics based on publish/subscribe services and stream processing. Knowledge-Based Systems, 79, 3–17. https://doi.org/10.1016/j.knosys.2014.05.003

[64] Eugster, P. T., Felber, P. A., Guerraoui, R., & Kermarrec, A.-M. (2003). The many faces of publish/subscribe. ACM Computing Surveys, 35(2), 114–131. https://doi.org/10.1145/857076.857078

[65] Facca, F. M., & Lanzi, P. L. (2005). Mining interesting knowledge from weblogs: A survey. Data and Knowledge Engineering. https://doi.org/10.1016/j.datak.2004.08.001

[66] Fahad, A., Alshatri, N., Tari, Z., Alamri, A., Khalil, I., Zomaya, A. Y., … Bouras, A. (2014). A survey of clustering algorithms for big data: Taxonomy and empirical analysis. IEEE Transactions on Emerging Topics in Computing, 2(3), 267–279. https://doi.org/10.1109/TETC.2014.2330519

[67] Fan, J., & Liu, H. (2013). Statistical analysis of big data on pharmacogenomics. Advanced Drug Delivery Reviews. https://doi.org/10.1016/j.addr.2013.04.008

[68] Faria, E. R., Gonçalves, I. J. C. R., de Carvalho, A. C. P. L. F., & Gama, J. (2016). Novelty detection in data streams. Artificial Intelligence Review, 45(2), 235–269. https://doi.org/10.1007/s10462-015-9444-8

[69] Faria, E. R., Gama, J., & Carvalho, A. C. (2013). Novelty detection algorithm for data streams multi-class problems. Proceedings of the 28th Annual ACM Symposium on Applied Computing, 795–800. https://doi.org/10.1145/2480362.2480515

[70] Farid, D. M., Zhang, L., Hossain, A., Rahman, C. M., Strachan, R., Sexton, G., & Dahal, K. (2013). An adaptive ensemble classifier for mining concept drifting data streams. Expert Systems with Applications, 40(15), 5895–5906. https://doi.org/10.1016/j.eswa.2013.05.001

[71] Ferrer-Troyano, F., Aguilar-Ruiz, J. S., & Riquelme, J. C. (2005). Incremental rule learning based on example nearness from numerical data streams. In Proceedings of the 2005 ACM symposium on Applied computing - SAC ’05 (p. 568). https://doi.org/10.1145/1066677.1066808

[72] Folino, G., & Sabatino, P. (2016). Ensemble based collaborative and distributed intrusion detection systems: A survey. Journal of Network and Computer Applications. https://doi.org/10.1016/j.jnca.2016.03.011

[73] Gama, J., Medas, P., Castillo, G., & Rodrigues, P. (2004). Learning with drift detection. Brazilian Symposium on Artificial Intelligence, 286–295. https://doi.org/10.1007/978-3-540-28645-5_29

[74] Gama, J. (2012). A survey on learning from data streams: current and future trends. Progress in Artificial Intelligence, 1, 45–55. https://doi.org/10.1007/s13748-011-0002-6

[75] Gama, J., Sebastião, R., & Rodrigues, P. P. (2009). Issues in evaluation of stream learning algorithms. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD ’09 (p. 329). https://doi.org/10.1145/1557019.1557060

[76] Gama, J., Žliobaitė, I., Bifet, A., Pechenizkiy, M., & Bouchachia, A. (2014). A survey on concept drift adaptation. ACM Computing Surveys, 46(4), 1–37. https://doi.org/10.1145/2523813

[77] Gao, X., Ferrara, E., & Qiu, J. (2015). Parallel clustering of high-dimensional social media data streams. In Proceedings - 2015 IEEE/ACM 15th International Symposium on Cluster, Cloud, and Grid Computing, CCGrid 2015 (pp. 323–332). https://doi.org/10.1109/CCGrid.2015.19

Grigoris Antoniou is a Professor at the University of Huddersfield in the Computer Science and Informatics department. He received his PhD at Osnabrueck (Germany) on program verification. After spending a few more years in Osnabrueck, he emigrated to Australia in 1994, where he stayed until 2001. During that period he moved up the academic ladder, becoming Professor at Griffith University in 1999. His research in that period was mainly concerned with issues of knowledge representation. Australia was nice and offered ample research funding possibilities, but it was too far away from home, so he decided to return to Europe. After a short stay in Bremen, he became Professor at the University of Crete in 2002, where he spent the next 10 years. From 2004 to 2011 he was, in addition, Head of the Information Systems Laboratory at FORTH, the top-rated research institu-

Bakhtiar Amen received his undergraduate BSc (Hons) degree in Software Engineering from the University of Huddersfield in 2010. He also completed his postgraduate MSc degree with distinction in Advanced Computer Science in 2013. He completed his Ph.D. at the University of Huddersfield in 2018. He worked at Aston University (UK) as a postdoctoral research fellow in the field of Smart Lighting and Energy Efficient Lighting Data Analytics within the Institution of System Analytics for the Big Data Corridor ERDF project. Bakhtiar is a member of several professional computing research communities, including the British Computer Society (BCS), the Association for Computing Machinery (ACM), and the Institute of Electrical and Electronics Engineers (IEEE). He joined the Department of Computer Science at the University of Liverpool in October 2018.

Dr Violeta Holmes is a Reader in High Performance Computing at the University of Huddersfield with over 25 years of teaching and research experience in computing and engineering. She leads the High Performance Computing (HPC) Research Group at the University of Huddersfield, and is an ARCHER Champion and a Deep Learning Institute certified instructor. Her research interests and expertise are in the areas of HPC systems infrastructure, computer clusters, grids, cloud computing, intelligent agents, big data, the Internet of Things, and embedded systems. She was awarded the status of Chartered Engineer and Fellowship of the Higher Education Academy, and is a member of the Institution of Engineering and Technology (IET) and the British Computer Society (BCS).