
Deep Anomaly Detection for Time-series Data in Industrial IoT: A Communication-Efficient On-device Federated Learning Approach

Yi Liu, Sahil Garg, Member, IEEE, Jiangtian Nie, Yang Zhang, Zehui Xiong, Jiawen Kang, M. Shamim Hossain, Senior Member, IEEE

Abstract—Since edge device failures (i.e., anomalies) seriously affect the production of industrial products in the Industrial IoT (IIoT), detecting anomalies accurately and in a timely manner is becoming increasingly important. Furthermore, the data collected by edge devices may contain users' private data, which challenges current detection approaches as user privacy has become a growing public concern. With this focus, this paper proposes a new communication-efficient on-device federated learning (FL)-based deep anomaly detection framework for sensing time-series data in IIoT. Specifically, we first introduce an FL framework that enables decentralized edge devices to collaboratively train an anomaly detection model, which improves its generalization ability. Second, we propose an Attention Mechanism-based Convolutional Neural Network-Long Short Term Memory (AMCNN-LSTM) model to accurately detect anomalies. The AMCNN-LSTM model uses attention mechanism-based CNN units to capture important fine-grained features, thereby preventing memory loss and gradient dispersion problems, while retaining the advantages of the LSTM unit in predicting time-series data. Third, to adapt the proposed framework to the timeliness requirements of industrial anomaly detection, we propose a gradient compression mechanism based on Top-k selection to improve communication efficiency. Extensive experiments on four real-world datasets demonstrate that the proposed framework detects anomalies accurately and in a timely manner, and reduces the communication overhead by 50% compared to a federated learning framework that does not use the gradient compression scheme.

Index Terms—Federated learning, deep anomaly detection, gradient compression, Industrial Internet of Things.

This research is supported by Alibaba Group through the Alibaba Innovative Research (AIR) Program and the Alibaba-NTU Singapore Joint Research Institute (JRI), NTU, Singapore; the Open Research Project of the State Key Laboratory of Industrial Control Technology, Zhejiang University, China (No. ICT20044); the National Natural Science Foundation of China (Grant No. 51806157); and the Young Innovation Talents Project in Higher Education of Guangdong Province, China (Grant No. 2018KQNCX333). The authors extend their appreciation to the Researchers Supporting Project number (RSP-2020/32), King Saud University, Riyadh, Saudi Arabia for funding this work. (Corresponding author: Jiawen Kang)

Y. Liu is with the School of Data Science and Technology, Heilongjiang University, Harbin, China (e-mail: [email protected]). S. Garg is with the Electrical Engineering Department, École de technologie supérieure, Université du Québec, Montréal, Canada (e-mail: [email protected]). J. Nie is with the Energy Research Institute, Interdisciplinary Graduate Programme, and the School of Computer Science and Engineering, Nanyang Technological University, Singapore (e-mail: [email protected]). Y. Zhang is with the Hubei Key Laboratory of Transportation Internet of Things, School of Computer Science and Technology, Wuhan University of Technology, China (e-mail: [email protected]). Z. Xiong is with the Alibaba-NTU Joint Research Institute and the School of Computer Science and Engineering, NTU, Singapore (e-mail: [email protected]). J. Kang is with the Energy Research Institute, Nanyang Technological University, Singapore (e-mail: [email protected]). M. Shamim Hossain is with the Department of Software Engineering, College of Computer and Information Sciences, King Saud University, Saudi Arabia (e-mail: [email protected]).

arXiv:2007.09712v1 [cs.LG] 19 Jul 2020

I. INTRODUCTION

THE widespread deployment of edge devices in the Industrial Internet of Things (IIoT) paradigm has spawned a variety of emerging applications with edge computing, such as smart manufacturing, intelligent transportation, and intelligent logistics [1]. Edge devices provide powerful computation resources that enable real-time, flexible, and quick decision making for IIoT applications, which has greatly promoted the development of Industry 4.0 [2]. However, IIoT applications suffer from critical security risks caused by abnormal IIoT nodes, which hinders the rapid development of IIoT. For example, in smart manufacturing scenarios, industrial devices acting as IIoT nodes, e.g., engines with sensors, that exhibit abnormal behaviors (e.g., abnormal traffic and irregular reporting frequency) may interrupt industrial production, resulting in huge economic losses for factories [3], [4]. Edge devices (e.g., industrial robots) generally collect sensing data from IIoT nodes, especially time-series data, to analyze and capture the behaviors and operating conditions of IIoT nodes by edge computing [5]. Therefore, these sensing time-series data can be used to detect anomalous behaviors of IIoT nodes [6].

To address abnormality problems in IIoT devices, typical methods perform anomaly detection for the affected IIoT devices [7]–[10]. Previous work focused on utilizing deep anomaly detection (DAD) [11] approaches to detect abnormal behaviors of IIoT devices by analyzing sensing time-series data. DAD techniques can learn hierarchical discriminative features from historical time-series data. In [12]–[14], the authors proposed Long Short Term Memory (LSTM) network-based deep learning models to achieve anomaly detection in sensing time-series data. Munir et al. in [15] proposed a novel DAD approach, called DeepAnT, which uses a deep Convolutional Neural Network (CNN) to predict anomalous values. Although existing DAD approaches have achieved success in anomaly detection, they cannot be directly applied to IIoT scenarios with distributed edge devices for timely and accurate anomaly detection. The reasons are two-fold: (i) most detection models in traditional approaches are not flexible enough; edge devices lack dynamic and automatically updated detection models for different scenarios, and hence the models fail to accurately predict frequently updated time-series data [8]; (ii) due to privacy concerns, edge devices are not willing to share their own collected time-series data with each other, so the data exists in the form of "islands." These data islands significantly degrade the performance of anomaly detection. Furthermore, it is often overlooked that the data may contain sensitive private information, which leads to potential privacy leakage. There are indeed privacy issues in the anomaly detection context; for example, an anomaly detection model may reveal a patient's heart disease history when detecting the patient's abnormal pulse [16], [17].

To address the above challenges, a promising on-device privacy-preserving distributed machine learning paradigm, called on-device federated learning (FL), was proposed, which allows edge devices to train a global DAD model while keeping the training datasets local, without sharing raw training data [18]. Such a framework allows edge devices to collaboratively train an on-device DAD model without compromising privacy. For example, the authors in [19] proposed an on-device FL-based approach to achieve collaborative anomaly detection. Tsukada et al. in [20] utilized the FL framework to propose a Backpropagation Neural Network (BPNN)-based approach for anomaly detection. However, previous research ignores the communication overhead of FL-based model training among large-scale edge devices. Expensive communication may cause excessive overhead and long convergence times for edge devices, so that the on-device DAD model cannot detect anomalies quickly. Therefore, it is necessary to develop a communication-efficient on-device FL framework to achieve accurate and timely anomaly detection for edge devices.

In this paper, we propose a communication-efficient on-device FL framework that leverages an attention mechanism-based CNN-LSTM (AMCNN-LSTM) model to achieve accurate and timely anomaly detection for edge devices. First, we introduce an FL framework to enable distributed edge devices to collaboratively train a global DAD model without compromising privacy. Second, we propose the AMCNN-LSTM model to detect anomalies. Specifically, we use attention-based CNNs to extract fine-grained features of historical observation-sensing time-series data and use LSTM modules for time-series prediction. Such a model can prevent memory loss and gradient dispersion problems. Third, to further improve the communication efficiency of the proposed framework, we propose a gradient compression mechanism based on Top-k selection to reduce the number of gradients uploaded by edge devices. We evaluate the proposed framework on four real-world datasets: power demand, space shuttle, ECG, and engine. Experimental results show that the proposed framework achieves high-efficiency communication and accurate, timely anomaly detection. The contributions of this paper are summarized as follows:

• We introduce a federated learning framework to develop an on-device collaborative deep anomaly detection model for edge devices in IIoT.

• We propose an attention mechanism-based CNN-LSTM model to detect anomalies, which uses CNN units to capture the fine-grained features of time-series data and an LSTM module to detect anomalies accurately and in a timely manner.

• We propose a Top-k selection-based gradient compression scheme to improve the proposed framework's communication efficiency. This scheme decreases communication overhead by reducing the gradient parameters exchanged between the edge devices and the cloud aggregator.

• We conduct extensive experiments on four real-world datasets to demonstrate that the proposed framework accurately detects anomalies with low communication overhead.

II. RELATED WORK

A. Deep Anomaly Detection

Deep Anomaly Detection (DAD) has always been a hot issue in IIoT, serving the function of detecting anomalies. Previous research on DAD can generally be divided into three categories: supervised, semi-supervised, and unsupervised DAD approaches.

Supervised Deep Anomaly Detection: Supervised deep anomaly detection typically uses the labels of normal and abnormal data to train a deep supervised binary or multi-class classifier. For example, Erfani et al. in [21] proposed a supervised Support Vector Machine (SVM) classifier for high-dimensional data to classify normal and abnormal data. Despite the success of supervised DAD methods in anomaly detection, these methods are not as popular as semi-supervised or unsupervised methods due to the lack of labeled training data [22]. Furthermore, supervised DAD methods perform poorly on data with class imbalance (where the total number of positive-class samples is much larger than the total number of negative-class samples) [12].

Semi-supervised Deep Anomaly Detection: Since labels are easier to obtain for normal instances than for anomalies, semi-supervised DAD techniques use a single existing label (normally the positive class) to separate outliers [11]. For example, Wulsin et al. in [23] applied Deep Belief Nets (DBNs) in a semi-supervised paradigm to model Electroencephalogram (EEG) waveforms for classification and anomaly detection. The semi-supervised DBN's performance is comparable to a standard classifier on the EEG dataset. The semi-supervised DAD approach is popular because it can detect anomalies using only a single class of labels.

Unsupervised Deep Anomaly Detection: Unsupervised deep anomaly detection techniques use the inherent properties of data instances to detect outliers [11]. For example, Zong et al. in [24] proposed a Deep Autoencoding Gaussian Mixture Model (DAGMM) for unsupervised anomaly detection. Schlegl et al. in [25] proposed a deep convolutional generative adversarial network, called AnoGAN, which detects abnormal anatomical images by learning from a variety of normal anatomical images; they trained this model in an unsupervised manner. Unsupervised DAD is widely used since it does not require labeled training data.

B. Communication-Efficient Federated Learning

Google proposed a privacy-preserving distributed machine learning framework, called FL, to train machine learning models without compromising privacy [26]. Inspired by this framework, different edge devices can contribute to global model training while keeping the training data local. However, communication overhead is the bottleneck preventing FL from being widely used in IIoT [18]. Previous work has focused on designing efficient stochastic gradient descent (SGD) algorithms and using model compression to reduce the communication overhead of FL. Agarwal et al. in [27] proposed an efficient cpSGD algorithm to achieve communication-efficient FL. Reisizadeh et al. in [28] used periodic averaging and quantization methods to design a communication-efficient FL framework. Jeong et al. in [29] proposed a federated model distillation method to reduce the communication overhead of FL.

However, the above methods do not substantially reduce the number of gradients exchanged between edge devices and the cloud aggregator, and a large number of exchanged gradients may cause excessive communication overhead for FL [30]. Therefore, in this paper, we propose a Top-k selection-based gradient compression scheme to improve the communication efficiency of FL.

III. PRELIMINARY

In this section, we briefly introduce anomalies, federated deep learning, and gradient compression.

A. Anomalies

In statistics, anomalies (also called outliers or abnormalities) are data points that differ significantly from other observations [11]. Suppose that N1, N2, and N3 are regions composed of most observations; they are regarded as normal data instance regions. If data points O1 and O2 are far from these regions, they can be classified as anomalies. To define anomalies more formally, we assume that an n-dimensional dataset of points $\vec{x}_i = (x_{i,1}, \ldots, x_{i,n})$, $i \in \{1, \ldots, m\}$, follows a normal distribution with mean $\mu_j$ and variance $\sigma_j^2$ in each dimension $j \in \{1, \ldots, n\}$. Specifically, for $j \in \{1, \ldots, n\}$, under the assumption of the normal distribution, we have:

$$\mu_j = \frac{1}{m}\sum_{i=1}^{m} x_{i,j}, \qquad \sigma_j^2 = \frac{1}{m}\sum_{i=1}^{m} \left(x_{i,j} - \mu_j\right)^2. \qquad (1)$$

For a new vector $\vec{x}$, the probability $p(\vec{x})$ of an anomaly can be calculated as follows:

$$p(\vec{x}) = \prod_{j=1}^{n} p\left(x_j; \mu_j, \sigma_j^2\right) = \prod_{j=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma_j} \exp\left(-\frac{(x_j - \mu_j)^2}{2\sigma_j^2}\right). \qquad (2)$$

We can then judge whether the vector $\vec{x}$ is an anomaly according to this probability value.
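To make this preliminary concrete, here is a minimal NumPy sketch of the Gaussian anomaly scoring in Eqs. (1)–(2); the synthetic dataset and the decision threshold eps are illustrative assumptions, not values from the paper.

```python
import numpy as np

def fit_gaussian(X):
    """Estimate per-dimension mean and variance (Eq. 1) from an (m, n) dataset."""
    mu = X.mean(axis=0)
    sigma2 = X.var(axis=0)  # population variance, i.e., divide by m
    return mu, sigma2

def anomaly_probability(x, mu, sigma2):
    """Joint Gaussian density p(x) over all dimensions (Eq. 2)."""
    coef = 1.0 / np.sqrt(2.0 * np.pi * sigma2)
    dens = coef * np.exp(-(x - mu) ** 2 / (2.0 * sigma2))
    return float(np.prod(dens))

# Illustrative usage with synthetic data and a hypothetical threshold.
X = np.random.normal(loc=0.0, scale=1.0, size=(1000, 3))
mu, sigma2 = fit_gaussian(X)
x_new = np.array([4.0, -3.5, 5.0])
eps = 1e-6  # hypothetical decision threshold
print("anomalous" if anomaly_probability(x_new, mu, sigma2) < eps else "normal")
```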

B. Federated Learning

Traditional distributed deep learning techniques require a certain amount of private data to be aggregated and analyzed at central servers (e.g., cloud servers) during the model training phase by using the distributed stochastic gradient descent (D-SGD) algorithm [31]. Such a training process suffers from potential data privacy leakage risks for IIoT devices. To address these privacy challenges, a collaboratively distributed deep learning paradigm, called federated deep learning, was proposed for edge devices to train a global model while keeping the training datasets local, without sharing raw training data [18]. The procedure of FL is divided into three phases: the initialization phase, the aggregation phase, and the update phase. In the initialization phase, we consider FL with N edge devices and a parameter aggregator, i.e., a cloud aggregator, which distributes a global model ω_t pre-trained on public datasets (e.g., MNIST [32], CIFAR-10 [33]) to each edge device. Following that, each device k uses its local dataset D_k of size D_k to train and improve the current global model ω_t in each iteration. In the aggregation phase, the cloud aggregator collects the local gradients uploaded by the edge nodes (i.e., edge devices). To do so, the local loss function to be optimized is defined as follows:

$$\min_{x \in \mathbb{R}^d} F_k(x) = \frac{1}{D_k} \sum_{i \in \mathcal{D}_k} \mathbb{E}_{z_i \sim \mathcal{D}_k} f(x; z_i) + \lambda h(x), \qquad (3)$$

where f(·;·) is the local loss function for edge device k, λ ∈ [0, 1], h(·) is a regularizer function for edge device k, and, for all i ∈ [1, ..., n], z_i is sampled from the local dataset D_k on device k. In the update phase, the cloud aggregator uses the Federated Averaging (FedAVG) algorithm [26] to obtain a new global model ω_{t+1} for the next iteration:

$$\omega_{t+1} \leftarrow \omega_t + \frac{1}{n} \sum_{n=1}^{N} F^n_{t+1}, \qquad (4)$$

where $\sum_{n=1}^{N} F^n_{t+1}$ denotes the aggregation of model updates and $\frac{1}{n}\sum_{n=1}^{N} F^n_{t+1}$ denotes the average aggregation (i.e., the FedAVG algorithm). Both the edge devices and the cloud aggregator repeat the above process until the global model converges. This paradigm significantly reduces the risks of privacy leakage by decoupling model training from direct access to the raw training data on edge nodes.
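The aggregation loop described above can be sketched as follows; this is a minimal illustration of a FedAVG-style round, assuming models are represented as lists of NumPy arrays and using a dummy stand-in for local training.

```python
import numpy as np

def local_update(global_weights, local_data, lr=0.01):
    """Placeholder for one device's local training; returns a model update.
    In a real system this would run several epochs of SGD on local_data."""
    return [lr * np.random.randn(*w.shape) for w in global_weights]  # dummy update

def fedavg(global_weights, updates):
    """Average the device updates and apply them to the global model (Eq. 4)."""
    n = len(updates)
    avg = [sum(u[i] for u in updates) / n for i in range(len(global_weights))]
    return [w + a for w, a in zip(global_weights, avg)]

# Illustrative round with 3 devices and a tiny two-layer model.
global_weights = [np.zeros((4, 8)), np.zeros(8)]
device_datasets = [None, None, None]  # stand-ins for local sensing data
updates = [local_update(global_weights, d) for d in device_datasets]
global_weights = fedavg(global_weights, updates)
```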

C. Gradient Compression

Large-scale FL training requires significant communication bandwidth for gradient exchange, which limits the scalability of multi-node training [34]. In this context, Lin et al. in [34] stated that 99.9% of the gradient exchange in D-SGD is redundant. To prevent expensive communication bandwidth from limiting large-scale distributed training, gradient compression was proposed to greatly reduce communication bandwidth. Researchers generally use gradient quantization [35] and gradient sparsification [36] to achieve gradient compression. Gradient quantization reduces communication bandwidth by quantizing gradients to low-precision values. Gradient sparsification uses threshold quantization to reduce communication bandwidth.

Fig. 1. The workflow of the on-device deep anomaly detection in IIoT.

For a fully connected (FC) layer in a deep neural network, we have b = f(W ∗ a + v), where a is the input, v is the bias, W is the weight matrix, f is the nonlinear mapping, and b is the output. This formula is the most basic operation in a neural network. For a specific neuron i, the above formula can be simplified to $b_i = \mathrm{ReLU}\left(\sum_{j=0}^{n-1} W_{ij} a_j\right)$, where ReLU is the activation function. Gradient compression compresses the corresponding weight matrix into a sparse matrix, and hence the corresponding formula is given as follows:

$$b_i = \mathrm{ReLU}\left(\sum_{j \in X_i \cap Y} \mathrm{Sparse}\left[I_{ij}\right] a_j\right), \qquad (5)$$

where $\sum_{j \in X_i \cap Y} \mathrm{Sparse}[I_{ij}]$ represents the compressed weight matrix and i, j represent the position of the gradient in the weight matrix W. This method reduces the communication overhead by sparsifying the gradients in the weight matrix W.
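As a concrete illustration of this preliminary, the following sketch performs magnitude-based Top-k sparsification of a gradient matrix; the compression ratio is an illustrative assumption.

```python
import numpy as np

def topk_sparsify(grad, ratio=0.001):
    """Keep only the fraction `ratio` of entries with the largest magnitude;
    return the sparse values, their flat indices, and the residual left locally."""
    flat = grad.ravel()
    k = max(1, int(flat.size * ratio))
    idx = np.argpartition(np.abs(flat), -k)[-k:]   # indices of top-k magnitudes
    values = flat[idx]
    residual = flat.copy()
    residual[idx] = 0.0        # small gradients stay in the local buffer
    return values, idx, residual.reshape(grad.shape)

grad = np.random.randn(256, 256)
values, idx, residual = topk_sparsify(grad, ratio=0.001)
print(values.size, "of", grad.size, "gradients transmitted")
```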

IV. SYSTEM MODEL

We consider a generic setting for on-device deep anomaly detection in IIoT, where a cloud aggregator and edge devices work collaboratively to train a DAD model by using a given training algorithm (e.g., LSTM) for a specific task (i.e., an anomaly detection task), as illustrated in Fig. 1. The edge devices train a shared global model locally on their own local datasets (i.e., sensing time-series data from IIoT nodes) and upload their model updates (i.e., gradients) to the cloud aggregator. The cloud aggregator uses the FedAVG algorithm or other aggregation algorithms to aggregate these model updates and obtain a new global model. Finally, the edge devices receive the new global model sent by the cloud aggregator and use it to achieve accurate and timely anomaly detection.

A. System Model Limitations

The proposed framework focuses on a DAD model learning task involving N distributed edge devices and a cloud aggregator. In this context, the framework has two limitations: missing labels and communication overhead.

For the missing-label limitation, we assume that the labels of training samples with proportion p (0 < p < 1) are missing. The lack of sample labels causes class imbalance, thereby reducing the accuracy of the DAD model. For the communication-overhead limitation, we consider that excessive communication overhead arises when a large number of gradients are exchanged between the edge devices and the cloud aggregator, which may prevent the model from converging [29].

The above restrictions hinder the deployment of the DAD model on edge devices, which motivates us to develop a communication-efficient FL-based unsupervised DAD framework to achieve accurate and timely anomaly detection.

Fig. 2. Overview of the on-device communication-efficient deep anomaly detection framework in IIoT. The framework's workflow consists of five steps: (i) the edge device uses the sensing time-series data collected from IIoT nodes as a local dataset (step ①); (ii) the edge device performs local model (i.e., AMCNN-LSTM model) training on the local dataset (step ②); (iii) the edge device uploads the sparse gradients Δω to the cloud aggregator by using the gradient compression mechanism (step ③); (iv) the cloud aggregator obtains a new global model by aggregating the sparse gradients uploaded by the edge devices (step ④); (v) the cloud aggregator sends the new global model to each edge device. The above steps are executed cyclically until the global model converges (step ⑤). Decentralized devices can then use the optimal global model to perform anomaly detection tasks.

B. The Proposed Framework

We consider an on-device communication-efficient deep anomaly detection framework that involves multiple edge devices in collaborative model training in IIoT, as illustrated in Fig. 2. In particular, this framework consists of a cloud aggregator and edge devices. Furthermore, the proposed framework includes two mechanisms: an anomaly detection mechanism and a gradient compression mechanism. More details are described as follows:

• Cloud Aggregator: The cloud aggregator is generally a cloud server with strong computing power and rich computing resources. The cloud aggregator has two functions: (1) it initializes the global model and sends it to all edge devices; (2) it aggregates the gradients uploaded by the edge devices until the model converges.

• Edge Devices: Edge devices are generally agents and clients, such as whirlpool appliances, wind turbines, and vehicles, which contain local models and functional mechanisms (see below for more details). Each edge device uses its local dataset (i.e., sensing time-series data from IIoT nodes) to train the global model sent by the cloud aggregator and uploads its gradients to the cloud aggregator until the global model converges. The local model is deployed on the edge device, where it performs anomaly detection. In this paper, we use the AMCNN-LSTM model to detect anomalies, which uses CNN units to capture the fine-grained features of sensing time-series data and an LSTM module to detect anomalies accurately and in a timely manner.

The functions of the two mechanisms are described as follows:



• Deep Anomaly Detection Mechanism: The deep anomaly detection mechanism is deployed on the edge devices and detects anomalies to reduce economic losses.

• Gradient Compression Mechanism: The gradient compression mechanism is deployed on the edge devices and compresses the local gradients to reduce the number of gradients exchanged between the edge devices and the cloud aggregator, thereby reducing communication overhead.

C. Design Goals

In this paper, our goal is to develop an on-device communication-efficient FL framework for deep anomaly detection in IIoT. First, the proposed framework needs to detect anomalies accurately in an unsupervised manner; it uses an unsupervised AMCNN-LSTM model to detect anomalies. Second, the proposed framework should significantly improve communication efficiency by using a gradient compression mechanism. Third, the performance of the proposed framework should be comparable to that of traditional FL frameworks.

V. A COMMUNICATION-EFFICIENT ON-DEVICE DEEP ANOMALY DETECTION FRAMEWORK

In this section, we first present the attention mechanism-based CNN-LSTM model. This model uses CNN units to capture the fine-grained features of sensing time-series data and an LSTM module to detect anomalies accurately and in a timely manner. We then propose a deep gradient compression mechanism to further improve the communication efficiency of the proposed framework.

A. Attention Mechanism-based CNN-LSTM Model

We present an unsupervised AMCNN-LSTM model consisting of an input layer, an attention mechanism-based CNN unit, an LSTM unit, and an output layer, as shown in Fig. 3. First, we feed the preprocessed data to the input layer. Second, we use CNNs to capture the fine-grained features of the input and utilize the attention mechanism to focus on the important ones among the CNN-captured features. Third, we use the output of the attention mechanism-based CNN unit as the input of the LSTM unit and use the LSTM to predict future time-series data. Finally, we propose an anomaly detection score to detect anomalies.

Fig. 3. Overview of the attention mechanism-based CNN-LSTM model.

Preprocessing: We normalize the sensing time-series data collected by the IIoT nodes into [0, 1] to accelerate model convergence.
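A minimal sketch of this min-max normalization step (assuming 1-D NumPy series; the epsilon guarding constant series is our addition):

```python
import numpy as np

def minmax_normalize(series, eps=1e-8):
    """Scale a 1-D time series into [0, 1] (constant series map to 0)."""
    lo, hi = series.min(), series.max()
    return (series - lo) / (hi - lo + eps)

x = np.array([120.0, 135.0, 128.0, 190.0, 60.0])
print(minmax_normalize(x))
```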

Attention Mechanism-based CNN Unit: First, we introduce an attention mechanism in the CNN unit to improve its focus on important features. In cognitive science, due to bottlenecks in information processing, humans selectively focus on important parts of information while ignoring other visible information [37]. Inspired by this, attention mechanisms have been proposed for various tasks, such as computer vision and natural language processing [37]–[39]. The attention mechanism can thus improve the performance of the model by attending to important features. The formal definition of the attention mechanism is given as follows:

$$e_i = a(u, v_i) \quad \text{(compute attention scores)},$$
$$\alpha_i = \frac{e_i}{\sum_i e_i} \quad \text{(normalize)},$$
$$c = \sum_i \alpha_i v_i \quad \text{(encode)}, \qquad (6)$$

where u is the matching feature vector based on the current task, used to interact with the context; v_i is the feature vector of a timestamp in the time series; e_i is the unnormalized attention score; α_i is the normalized attention score; and c is the context feature of the current timestamp, calculated from the attention scores and the feature sequence v. In most instances, e_i = u^T W v_i, where W is a weight matrix.
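A minimal NumPy sketch of Eq. (6) with the bilinear score e_i = u^T W v_i; note that we realize the normalization step with a softmax for numerical stability, and all dimensions are illustrative.

```python
import numpy as np

def attention_encode(u, V, W):
    """Bilinear attention: scores e_i = u^T W v_i, normalized (softmax),
    then a weighted sum of the feature vectors gives the context c (Eq. 6)."""
    e = V @ (W @ u)                   # unnormalized scores, one per timestamp
    e = np.exp(e - e.max())           # softmax variant of the normalization step
    alpha = e / e.sum()               # normalized attention weights
    return alpha @ V                  # context vector c

T, d = 16, 8                          # illustrative sequence length and feature size
u, V, W = np.random.randn(d), np.random.randn(T, d), np.random.randn(d, d)
c = attention_encode(u, V, W)
print(c.shape)                        # (8,)
```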

Second, we use the CNN unit to extract fine-grained features of the time-series data. The CNN module is formed by stacking multiple layers of one-dimensional (1-D) CNNs, where each layer includes a convolution layer, a batch normalization layer, and a non-linear layer. These modules implement sampling aggregation by using pooling layers and create hierarchical structures that gradually extract more abstract features through the stacking of convolutional layers. This module outputs m feature sequences of length n, so its output size can be expressed as (n × m). To further extract significant time-series features, we propose a parallel feature extraction branch that combines the attention mechanism and the CNN. The attention mechanism module is composed of feature aggregation and scale restoration. The feature aggregation part stacks multiple convolution and pooling layers to extract key features from the sequence and uses a convolution kernel of size 1 × 1 to mine linear relationships. The scale restoration part restores the key features to (n × m), consistent with the size of the output features of the CNN module, and then uses the sigmoid function to constrain the values to [0, 1].

Third, we multiply element-wise the output features of the CNN module and the important features output by the corresponding attention mechanism module. We assume the sequence X_i = {x_{i1}, x_{i2}, ..., x_{in}} (0 ≤ i < I). The output of the sequence X_i processed by the CNN module is denoted W_CNN, and the output of the corresponding attention module is denoted W_attention. We multiply the two outputs element by element, as follows:

$$W(i, c) = W_{\mathrm{CNN}}(i, c) \odot W_{\mathrm{attention}}(i, c), \qquad (7)$$

where ⊙ represents element-wise multiplication, i is the corresponding position of the time series in the feature layer, and c is the channel. We use the final feature layer W(i, c) as the input of the LSTM block.

We introduce the attention mechanism to expand the receptive field of the input, which allows the model to obtain more comprehensive contextual information and thereby learn the important features of the current local sequence. Furthermore, we use the attention module to suppress the interference of unimportant features, thereby solving the problem that the model otherwise cannot distinguish the importance of time-series features.
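A sketch of how the two parallel branches and the element-wise fusion of Eq. (7) could look in PyTorch; the layer sizes are illustrative and this is not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class AttentionGatedCNN(nn.Module):
    """Parallel CNN and attention branches fused element-wise (Eq. 7)."""
    def __init__(self, channels=32):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv1d(1, channels, kernel_size=3, padding=1),
            nn.BatchNorm1d(channels),
            nn.ReLU(),
        )
        self.att = nn.Sequential(
            nn.Conv1d(1, channels, kernel_size=1),  # 1x1 conv mines linear relations
            nn.Sigmoid(),                           # constrain the mask to [0, 1]
        )

    def forward(self, x):                    # x: (batch, 1, n)
        return self.cnn(x) * self.att(x)     # element-wise gating, Eq. (7)

feats = AttentionGatedCNN()(torch.randn(4, 1, 64))
print(feats.shape)                           # torch.Size([4, 32, 64])
```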

LSTM Unit: In this paper, we use a variant of the recurrent neural network, called LSTM, to accurately predict the sensing time-series data for anomaly detection, as shown in Fig. 3. LSTM uses a well-designed "gate" structure to remove or add information to the cell state; the "gate" structure is a method of selectively passing information. LSTM cells include forget gates f_t, input gates i_t, and output gates o_t. The calculations on the three gate structures are defined as follows:

$$
\begin{aligned}
f_t &= \sigma_l(W_f \cdot [h_{t-1}, x_t] + b_f),\\
i_t &= \sigma_l(W_i \cdot [h_{t-1}, x_t] + b_i),\\
\tilde{C}_t &= \tanh(W_C \cdot [h_{t-1}, x_t] + b_C),\\
C_t &= f_t * C_{t-1} + i_t * \tilde{C}_t,\\
o_t &= \sigma_l(W_o \cdot [h_{t-1}, x_t] + b_o),\\
h_t &= o_t * \tanh(C_t),
\end{aligned} \qquad (8)
$$

where W_f, W_i, W_C, W_o and b_f, b_i, b_C, b_o are the weight matrices and bias vectors for the input vector x_t at time step t, respectively; σ_l is the activation function; * represents element-wise multiplication; C̃_t is the candidate cell state; C_t is the cell state; h_{t−1} is the hidden state at time step t−1; and h_t is the hidden state at time step t.
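Eq. (8) matches the standard LSTM cell, so a single step can be sketched directly with PyTorch's built-in nn.LSTMCell (sizes illustrative):

```python
import torch
import torch.nn as nn

input_size, hidden_size = 8, 16         # illustrative dimensions
cell = nn.LSTMCell(input_size, hidden_size)

x_t = torch.randn(1, input_size)        # input vector at time step t
h_prev = torch.zeros(1, hidden_size)    # h_{t-1}
c_prev = torch.zeros(1, hidden_size)    # C_{t-1}
h_t, c_t = cell(x_t, (h_prev, c_prev))  # applies the gate equations of Eq. (8)
print(h_t.shape, c_t.shape)
```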

Anomaly Detection: We use the AMCNN-LSTM model to predict real-time and future sensing time-series data on different edge devices:

$$\left[x^i_{n-T+1}, x^i_{n-T+2}, \cdots, x^i_n\right] \xrightarrow{f(\cdot)} \left[x^i_{n+1}, x^i_{n+2}, \cdots, x^i_{n+T}\right], \qquad (9)$$

where f(·) is the prediction function. In this paper, we use the LSTM unit for time-series prediction. We use anomaly scores for anomaly detection, defined as follows:

$$A_n = (\beta_n - \mu)^T \sigma^{-1} (\beta_n - \mu), \qquad (10)$$

where A_n is the anomaly score, β_n = |x^i_n − x′^i_n| is the reconstruction error vector, and the error vectors β_n for the time series in the sequences X_i are used to estimate the parameters μ and σ of a normal distribution N(μ; σ) using maximum likelihood estimation.

In an unsupervised setting, when $A_n \ge \varsigma$, where $\varsigma = \max F_\theta = \frac{(1+\theta^2) \times P \times R}{\theta^2 P + R}$, P is precision, R is recall, and θ is a parameter, a point in a sequence is predicted to be "anomalous"; otherwise, it is "normal".
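A NumPy sketch of the anomaly scoring in Eq. (10), fitting μ and σ on stand-in reconstruction errors by maximum likelihood; the threshold value is illustrative (cf. the ς values reported in Section VI).

```python
import numpy as np

def fit_error_distribution(errors):
    """Fit mean and covariance of reconstruction error vectors via MLE."""
    mu = errors.mean(axis=0)
    sigma = np.cov(errors, rowvar=False)
    return mu, sigma

def anomaly_score(beta, mu, sigma):
    """Mahalanobis-style score A_n = (beta - mu)^T sigma^{-1} (beta - mu) (Eq. 10)."""
    diff = beta - mu
    return float(diff @ np.linalg.inv(sigma) @ diff)

errors = np.abs(np.random.randn(500, 4) * 0.1)   # stand-in reconstruction errors
mu, sigma = fit_error_distribution(errors)
beta_new = np.abs(np.random.randn(4))             # error vector for a new window
score = anomaly_score(beta_new, mu, sigma)
varsigma = 0.75                                   # illustrative threshold
print("anomalous" if score >= varsigma else "normal")
```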

B. Gradient Compression Mechanism

If the gradients reach 99.9% sparsity, only the 0.1% of gradients with the largest absolute values are useful for model aggregation [30]. Therefore, we only need to aggregate gradients with larger absolute values to update the model. This reduces the byte size of the gradient matrix, which reduces the number of gradients exchanged between the device and the cloud and improves communication efficiency, especially for distributed machine learning systems. Inspired by this fact, we propose a gradient compression mechanism to reduce the gradients exchanged between the cloud aggregator and the edge devices, and we expect this mechanism to further improve the communication efficiency of the proposed framework.

When we select gradients with larger absolute values, we encounter the following situations: (1) all gradient values in the gradient matrix are smaller than the given threshold; (2) some gradient values in the gradient matrix are very close to the given threshold. If we simply set the gradients that do not meet the threshold requirement to 0, information is lost. Therefore, the device uses a local gradient accumulation scheme to prevent information loss. Specifically, the cloud returns smaller gradients to the device instead of filtering them out. The device keeps the smaller gradients in a buffer and accumulates them until they reach the given threshold. Note that we use D-SGD for iterative updates, and the loss function to be optimized is defined as follows:

$$F(\omega) = \frac{1}{D_k} \sum_{x \in \mathcal{D}_k} f(x, \omega), \qquad (11)$$

$$\omega_{t+1} = \omega_t - \eta \frac{1}{Nb} \sum_{k=1}^{N} \sum_{x \in \mathcal{B}_{k,t}} \nabla f(x, \omega_t), \qquad (12)$$

where F(ω) is the loss function, f(x, ω) is the loss function for the local device, ω denotes the weights of the model, N is the total number of edge devices, η is the learning rate, B_{k,t} represents the data samples for the t-th round of training, and b is the size of each local batch.

When the gradient sparsity reaches a high value (e.g., 99%), it affects model convergence. Following [30], [36], we use momentum correction and local gradient clipping to mitigate this effect. Momentum correction makes the accumulated small local gradients converge toward the gradients with larger absolute values, thereby accelerating the model's convergence speed. Local gradient clipping is used to alleviate the gradient explosion problem [30]. Next, we show that the local gradient accumulation scheme does not affect model convergence. We assume that g^(i) is the i-th gradient, u^(i) denotes the sum of the gradients under the aggregation algorithm in [26], v^(i) denotes the sum of the gradients under the local gradient accumulation scheme, and m is the momentum (rate of gradient descent). If the i-th gradient does not exceed the threshold until the (t−1)-th iteration, at which point it triggers a model update, we have:

$$u^{(i)}_{t-1} = m^{t-2} g^{(i)}_1 + \cdots + m g^{(i)}_{t-2} + g^{(i)}_{t-1}, \qquad (13)$$

$$v^{(i)}_{t-1} = \left(1 + \cdots + m^{t-2}\right) g^{(i)}_1 + \cdots + (1+m) g^{(i)}_{t-2} + g^{(i)}_{t-1}, \qquad (14)$$

then we can update $\omega^{(i)}_t = \omega^{(i)}_1 - \eta \times v^{(i)}_{t-1}$ and set $v^{(i)}_{t-1} = 0$. If the i-th gradient reaches the threshold at the t-th iteration, a model update is triggered, and thus we have:

$$u^{(i)}_t = m^{t-1} g^{(i)}_1 + \cdots + m g^{(i)}_{t-1} + g^{(i)}_t, \qquad (15)$$

$$v^{(i)}_t = m^{t-1} g^{(i)}_1 + \cdots + m g^{(i)}_{t-1} + g^{(i)}_t. \qquad (16)$$

Then we can update $\omega^{(i)}_{t+1} = \omega^{(i)}_t - \eta \times v^{(i)}_t = \omega^{(i)}_1 - \eta \times \left[\left(1 + \cdots + m^{t-1}\right) g^{(i)}_1 + \cdots + (1+m) g^{(i)}_{t-1} + g^{(i)}_t\right]$, so the result of using the local gradient accumulation scheme is consistent with that of the optimization algorithm in [26].
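A minimal sketch of local gradient accumulation with momentum correction as described above; the momentum value and threshold are illustrative, and a production implementation would follow [30], [36] more closely.

```python
import numpy as np

class LocalAccumulator:
    """Keep sub-threshold gradients locally; momentum-correct before accumulating."""
    def __init__(self, shape, momentum=0.9):
        self.m = momentum
        self.u = np.zeros(shape)   # momentum buffer
        self.v = np.zeros(shape)   # local accumulation buffer

    def step(self, grad, threshold):
        self.u = self.m * self.u + grad     # momentum correction
        self.v += self.u                    # accumulate locally
        mask = np.abs(self.v) >= threshold  # entries ready to be transmitted
        send = np.where(mask, self.v, 0.0)
        self.v[mask] = 0.0                  # clear what was transmitted
        return send

acc = LocalAccumulator((4, 4))
send = acc.step(np.random.randn(4, 4) * 0.01, threshold=0.05)
print(int((send != 0).sum()), "entries transmitted this round")
```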

The specific implementation phases of the gradient compression mechanism are given as follows:

i) Phase 1, Local Training: Edge devices use the local dataset to train the local model. In particular, each device uses the gradient accumulation scheme to accumulate small local gradients.

ii) Phase 2, Gradient Compression: Each edge device uses Algorithm 1 to compress its gradients and uploads the sparse gradients (i.e., only gradients larger than a threshold are transmitted) to the cloud aggregator. Note that an edge device sends the remaining accumulated local gradients to the cloud aggregator once the local gradient accumulation exceeds the threshold.

iii) Phase 3, Gradient Aggregation: The cloud aggregator obtains the global model by aggregating the sparse gradients and sends this global model back to the edge devices.

The gradient compression algorithm is presented in Algorithm 1.

Algorithm 1: Gradient compression mechanism on edge node k.
Input: G = {g1, g2, ..., gk} is the edge node's gradient, B is the local mini-batch size, D_k is the local dataset, η is the learning rate, f(·,·) is the edge node's loss function, and SGD is the optimization function.
Output: Parameter ω.
1:  Initialize parameter ω_t;
2:  g^k ← 0;
3:  for t = 0, 1, ... do
4:      g^k_t ← g^k_{t−1};
5:      for i = 1, 2, ... do
6:          Sample data x from D_k;
7:          g^k_t ← g^k_{t−1} + (1 / (|D_k| B)) ∇f(x; ω_t);
8:      if gradient clipping is enabled then
9:          g^k_t ← LocalGradientClipping(g^k_t);
10:     foreach g^{k,j}_t ∈ {g^k_t}, j = 1, 2, ... do
11:         Thr ← |top ρ% of {g^k_t}|;
12:         if |g^{k,j}_t| ≥ Thr then send this gradient to the cloud aggregator;
13:         if |g^{k,j}_t| < Thr then edge node k uses the local gradient accumulation scheme to accumulate this gradient until it reaches Thr;
14:     Aggregate g^k_t: g_t ← Σ_{k=1}^{N} sparse(g^k_t);
15:     ω_{t+1} ← SGD(ω_t, g_t);
16: return ω.

VI. EXPERIMENTS

In this section, the proposed framework is applied to four real-world datasets, i.e., power demand¹, space shuttle², ECG³, and engine⁴, for performance demonstration. These datasets are time-series datasets collected by different types of sensors from different fields [6]. For example, the power demand dataset is composed of electricity consumption data recorded by electricity meters. These datasets contain both normal and anomalous subsequences. As shown in Table I, X, Xn, and Xa are the numbers of original sequences, normal subsequences, and anomalous subsequences, respectively. For the power demand dataset, the anomalous subsequences indicate that an electricity meter has failed or stopped working. We therefore use these datasets to train an FL model that can detect anomalies. We divide each dataset into a training set and a test set in a 7:3 ratio. We implement the proposed framework using PyTorch and PySyft [40]. The experiments are conducted on a virtual workstation with the Ubuntu 18.04 operating system, an Intel(R) Core(TM) i5-4210M CPU, 16GB RAM, and a 512GB SSD.

¹ https://archive.ics.uci.edu/ml/datasets/
² https://archive.ics.uci.edu/ml/datasets/Statlog+(Shuttle)
³ https://physionet.org/about/database/
⁴ https://archive.ics.uci.edu/ml/datasets.php


TABLE I
DETAILS OF FOUR REAL-WORLD DATASETS

Datasets        Dimensions   X    Xn    Xa
Power Demand    1            1    45    6
Space Shuttle   1            3    20    8
ECG             1            1    215   1
Engine          12           30   240   152

A. Evaluation Setup

In this experiment, to determine the hyperparameter ρ of the gradient compression mechanism, we first apply a simple CNN (i.e., a CNN with 2 convolutional layers followed by 1 fully connected layer) in the proposed framework to perform classification tasks on the MNIST and CIFAR-10 datasets. The pixels in all datasets are normalized into [0, 1]. During the simulation, the number of edge devices is N = 10, the learning rate is η = 0.001, the number of training epochs is E = 1000, the mini-batch size is B = 128, and, following [41], we set θ to 0.05.

We adopt the Root Mean Square Error (RMSE) to indicate the performance of the AMCNN-LSTM model:

$$\mathrm{RMSE} = \left[\frac{1}{n}\sum_{i=1}^{n}\left(y_i - y_p\right)^2\right]^{\frac{1}{2}}, \qquad (17)$$

where y_i is the observed sensing time-series data and y_p is the predicted sensing time-series data.
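For completeness, Eq. (17) in NumPy:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Square Error between observed and predicted series (Eq. 17)."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

print(rmse(np.array([1.0, 2.0, 3.0]), np.array([1.1, 1.9, 3.2])))
```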

Fig. 4. Accuracy of the proposed framework with different ρ on the MNIST and CIFAR-10 datasets.

Fig. 5. Performance comparison of detection accuracy for AMCNN-LSTM, CNN-LSTM, LSTM, GRU, SAEs, and SVM on different datasets: power demand, space shuttle, ECG, and engine.

B. Hyperparameters Selection of the Proposed Framework

In the context of the deep gradient compression scheme, proper hyperparameter selection, i.e., the threshold on absolute gradient values, is a notable factor that determines the proposed framework's performance. In this section, we investigate the performance of the proposed framework with different thresholds and try to find the best-performing one. In particular, we sweep ρ ∈ {0.1, 0.2, 0.3, 0.5, 0.9, 1, 100} to find the best threshold for the proposed framework, using the MNIST and CIFAR-10 datasets for evaluation. As shown in Fig. 4, we observe that the larger ρ is, the better the performance of the proposed framework. For the MNIST task, the results show that when ρ = 0.3, the accuracy is 97.25%; when ρ = 100, the accuracy is 99.08%. This means that the model increases the gradient size by about 300 times while the accuracy improves by only 1.83%. We thus observe a trade-off between the gradient threshold and accuracy. Therefore, to achieve a good trade-off between the two, we choose ρ = 0.3 as the best threshold for our scheme.

C. Performance of the Proposed Framework

We compared the performance of the proposed model with that of the CNN-LSTM [42], LSTM [41], Gated Recurrent Unit (GRU) [43], Stacked Auto-Encoders (SAEs) [44], and Support Vector Machine (SVM) [45] methods under an identical simulation configuration. Among these competing methods, AMCNN-LSTM is an FL-based model, and the rest are centralized ones. All models are popular DAD models for general anomaly detection applications. We evaluate these models on four real-world datasets, i.e., power demand, space shuttle, ECG, and engine.

Fig. 6. Performance comparison of RMSE for AMCNN-LSTM, CNN-LSTM, LSTM, GRU, SAEs, and SVM on different datasets: power demand, space shuttle, ECG, and engine.

First, we compare the accuracy of the proposed model with the competing methods in anomaly detection. We determine maxFθ and the hyperparameter ς based on the precision and recall of the model on the training set. The hyperparameters ς for the power demand, space shuttle, ECG, and engine datasets are 0.75, 0.80, 0.80, and 0.60, respectively. In Fig. 5, experimental results show that the proposed model achieves the highest accuracy on all four datasets. For example, on the power demand dataset, the accuracy of the AMCNN-LSTM model is 96.85%, which is 7.87% higher than that of the SVM model. The experimental results also show that AMCNN-LSTM is more robust across different datasets. The reason is that we use the on-device FL framework to train and update the model, which can learn time-series features from different edge devices as much as possible, thereby improving the robustness of the model. Furthermore, the FL framework provides opportunities for edge devices to update models in a timely manner, which helps edge device owners keep the models on their devices up to date.

Second, we evaluate the prediction error of the proposed model and the competing methods. As shown in Fig. 6, experimental results show that the proposed model achieves the best performance on all four real-world datasets. For the ECG dataset, the RMSE of the AMCNN-LSTM model is 63.9% lower than that of the SVM model. The reason is that the AMCNN-LSTM model uses AMCNN units to capture important fine-grained features and prevent memory loss and gradient dispersion problems, which often occur in encoder-decoder models such as LSTM and GRU. Furthermore, the proposed model retains the advantages of the LSTM unit in predicting time-series data. Therefore, the proposed model can accurately predict time-series data.

In summary, the proposed model not only accurately detects anomalies but also accurately predicts time-series data.

Fig. 7. Comparison of communication efficiency between FL with GCM and FL without GCM for different models.

D. Communication Efficiency of the Proposed Framework

In this section, we compare the communication efficiency of the FL framework with the gradient compression mechanism (GCM) against the traditional FL framework without GCM. We apply the same models (i.e., AMCNN-LSTM, CNN-LSTM, LSTM, GRU, SAEs, and SVM) in the proposed framework and the traditional FL framework. Note that we fix the communication overhead of each round, so we can compare the running times of the models to compare communication efficiency. In Fig. 7, we show the running time of FL with GCM and FL without GCM using different models. We observe that the running time of the FL framework with GCM is about 50% of that of the framework without GCM. The reason is that GCM reduces the number of gradients exchanged between the edge devices and the cloud aggregator. In Section VI-B, we showed that GCM can compress the gradients by 300 times without compromising accuracy. Therefore, the proposed communication-efficient framework is practical and effective in real-world applications.

E. Discussion

Due to the trade-off between privacy and model performance, we discuss the privacy properties of the proposed framework in terms of data access and model performance:

• Data Access: The FL framework allows edge devices to keep their datasets locally and collaboratively learn deep learning models, which means that no third party can access the users' raw data. Therefore, the FL-based model can achieve anomaly detection without compromising privacy.

• Model Performance: Although the FL-based model can protect privacy, model performance is still an important metric for the quality of the model. The experimental results show that the performance of the proposed model is comparable to many advanced centralized machine learning models, such as the CNN-LSTM, LSTM, GRU, and SVM models. In other words, the proposed model strikes a good compromise between privacy and model performance.

VII. CONCLUSION

In this paper, we propose a novel communication-efficient on-device FL-based deep anomaly detection framework for sensing time-series data in IIoT. First, we introduce an FL framework that enables decentralized edge devices to collaboratively train an anomaly detection model, which solves the problem of data islands. Second, we propose an attention mechanism-based CNN-LSTM (AMCNN-LSTM) model to accurately detect anomalies. The AMCNN-LSTM model uses attention mechanism-based CNN units to capture important fine-grained features and prevent memory loss and gradient dispersion problems, while retaining the advantages of the LSTM unit in predicting time-series data. We evaluate the performance of the proposed model on four real-world datasets and compare it with the CNN-LSTM, LSTM, GRU, SAEs, and SVM methods; the experimental results show that the AMCNN-LSTM model achieves the highest accuracy on all four datasets. Third, we propose a gradient compression mechanism based on Top-k selection to improve communication efficiency; experimental results validate that this mechanism can compress the gradients by 300 times without losing accuracy. To the best of our knowledge, this is one of the pioneering works on deep anomaly detection using on-device FL.

In the future, we will focus on privacy-enhanced FL frameworks and more robust anomaly detection models, because the FL framework is vulnerable to attacks by malicious participants and a more robust model can be applied to a wider range of application scenarios.

Yi Liu (S'19) received the B.Eng. degree in network engineering from Heilongjiang University, Harbin, China, in 2019. He is currently pursuing a Ph.D. degree at the Faculty of Information Technology, Monash University, Melbourne, Australia. His research interests include security & privacy, federated learning, edge computing, and blockchain.

Sahil Garg (S'15, M'18) is a postdoctoral research fellow at École de technologie supérieure, Université du Québec, Montréal, Canada. He received his Ph.D. degree from the Thapar Institute of Engineering and Technology, Patiala, India, in 2018. He has made many research contributions in the areas of machine learning, big data analytics, security and privacy, the Internet of Things, and cloud computing. He has over 50 publications in highly ranked journals and conferences, including 25+ IEEE transactions/journal papers. He received the IEEE ICC Best Paper Award in 2018 in Kansas City, Missouri. He serves as the Managing Editor of Springer's Human-Centric Computing and Information Sciences journal. He is also an Associate Editor of IEEE Network, IEEE Systems Journal, Elsevier's Applied Soft Computing, Future Generation Computer Systems, and Wiley's International Journal of Communication Systems. In addition, he serves as a Workshops and Symposia Officer of the IEEE ComSoc Emerging Technology Initiative on Aerial Communications. He has guest edited a number of special issues in top-cited journals, including IEEE T-ITS, IEEE TII, the IEEE IoT Journal, IEEE Network, and Future Generation Computer Systems. He serves/served as the Workshop Chair/Publicity Co-Chair for several IEEE/ACM conferences, including IEEE INFOCOM, IEEE GLOBECOM, IEEE ICC, ACM MobiCom, and more. He is a member of ACM.

Jiangtian Nie received her B.Eng. degree with honors in Electronics and Information Engineering from Huazhong University of Science and Technology, Wuhan, China, in 2016. She is currently working towards the Ph.D. degree with ERI@N in the Interdisciplinary Graduate School, Nanyang Technological University, Singapore. Her research interests include incentive mechanism design in crowdsensing and game theory.


Yang Zhang (M'11) is currently an associate professor at the School of Computer Science and Technology, Wuhan University of Technology, China. He received B.Eng. and M.Eng. degrees from Beijing University of Aeronautics and Astronautics in 2008 and 2011, respectively, and obtained a Ph.D. degree in Computer Engineering from Nanyang Technological University (NTU), Singapore, in 2015. He is an associate editor of EURASIP Journal on Wireless Communications and Networking, and a technical committee member of Computer Communications (Elsevier). He is also an advisory expert member of Shenzhen FinTech Laboratory, China. His current research interests include market-oriented modeling for network resource allocation, multi-agent machine learning, and deep reinforcement learning in network systems.

Zehui Xiong (S'17, M'20) is currently a researcher with the Alibaba-NTU Singapore Joint Research Institute, Singapore. He obtained the B.Eng. degree with the highest honors in Telecommunications Engineering at Huazhong University of Science and Technology, Wuhan, China, in July 2016. He received the Ph.D. degree in Computer Science and Engineering at Nanyang Technological University, Singapore, in April 2020. He was a visiting scholar at Princeton University and the University of Waterloo. His research interests include resource allocation in wireless communications, network games and economics, blockchain, and edge intelligence. He has won several Best Paper Awards, including at IEEE WCNC 2020, and the IEEE Vehicular Technology Society Singapore Best Paper Award in 2019. He is an Editor for Elsevier Computer Networks and Elsevier Physical Communication, and an Associate Editor for IET Communications. He serves as a Guest Editor for IEEE Transactions on Cognitive Communications and Networking, IEEE Open Journal of Vehicular Technology, and EURASIP Journal on Wireless Communications and Networking. He is the recipient of the Chinese Government Award for Outstanding Students Abroad in 2019, and the NTU SCSE Outstanding PhD Thesis Runner-Up Award in 2020.

Jiawen Kang received the M.S. degree from the Guangdong University of Technology, China, in 2015, and the Ph.D. degree from the same school in 2018. He is currently a postdoc at Nanyang Technological University, Singapore. His research interests mainly focus on blockchain, security, and privacy protection in wireless communications and networking.

M. Shamim Hossain (SM'19) is a Professor at the Department of Software Engineering, College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia. He is also an adjunct professor at the School of Electrical Engineering and Computer Science, University of Ottawa, Canada. He received his Ph.D. in Electrical and Computer Engineering from the University of Ottawa, Canada. His research interests include cloud networking, smart environments (smart city, smart health), AI, deep learning, edge computing, the Internet of Things (IoT), multimedia for health care, and multimedia big data. He has authored and coauthored more than 250 publications, including refereed journal and conference papers, books, and book chapters. Recently, he co-edited a book on "Connected Health in Smart Cities," published by Springer. He has served as co-chair, general chair, workshop chair, publication chair, and TPC member for over 12 IEEE and ACM conferences and workshops. Currently, he is the co-chair of the 3rd IEEE ICME Workshop on Multimedia Services and Tools for Smart Health (MUST-SH 2020). He is a recipient of a number of awards, including the Best Conference Paper Award and the 2016 ACM Transactions on Multimedia Computing, Communications and Applications (TOMM) Nicolas D. Georganas Best Paper Award. He is on the editorial board of IEEE Transactions on Multimedia, IEEE Multimedia, IEEE Network, IEEE Wireless Communications, IEEE Access, Journal of Network and Computer Applications (Elsevier), and International Journal of Multimedia Tools and Applications (Springer). He also presently serves as a lead guest editor of the Multimedia Systems journal. He serves/served as a guest editor of IEEE Communications Magazine, IEEE Network, ACM Transactions on Internet Technology, ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), IEEE Transactions on Information Technology in Biomedicine (currently JBHI), IEEE Transactions on Cloud Computing, Multimedia Systems, International Journal of Multimedia Tools and Applications (Springer), Cluster Computing (Springer), and Future Generation Computer Systems (Elsevier). He is a senior member of both IEEE and ACM.