

IEEE INTERNET OF THINGS JOURNAL, VOL. 6, NO. 2, APRIL 2019 2651

Observing the Pulse of a City: A Smart City Framework for Real-Time Discovery, Federation, and Aggregation of Data Streams

Sefki Kolozali, Maria Bermudez-Edo, Nazli FarajiDavar, Payam Barnaghi, Feng Gao, Muhammad Intizar Ali, Alessandra Mileo, Marten Fischer, Thorben Iggena, Daniel Kuemper, and Ralf Tonjes

Abstract—An increasing number of cities are confronted with challenges resulting from rapid urbanization and the new demands that a rapidly growing digital economy imposes on current applications and information systems. Smart city applications enable city authorities to monitor, manage, and provide plans for public resources and infrastructures in city environments, while enabling citizens and businesses to develop and use intelligent services in cities. However, providing such smart city applications gives rise to several issues, such as semantic heterogeneity and trustworthiness of data sources, and extracting up-to-date information in real time from large-scale dynamic data streams. To address these issues, we propose a novel framework with an efficient semantic data processing pipeline, allowing for real-time observation of the pulse of a city. The proposed framework enables efficient semantic integration of data streams, and complex event processing on top of real-time data aggregation and quality analysis in a semantic Web environment. To evaluate our system, we use real-time sensor observations that have been published via an open platform called Open Data Aarhus by the City of Aarhus. We examine the framework utilizing symbolic aggregate approximation to reduce the size of data streams, and perform quality analysis taking into account both single and multiple data streams. We also investigate the optimization of the semantic data discovery and integration based on the proposed stream quality analysis and data aggregation techniques.

Index Terms—Complex event processing, Internet of Things (IoT), quality analysis, smart cities, time series analysis.

Manuscript received May 21, 2018; revised August 22, 2018; accepted September 16, 2018. Date of publication September 28, 2018; date of current version May 8, 2019. This work was supported in part by the European Commission's Seventh Framework Programme for the CityPulse Project under Grant 609035, and in part by the European Commission's Horizon 2020 IoTCrawler under Grant 779852. (Corresponding author: Sefki Kolozali.)

S. Kolozali is with the Institute for Analytics and Data Science (IADS), School of Computer Science and Electronic Engineering, University of Essex, Essex CO4 3ZL, U.K., and also with the MRC-PHE Centre for Environment and Health, King's College, London SE1 9NH, U.K. (e-mail: [email protected]).

M. Bermudez-Edo is with the Institute for Communication Systems, Electronic Engineering Department, University of Surrey, Guildford GU2 7XH, U.K., and also with the School of Information Technology and Telecommunications, University of Granada, 18014 Granada, Spain.

N. FarajiDavar is with the Institute for Communication Systems, Electronic Engineering Department, University of Surrey, Guildford GU2 7XH, U.K., and also with the Institute of Biomedical Engineering, University of Oxford OX3 7DQ, U.K.

P. Barnaghi is with the Institute for Communication Systems, Electronic Engineering Department, University of Surrey, Guildford GU2 7XH, U.K.

F. Gao and M. I. Ali are with the Insight Centre for Data Analytics, National University of Ireland, Galway, H91 TK33 Ireland.

A. Mileo is with the Insight Centre for Data Analytics, Dublin City University, Dublin 9, D09 V209 Ireland.

M. Fischer, T. Iggena, D. Kuemper, and R. Tonjes are with the Faculty of Engineering and Computer Science, University of Applied Sciences Osnabrück, 49074 Osnabrück, Germany.

Digital Object Identifier 10.1109/JIOT.2018.2872606

I. INTRODUCTION

OVER centuries, cities have been a flourishing place for people on account of prosperity and socio-economic prospects. While citizens of a city form the key facet of city systems, city systems need to serve the demands of their inhabitants and elaborate interactions between different city applications. City authorities, however, encounter several difficulties in implementing, sustaining, and optimizing operations and interactions among different city departments and services [1]. Therefore, smart cities seek to improve the living standards and quality of life of their citizens through the utilization of advancements in the Internet of Things (IoT) to tackle common urban challenges, such as reducing energy consumption, traffic congestion [2], and environmental pollution [3], [4]. Smart cities rely on data streams collected from various sensors (e.g., traffic congestion level, air quality, and trash bin levels) to observe the pulse of cities. The City of Aarhus provides an open data platform called Open Data Aarhus (ODAA),¹ which contains city-related information generated by various sensors deployed within the city. Such platforms allow city-related data to be published and shared. However, a vast number of sensors will be connected to the Internet in the future, leading to consequent challenges in the utilization and analysis of IoT data, so the interpretation of time-series data and the discovery of city events remain among the most vital challenges for the smooth operation of cities [5].

In smart cities, collection, transmission, and processing of observations in IoT environments should be efficient. To cope with large volumes of data, dimensionality reduction techniques can be applied to reduce the size of the data while maintaining essential information and patterns in aggregated data. This allows the communication overhead in smart city frameworks (SCFs) to be reduced and enables more advanced tasks to be performed at large scale, such as clustering, outlier detection, and event detection. However, most of the existing approaches transmit raw sensor data constantly [6]–[13]. Therefore, novel methods are required to provide aggregated data transmission to save processing power while distributing data in the whole network. In the meantime, semantic representation of the aggregations and abstractions is crucial to provide machine-interpretable observations for higher-level interpretations of the real-world context.

¹ [Online]. Available: http://www.odaa.dk

2327-4662 © 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

Reliability, inconsistency, and incompleteness are other challenges in smart cities. Since city data is provided by diverse sources, the quality of data varies: it may include errors and missing values, and in some cases data collected from citizens (e.g., smart phone sensors or social media) can be biased. Determining and managing the trustworthiness of information sources is an important step toward safe and secure operation of smart cities. Provenance information is one of the most important features to keep track of the source of information and assess the trustworthiness of different information sources. However, although there is an increasing number of studies in this field [14]–[16], there remains a need to develop techniques for real-time monitoring of heterogeneous data streams to detect reliability violations and adapt data acquisition accordingly.

In addition to data aggregation, abstraction, and quality of information (QoI) issues, smart city applications need to handle discovery and integration of heterogeneous data sources, and extract up-to-date information in real time to enable further complex processing. Since current semantic service discovery and composition approaches (e.g., WSMO, OWL-S) are based on input, output, precondition, and effect (IOPE), they are not well suited to describe complex event processing services with event patterns and to cope with dynamic and resource-constrained data streams.

In this paper, we present a novel framework developed within the EU FP7 CityPulse project² for real-time data stream analysis using data aggregation, quality analysis, and optimization techniques to improve the performance of real-time semantic integration of data streams and complex event processing. Using a real-world traffic dataset that is collected from the City of Aarhus, we evaluate the performance of the framework in three parts.

1) Data aggregation using the symbolic aggregate approximation (SAX) algorithm.

2) Quality analysis of data streams based on analysis of single data streams as well as evaluation of multiple data stream sources describing the city traffic flow.

3) Semantic integration of data using the estimation and assessment of Quality of Service (QoS) for event service compositions using a genetic algorithm (GA).

The remainder of this paper is organized as follows. Section II describes the related work. Section III demonstrates the proposed framework and semantic annotation of each stage including data federation, aggregation, abstraction, and reliability processing of data streams. Section IV provides a use case scenario and evaluations of the proposed framework. Section V details a discussion and describes the future work.

² [Online]. Available: http://www.ict-citypulse.eu/page/

II. RELATED WORK

In this section, we present the existing smart city and IoT frameworks and we discuss the related work in the three main aspects of the study, namely, 1) semantic annotation and data aggregation; 2) QoI, stream dependency and correlating information sources; and 3) semantic data integration and complex event processing.

A. Smart City Frameworks

Several IoT frameworks, and in particular SCFs, have been developed in recent years, enabling access to sensory data and interaction with sensors. One of the pioneers in IoT platforms is global sensor networks (GSNs) [17], which offers a federation of sensor networks by means of XML-based deployment descriptors that homogenize the sensory data. More recently, the de facto standard IoT architecture, IoT-A [18], has provided a semantic description based on the concept of sensing-as-a-service that other platforms, such as IoT.est, have adopted. IoT.est [10] is a dynamic service and test framework that allows the creation of dynamically composed services that provide access to heterogeneous sensory data. OpenIoT [9] is an instantiation of IoT-A [18] and, similar to IoT.est, it uses the concept of sensing-as-a-service and provides cloud-based IoT services including a middleware platform and tools for application development. It provides access to an enhanced version of GSN and uses semantics for better interoperability, based on the IoT-A model. Some well-known examples of this architecture that are focused solely on providing a data platform, semantic modeling, and/or semantic sensor discovery are Km4city [6], SmartSantander [7], Spitfire [8], OpenIoT [9], IoT.est [10], OpenCube [11], and iCore [12]. Start-City [13] semantically annotates and aggregates traffic data to predict spatio-temporal traffic conditions. The ontology here is domain specific (traffic domain), without generic concepts that can be reused for different kinds of domains. The aggregation of the data is based on traditional aggregation methods (e.g., average, maximum, and minimum). OneM2M³ is working on a new standard for machine-to-machine communications, covering aspects such as protocols, security, services, and data management. However, data aggregation at the framework layer is not supplied, at least in the current version of OneM2M. This initiative is also studying the possibility of adding semantics to the standard through the first draft of SAREF [19]. In its current version, SAREF only covers household appliances at a physical level. A meta model was built in [20] in order to annotate the sensory data. While that work introduced numerous metadata to represent sensory observations and devices, it suffers from not having any links to the well-known W3C semantic sensor network ontology.⁴

There has been a recent study [21] in which the authors tested the performance of their IoT middleware based on memory consumption and efficiency. It was tested particularly on smart phones. Although the purpose of this paper also includes efficient communication in IoT systems, developing a unified middleware application is beyond its scope. Recently, there have been similar studies in the smart city domain, such as [22] and [23]. However, while these studies contributed to smart city systems in their own right, none of them tackles semantic interoperability, data aggregation, complex event processing, and QoI all at the same time in the semantic Web environment for smart city systems.

³ [Online]. Available: http://www.onem2m.org/

⁴ [Online]. Available: https://www.w3.org/TR/vocab-ssn/

Nonetheless, none of these platforms involves either ontologies to handle generic annotations for heterogeneous data streams or an effective time-series data analysis approach for the sensory domain. Consequently, there still remains a need for efficient semantic knowledge representation and adaptive aggregation of sensory observations that handles real-time or almost real-time annotations in dynamic environments, such as smart cities.

B. Time-Series Data Aggregation and Abstraction

Regarding time-series analysis, there is a need for much smaller storage space and faster processing due to the increasing size and number of channels of available time-series data streams. Data reduction often serves as the first step in an effort to keep a good-quality synopsis of data. The lower bounding principle assures that the meaning of the data is preserved by keeping the distance between two time-series data streams in the reduced space (i.e., aggregated data) less than or equal to the true distance of the time-series data (i.e., the original data). Some of the well-known algorithms for numerical representation of time series that accomplish good-quality data reduction and satisfy the lower bounding principle are the discrete Fourier transform (DFT) [24], the discrete wavelet transform (DWT) [25], [26], singular value decomposition (SVD) [27], and piecewise aggregate approximation (PAA) [28]. DFT transforms time-series data into the frequency domain by obtaining a single complex number, called a Fourier coefficient, for each signal by the superposition of a finite number of sine (and/or cosine) waves. Once the data has been transformed into the frequency domain, the Fourier coefficients with high amplitude are retained and the low-amplitude coefficients are discarded, which compresses the data without much loss of information. However, although DFT satisfies the lower bounding principle by Parseval's theorem, its computational complexity (i.e., O(n^2) time, or O(n log n) time with the algorithm in [29]) and the choice of the best number of coefficients are the main challenges of this algorithm [30]. While Fourier coefficients always represent global contributions of the data, DWT represents small, local segments of the data with a lower computational complexity (O(n)) and satisfies the lower bounding principle. Its limitation is that the data length must be a power of two (n = 2^m). In parallel, SVD can also perform dimensionality reduction by optimally transforming a dataset into a new k-dimensional dataset based on the k largest singular values. However, it requires the entire dataset prior to transformation to perform dimensionality reduction, and it cannot work incrementally since a new data insertion requires a new global computation. On the other hand, PAA is a very simple and strongly competitive method compared to more sophisticated transforms, such as DFT and DWT. Its computational complexity is low (O(n)), and it supports the lower bounding principle, which can be simply calculated using the distances on the PAA representation.
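As an illustration of PAA and of the lower bounding principle described above, the following is a minimal sketch in Python. It assumes equal-width frames and a series length that is a multiple of the number of frames; the function names and the sample readings are hypothetical and are not taken from the CityPulse implementation. The distance in the reduced space is scaled by sqrt(n/w), which is the usual form of the PAA lower bound.

```python
import math

def paa(series, segments):
    """Piecewise aggregate approximation: the mean of each equal-width frame."""
    n = len(series)
    if n % segments != 0:
        # Simplifying assumption of this sketch, not a limitation of PAA itself.
        raise ValueError("len(series) must be a multiple of segments")
    width = n // segments
    return [sum(series[i:i + width]) / width for i in range(0, n, width)]

def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def paa_distance(paa_a, paa_b, n):
    """Distance in the reduced space, scaled so that it lower-bounds the true distance."""
    w = len(paa_a)
    return math.sqrt(n / w) * euclidean(paa_a, paa_b)

# Hypothetical average-speed readings (km/h) from two traffic sensors.
a = [52, 50, 47, 45, 30, 28, 33, 35]
b = [48, 49, 51, 50, 41, 39, 40, 42]

reduced_a, reduced_b = paa(a, 4), paa(b, 4)
print(reduced_a, reduced_b)
# The reduced-space distance never exceeds the distance on the original data.
print(paa_distance(reduced_a, reduced_b, len(a)) <= euclidean(a, b))
```

Here eight observations per sensor are reduced to four frame means, and the comparison in the last line checks the lower bounding property on this particular pair of series.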

Contrary to the numerical approaches, the discretization of the original data into symbolic strings has not been considered in great detail. Even though it seems a straightforward solution, it comes with substantial advantages: existing algorithms and data structures enable efficient manipulation of symbolic representations, in addition to allowing the framing of time. However, general symbolic representation methods are not capable of calculating distance in symbolic space and supporting lower bounding at the same time [31], [32]. SAX is the best-known symbolic representation technique in time-series data mining [33]. Because it is based on PAA, it is not computationally expensive, and it ensures both considerable dimensionality reduction and lower bounding support. Therefore, we use the SAX algorithm to obtain a reduced representation of time-series data.

C. Quality of Information, Stream Dependency and Correlating Information Sources

When processing data streams from external sources, an often underestimated part is the evaluation of the quality of the provided data. Strong et al. [34] described the impacts of faulty information. Their work is focused on the utilization of large-scale databases that contain data from multiple data sources, where it has been pointed out that faulty, incorrect, or incomplete data can cost billions of dollars. To describe possible issues affecting the utilization of information, they defined four categories with a list of dimensions for data quality. These categories are used to describe the quality of fixed data sets, such as databases (in contrast to smart city data streams). In this paper a (re)evaluation of the QoI for incoming sensor observations is envisaged, reflecting the dynamics of smart cities.

While the work by Wang and Strong is the theoretical cornerstone for the categorization of quality metrics, Stvilia et al. [35] presented a framework for the assessment of information quality. The framework is designed as an abstract model and can be extended for specific quality measurement settings. However, it requires a domain expert to define and implement quality measurement rules for each data set. Thus, it is not feasible in a smart city context, where new data streams need to be integrated fairly quickly and frequently. This requires a generic approach that automatically adapts to new data streams.

A development toward application- or context-independent quality measurement is described by Bisdikian et al. [36]. To describe both application-dependent and application-independent quality metrics, they divided the quality analysis into two main parts, namely QoI and value of information (VoI). A UML-based data model is introduced supporting both parts of the quality analysis in order to provide a general template for organizing QoI/VoI meta-data, allowing an effective exchange of QoI/VoI data in a repeatable, consistent, and reproducible manner. However, it lacks a machine-interoperable data format.

D. Complex Event Processing and QoS-Aware Event Service Composition

Event processing systems are crucial for smart city applications to extract high-level and complex situations from low-level information. Traditional event notification systems only allow filtering over primitive-level events, where each event is considered an individual entity [37]. As an advancement of the traditional event notification systems, complex event processing systems can filter and correlate multiple events by matching a specific pattern [38], [39]. Harnessing the benefits of semantic technologies, ontology-based event specification has been presented in [40]. However, most of the existing complex event ontologies [41], [42] lack the expressivity to describe a broader range of complex event operators. Moreover, these ontologies do not include the semantic description of nonfunctional properties within specifications of complex events, and hence hinder the implementation of a quality-aware complex event composition.
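To make the difference between filtering primitive events and correlating several events by a pattern concrete, the sketch below matches a simple sequence pattern: one kind of event followed by another within a time window. It is only an illustration; the event kinds, the window length, and the function name are hypothetical and do not correspond to the operators of any particular event engine or ontology cited above.

```python
from collections import namedtuple

# A primitive event: observation time (s), event kind, and originating sensor.
Event = namedtuple("Event", ["timestamp", "kind", "source"])

def detect_sequence(events, first_kind, second_kind, window):
    """Return (first, second) pairs in which an event of `second_kind` follows
    an event of `first_kind` within `window` seconds.
    The input is assumed to be sorted by timestamp."""
    matches = []
    pending = []  # first_kind events that are still inside the window
    for event in events:
        # Drop candidates that are too old to be matched any more.
        pending = [p for p in pending if event.timestamp - p.timestamp <= window]
        if event.kind == second_kind:
            matches.extend((p, event) for p in pending)
        if event.kind == first_kind:
            pending.append(event)
    return matches

# Hypothetical primitive events derived from traffic sensors.
events = [
    Event(0, "low_speed", "s1"),
    Event(40, "high_occupancy", "s1"),
    Event(200, "high_occupancy", "s2"),
    Event(210, "low_speed", "s2"),
    Event(230, "high_occupancy", "s2"),
]

for first, second in detect_sequence(events, "low_speed", "high_occupancy", 60):
    print("possible congestion:", first.timestamp, "->", second.timestamp)
```

A filter over primitive events would report each of the five events separately, whereas the pattern reports only the two pairs that occur in the required order and within the window.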

QoS-aware service composition has been studied extensively. The first step in solving the QoS-aware service composition problem is to define a QoS model, a set of QoS aggregation rules, and a utility function. Existing works have discussed these topics extensively, e.g., in [43]–[45]. However, the aggregation rules in existing works focus on conventional Web services rather than complex event services, which need a different QoS aggregation schema. For example, the event engine also has an impact on the QoS aggregation, which is not considered in conventional QoS aggregation for services. Also, the aggregation rules for some QoS properties based on event composition patterns are different from those based on workflow patterns (as in [44]).

As a second step, different concrete service compositions are created and compared with regard to their QoS utilities to determine the optimal choice. To achieve this efficiently, various heuristic approaches have been developed. There are two prominent strands: 1) integer programming (IP) (e.g., [43], [46], and [47]) and 2) GA (e.g., [48]–[51]) based solutions.

In [46] the limitation of local optimization and the necessity of global planning are elaborated. The authors propose to address the complexity problem of global planning by introducing an IP-based solution with a simple additive weighting-based utility function to determine the desirability of an execution plan. This approach is extended in [47] with more heuristics to promote efficiency. In [43] a hybrid approach of local and global optimization is proposed, in which global constraints are delegated to local tasks, and the constraint delegation is modeled as an IP-based optimization problem. The problem with IP-based solutions is that they require QoS metrics to be linear, and they do not address the service replanning problem.
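A utility function based on simple additive weighting can be sketched as follows. The sketch assumes min-max normalization over the candidate plans and treats each QoS metric as either a benefit or a cost; the metric names, weights, and plan values are hypothetical, and the exact normalization used in [46] may differ.

```python
def saw_utility(plans, weights, higher_is_better):
    """Score each plan as a weighted sum of min-max normalized QoS values."""
    scores = {}
    for name, qos in plans.items():
        total = 0.0
        for metric, weight in weights.items():
            values = [plan[metric] for plan in plans.values()]
            lo, hi = min(values), max(values)
            if hi == lo:
                normalized = 1.0  # the metric does not discriminate between plans
            elif higher_is_better[metric]:
                normalized = (qos[metric] - lo) / (hi - lo)
            else:
                normalized = (hi - qos[metric]) / (hi - lo)
            total += weight * normalized
        scores[name] = total
    return scores

# Hypothetical aggregated QoS values of three candidate execution plans.
plans = {
    "plan_a": {"latency_ms": 120, "price": 4.0, "reliability": 0.99},
    "plan_b": {"latency_ms": 80, "price": 6.0, "reliability": 0.95},
    "plan_c": {"latency_ms": 200, "price": 2.0, "reliability": 0.90},
}
weights = {"latency_ms": 0.5, "price": 0.2, "reliability": 0.3}
higher_is_better = {"latency_ms": False, "price": False, "reliability": True}

scores = saw_utility(plans, weights, higher_is_better)
best = max(scores, key=scores.get)
print(best, scores)
```

Because the score is a linear combination of normalized metrics, it fits the linearity requirement of IP-based solvers noted above; the weights encode the relative importance a user assigns to each QoS property.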

In [48] the chromosomes are encoded with binary bits representing whether a service is selected or not. The problem with this approach is that the readability of the genomes is poor and the chromosome length is not fixed during evolution. Canfora et al. [49] used a different encoding approach, which leads to a fixed chromosome length. In [50] a 2-D genome encoding is proposed to express all execution paths while considering task relations, but crossover and mutation need validation. Gao et al. [51] used tree-coded chromosomes, in which crossover operates on subtrees and mutation operates on leaf nodes to avoid invalid reproductions. Karatas and Kesdogan [52] developed a GA-based approach that goes beyond QoS-aware composition and enables compliance-aware service composition.
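A fixed-length encoding of the kind discussed above can be illustrated with a small GA in which each gene holds the index of the concrete service chosen for one abstract task. This is a minimal sketch under several assumptions: the tasks are executed sequentially (so latency is summed and reliability is multiplied), the fitness function and all candidate values are hypothetical, and the operators are plain one-point crossover and per-gene mutation rather than those of any cited work.

```python
import random

# Hypothetical candidates: for each abstract task, a list of concrete services
# described by (latency in ms, reliability).
CANDIDATES = [
    [(120, 0.99), (80, 0.95), (200, 0.90)],
    [(60, 0.97), (90, 0.99)],
    [(150, 0.92), (100, 0.96), (70, 0.90), (110, 0.98)],
]

def fitness(chromosome):
    """Reward reliability and penalize latency for a sequential plan."""
    latency = sum(CANDIDATES[t][g][0] for t, g in enumerate(chromosome))
    reliability = 1.0
    for t, g in enumerate(chromosome):
        reliability *= CANDIDATES[t][g][1]
    return reliability - latency / 1000.0

def random_chromosome(rng):
    """One gene per task; the gene value indexes the selected service."""
    return [rng.randrange(len(options)) for options in CANDIDATES]

def crossover(a, b, rng):
    """One-point crossover; the chromosome length stays fixed."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]

def mutate(chromosome, rng, rate=0.1):
    """Replace each gene, with probability `rate`, by another valid index."""
    return [
        rng.randrange(len(CANDIDATES[t])) if rng.random() < rate else g
        for t, g in enumerate(chromosome)
    ]

def evolve(generations=30, population_size=20, seed=0):
    rng = random.Random(seed)
    population = [random_chromosome(rng) for _ in range(population_size)]
    for _ in range(generations):
        # Keep the better half and refill the population with their offspring.
        population.sort(key=fitness, reverse=True)
        parents = population[: population_size // 2]
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(population_size - len(parents))
        ]
        population = parents + children
    return max(population, key=fitness)

best_plan = evolve()
print(best_plan, fitness(best_plan))
```

Since every gene position corresponds to exactly one task and every gene value is a valid service index, crossover and mutation cannot produce an invalid plan, which is the practical benefit of a fixed-length encoding over the binary selection bits of [48].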

The above GA-based approaches can only evaluate service composition plans with fixed sets of service tasks (abstract services) and cannot evaluate composition plans that are semantically equivalent but consist of different service tasks, i.e., service tasks on different granularity levels. A more recent work in [45] addresses this issue by presenting the concept of generalized component services (GCSs) and developing the GA encoding techniques and genetic operators based on GCS. Results in [45] indicate that up to 10% utility enhancement can be obtained by expanding the search space. Composing events on different granularity levels is also a desired feature for complex event service composition. However, the work described in [45] only caters for IOPE-based service compositions, which create composition plans as imperative workflows. Complex event service composition creates declarative event patterns (transformed from the requested event pattern with the same event semantics) as composition plans. It requires an event pattern-based reuse mechanism [53] and, as a result, different genetic encoding mechanisms and crossover operations are needed.

III. LARGE SCALE DATA ANALYSIS WITH CITYPULSE FRAMEWORK

In this section, we present a large-scale data analysis framework for smart cities. The proposed framework enables semantic annotation and analysis of IoT and social media data streams by taking into account dimensionality reduction and reliability processing. It involves the following units: 1) virtualization; 2) federation; 3) aggregation; and 4) reliable information processing. While the first three units are part of the large-scale data analysis component, the last unit has been divided into atomic and composite monitoring units. The atomic reliable information processing components are called QoI units, and they assess information quality at the first appearance of the data in the IoT system. Fig. 1 depicts the main components and basic workflow of the framework.

Initially, the sensor nodes transmit either raw or aggregateddata to the virtualization component. After collecting the data,it forwards it to three components of the system, namely,semantic annotation, data federation, and reliable informa-tion processing. The data federation component discovers andcomposes data streams to answer queries over multiple streamsin real-time. In parallel, the pattern creation and discretiza-tion process is applied in the data aggregation component.Afterward, the event detection component performs abstrac-tion process. The abstracted data is finally accessible through


Fig. 1. Architectural overview of the CityPulse framework. The examined units have been illustrated in shaded areas.

the middleware where different layers, such as the real-time intelligence layer or the graphical user interface, can access the data. In parallel, the reliable information processing aims at the annotation of information sources with QoI. Therefore, it conducts active evaluation of QoI for data sources and their steady adaptation, triggered by events that could be sent by the monitoring components and the event management components of other CityPulse framework modules. Fig. 2 illustrates an example snapshot from the 3-D visualization tool of the CityPulse framework.

A. Virtualization

The virtualization component facilitates access to heterogeneous data sources and infrastructure, concealing the technical facets of data streams, such as location, storage structure, access format, and streaming technology. The system designates various wrappers to encompass a large number of input formats, while it provides a unified format as output (i.e., RDF or Turtle). Describing the obtained sensor data stream for interoperability or facilitated search is the core objective of this component. However, the amount of IoT data streams can be voluminous, while the details are often provided by resource-constrained devices with limited bandwidth, memory, or power. Therefore, the information model that is being used by the IoT and SCFs not only needs to explicitly represent the meaning and relationships of terms in vocabularies but also should be lightweight in order to reduce the traffic and processing time. In our experiments, we used CityPulse information models, namely, stream annotation ontology,5 quality ontology,6 and complex event processing ontology7 for semantic representation of data aggregation, quality, and complex event services, which are extensions of existing ontologies, such as W3C SSN and other data annotation frameworks to describe the streams and their resources as well as quality related features of the

5[Online]. Available: http://iot.ee.surrey.ac.uk/citypulse/ontologies/sao/sao
6[Online]. Available: https://mobcom.ecs.hs-osnabrueck.de/cp_quality/
7[Online]. Available: http://citypulse.insight-centre.org/ontology/ces/

TABLE I: BREAKPOINTS THAT DIVIDE A GAUSSIAN DISTRIBUTION IN AN ARBITRARY NUMBER FROM 3 TO 10 OF EQUIPROBABLE REGIONS

data. In the implementation stage, we used a semantic annotation library, called SAOPY,8 which involves the ontologies given above along with state-of-the-art IoT-related ontologies.

B. Data Aggregation and Abstraction

The SAX algorithm [33] transforms a time series into a discretized series of letters, i.e., a word. It divides time series data into equal segments and then creates a string representation for each segment. The algorithm involves three main steps, namely normalization, PAA, and discretization of the aggregated data. Initially, time series data is normalized to have a mean of zero and standard deviation of one before converting it to PAA. Afterward, PAA divides the original data into the desired number of windows and calculates the average of the data falling into each window. This results in a reduction of data size. A shorter window length n results in a better reconstruction of the original data; however, more data space is needed to store the data, and energy consumption eventually rises through higher communication costs. The calculation of each window of PAA segments is given as

$$\bar{z}_i = \frac{w}{n} \sum_{j=\frac{n}{w}(i-1)+1}^{\frac{n}{w} i} z_j \tag{1}$$

where z_i represents the ith element of a time series Z of length n, and w represents the number of segments. This results in a reduction of data size from n to n/w data points. Once the time series data is transformed into PAA coefficients, symbolizing the PAA representation into a discrete string is the final stage. Considering the fact that normalized time series data follows a Gaussian distribution, the discretization phase allows symbolic representations of data to be obtained by mapping PAA coefficients to breakpoints that are produced according to the alphabet size a, which in turn determines equal-sized areas under a Gaussian curve. Table I shows the Gaussian breakpoints for values of alphabet size, a, from 3 to 10. The definition of breakpoints is given below.

Definition 1 (Breakpoints): Breakpoints are a sorted list of numbers B = β_1, . . . , β_{a−1} such that the area under a N(0,1) Gaussian curve from β_i to β_{i+1} is equal to 1/a, where β_0 and β_a are defined as −∞ and ∞, respectively.
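Since the breakpoints in Definition 1 depend only on the alphabet size, they can be regenerated from the inverse cumulative distribution function of the standard normal distribution. The following is a minimal sketch using only the Python standard library; the function name `gaussian_breakpoints` is an illustrative choice and is not taken from the CityPulse implementation.

```python
from statistics import NormalDist


def gaussian_breakpoints(alphabet_size):
    """Return the a-1 breakpoints that split N(0, 1) into a equiprobable regions."""
    if alphabet_size < 2:
        raise ValueError("alphabet_size must be at least 2")
    standard_normal = NormalDist(mu=0.0, sigma=1.0)
    # The i-th breakpoint is the quantile at probability i/a.
    return [standard_normal.inv_cdf(i / alphabet_size)
            for i in range(1, alphabet_size)]


if __name__ == "__main__":
    # Print the breakpoints for the alphabet sizes covered by Table I.
    for a in range(3, 11):
        print(a, [round(b, 2) for b in gaussian_breakpoints(a)])
```

For an alphabet size of 4, the rounded values are −0.67, 0, and 0.67, which are the breakpoints used in Example 1.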

The break lines are distributed vertically according to the Gaussian distribution; the first letter of the alphabet represents

8[Online]. Available: http://iot.ee.surrey.ac.uk/citypulse/ontologies/sao/saopy.html


Fig. 2. Example snapshot from CityPulse 3-D map, which illustrates the traffic events that are present on the travel route of a user in the City of Aarhus.


Fig. 3. Real time data obtained for average speed from a pair of sensor points are normalized, discretized by first obtaining a PAA approximation and then using predetermined breakpoints to map the PAA coefficients into SAX symbols. In the example above, with n = 144, w = 6, and a = 5, the time series is mapped to the word “bbdcddbdcabcbbcbcdbbdddddcc.” (a) Normalization. (b) PAA. (c) Discretization.

the smallest PAA coefficient, a, and the greatest PAA coef-ficient is represented by the last letter of the alphabet. Thedefinition to obtain the symbolic representation is given below.

Definition 2 (Word): A subsequence C of length n can be represented as a word Ĉ = ĉ_1, . . . , ĉ_w as follows. Let α_i denote the ith element of the alphabet, i.e., α_1 = a and α_2 = b. Then the mapping from a PAA approximation C̄ to a word Ĉ is obtained as follows:

$$\hat{c}_i = \alpha_j \iff \beta_{j-1} \le \bar{c}_i < \beta_j \tag{2}$$

Example 1: Let us assume that we have time series data, time-series (c) = {2, 3, 4.5, 7.6}. Following the steps given above, we apply z-transform and obtain time-series (z) = {−0.93, −0.52, 0.09, 1.36}. Here, we use SAX with a window size of “2” and an alphabet size of “4”: it leads to a set of PAA coefficients of {−0.72, 0.72}. Finally, we map the PAA coefficients into SAX symbols by using the breakpoints β_1 to β_{a−1} given in Table I, {−0.67, 0, 0.67}, and obtain the corresponding SAX word {ad}. Given that the first PAA coefficient is smaller than −0.67 and the second coefficient is greater than 0.67, the former is assigned to a and the latter to d.
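The three steps of Example 1 (normalization, PAA, discretization) can be sketched in a few lines of Python. This is an illustrative sketch, not the CityPulse code: the function name `sax_word` is hypothetical, and the use of the sample standard deviation is an assumption chosen because it reproduces the normalized values quoted in the example.

```python
from bisect import bisect_right
from statistics import NormalDist, mean, stdev


def sax_word(series, window_size, alphabet_size):
    """Map a numeric series to a SAX word (normalize, PAA, discretize)."""
    if len(series) % window_size != 0:
        raise ValueError("series length must be a multiple of window_size")
    # Step 1: normalize to zero mean and unit (sample) standard deviation.
    mu, sigma = mean(series), stdev(series)
    normalized = [(x - mu) / sigma for x in series]
    # Step 2: PAA, i.e., the mean of each consecutive window.
    paa = [mean(normalized[i:i + window_size])
           for i in range(0, len(normalized), window_size)]
    # Step 3: breakpoints of N(0, 1) for the chosen alphabet size.
    breakpoints = [NormalDist().inv_cdf(i / alphabet_size)
                   for i in range(1, alphabet_size)]
    # bisect_right counts the breakpoints <= value, which is the letter index.
    letters = "abcdefghijklmnopqrstuvwxyz"
    return "".join(letters[bisect_right(breakpoints, c)] for c in paa)


print(sax_word([2, 3, 4.5, 7.6], window_size=2, alphabet_size=4))  # ad
```

A constant series has zero standard deviation and would need special handling before the division in step 1; the sketch leaves that case out.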

1) Semantic Annotation of Aggregated Data: Using SAO, we can describe the time series data further to take into account aggregated features as well as temporal entities, such as particular segments of a data stream. Fig. 3 shows an exemplification of the creation of SAX patterns obtained from the real-time data (average speed from a pair of sensor points), which are initially normalized using z-transform [Fig. 3(a)], segmented using PAA [Fig. 3(b)], and mapped to SAX symbols in the discretization phase [Fig. 3(c)]. Listing 1 depicts an example of describing a PAA segment from the data stream presented in Fig. 3, which has been extracted by the SAX algorithm with an alphabet size of “5” and a segmentation size of “2928.” First, we identify a data stream, then we describe a timeline instance. This timeline is used to link the segment feature description with the time extent of a temporal entity representing the data stream. Thus, we can express stream data as a time interval on the universal timeline, and also relate such an interval to the corresponding interval on the discrete timeline. This particular segment represents 30-min time intervals and a window size of 6. The output of the discretization phase has also been provided with a SAX letter, c, for a PAA segment as well as a SAX word as an overall SAX output for the entire data stream using sao:value in the RDF document.

C. Quality Analysis

Due to the large amount of data, it is not feasible to determine the QoI with respect to the application that processes the data. Thus, we use the application-independent approach introduced in [15]. To realize an application-independent approach, a new ontology was developed to describe the quality of data sources. Some of the ambiguous definitions of quality concepts (as in Section II) required redefinition in the CityPulse


TABLE II: QUALITY METRICS CATEGORISATION AND DEFINITION

Listing 1. Summarized data expressed using the stream annotation ontology.

context. Table II lists the categories along with their definitions to understand their utilization in the framework and to avoid misunderstandings caused by different definitions in the referenced literature. It consists of typical quality categories like Accuracy or Timeliness. These categories may be further divided into subcategories (e.g., Completeness and Correctness in the category Accuracy). The quality value of an upper-level category is the combination of the contained lower-level QoI metrics. The intention is to determine the values for the submetrics by analyzing sensor observations and provide a top-level QoI metric as a tuple of lower-level categories. The Communication category contains attributes to describe QoS characteristics. In addition, the categories Cost and Security contain attributes that describe nonfunctional static characteristics of the streams important for the end user. Throughout this paper, we use the Completeness and Correctness subcategories to represent the Accuracy of data streams, and Frequency, Latency, and Age to represent the Timeliness of the data streams. The following list describes the upper-level categories, while the lower-level categories for each of the upper-level categories can be found in Table II.

1) Accuracy: Metrics in the Accuracy category describe the degree to which delivered information is correct, precise, and complete. To determine the Resolution, Deviation, Completeness, and Correctness, and to allow a comparison with geo-spatial related data streams, a registration of sensor characteristics is necessary.

2) Communication: As stated earlier, the Communication category contains attributes related to typical QoS parameters of a data stream. With this category, an application can receive information only from data sources that fulfil the QoS requirements, e.g., have only a small packet loss. Example subcategories are Bandwidth, Latency, or Throughput.

3) Cost: The Cost category is required to describe the monetary, energy, and network costs of a data stream. This will allow an application (depending


Fig. 4. Quality Analysis—General Approach.

on the user’s preferences) using the CityPulse framework to potentially find a tradeoff between cost-efficient and reliable data sources.

4) Security: This category contains metrics to describe the permission levels for provided data, e.g., if the data owner allows the data to be republished/distributed or prohibits any further usage. Furthermore, metrics to describe the encryption and data integrity through signing mechanisms are included in this category.

5) Timeliness: With the metrics in the Timeliness category, it is possible to select data streams by their update frequency, the amount of time information is valid (Volatility), and the timespan between the data measurements and the publication time (Frequency).

To use the full variety of information sources in city deployments, a vast number of heterogeneous interfaces and data formats have to be adapted. Furthermore, the varying information quality and uncertain reliability of multiple information providers have to be considered in the selection of the best available data stream. Therefore, the framework continuously monitors and calculates the QoI of incoming data streams and exploits this information to provide reliable applications.

1) General Approach: The quality analysis involves multiple quality metrics, which have been introduced above and in Table II. Fig. 4 shows the general approach in calculating the quality of a data source/stream for the update frequency of a simple temperature sensor. In the first step, a temperature sensor (temperature data stream) is registered at the framework. Within the registration, the frequency of the stream is annotated with “60 s,” which implies that the sensor should deliver an update every 60 s. The second part of the graphic shows the update mechanism. If the stream sends an update, the time difference between the current and the last update is calculated. If it fits the registered frequency (example a of Fig. 4), the frequency quality increases or stays at the maximum level. The second update (example b) is received with a time difference of 77 s. This denotes that the registered frequency for the stream update is not fulfilled. In this case, the frequency quality is decreased.

2) Correlating Information Sources: The goal of the quality analysis is to produce a metric which indicates the current performance of a sensor node compared with the promised quality metrics annotated during the registration process. This metric can be either the absolute value of a specific QoI metric or a normalized value. The advantages of using a normalized value are that 1) sensor nodes of different providers can be compared and 2) different QoI metrics can be combined to derive an overall reputation metric for a sensor node or network. To calculate normalized QoI values, the following algorithm was designed. The output of the algorithm is in the range between 0 and 1. The input is a discrete value and indicates whether the last observation update satisfies the annotated quality metric. The output is increased if the input was positive with respect to the output value range, and vice versa. The algorithm considers past behavior of the sensor node so that continuous reliability results in a higher output. The magnitude of a negative input decreases over time. At the same time, it allows the system to fully recover from negative inputs, such as sporadic outliers, after a series of observation updates. The algorithm requires a constant amount of memory and processing power.

In our quality analysis, we have modified the reward and punishment algorithm introduced in [54]. The witness (i.e., a sensor) agrees if the new value is within the upper and lower bounds. The quality value is computed based on the majority of agreed witnesses. In our approach, we combine the reward and the punishment equations into one equation. It uses a sliding window over the last inputs with window length W. The quality metric is calculated as follows:

$$q(t) = \left| q(t-1) - 2\, R_d(t) \right| \tag{3}$$

where q(t) is the quality metric at time t and q(t − 1) is the past quality metric. R_d(t) denotes the reward to be added for the current input. The reward is calculated as

$$R_d(t) = \frac{\alpha_{W-1}(t-1)}{W-1} - \frac{\alpha_{W-1}(t-1) + \alpha_{\text{current}}(t)}{W} \tag{4}$$

where α_{W−1} denotes the number of positive entries within the window and α_current ∈ {0, 1} is the current input.
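To make the update rule in (3) and (4) concrete, the following Python sketch tracks one normalized QoI value. Several details are assumptions made for illustration only: the class name `QualityTracker`, reading α_{W−1}(t−1) as the count of positive entries among the previous W − 1 inputs, the initial quality of 1, and a window that starts out filled with positive inputs.

```python
from collections import deque


class QualityTracker:
    """Sketch of Eqs. (3) and (4) for a single normalized QoI metric."""

    def __init__(self, window_length, initial_quality=1.0):
        if window_length < 2:
            raise ValueError("window_length W must be at least 2")
        self.w = window_length
        self.q = initial_quality
        # Assumption: the window initially holds W-1 positive inputs.
        self.history = deque([1] * (window_length - 1),
                             maxlen=window_length - 1)

    def update(self, satisfied):
        """Feed one binary input and return the new quality value q(t)."""
        current = 1 if satisfied else 0
        positives = sum(self.history)  # alpha_{W-1}(t-1)
        # Eq. (4): reward derived from the window and the current input.
        reward = positives / (self.w - 1) - (positives + current) / self.w
        # Eq. (3): combined reward/punishment update.
        self.q = abs(self.q - 2 * reward)
        self.history.append(current)
        return self.q


if __name__ == "__main__":
    tracker = QualityTracker(window_length=3)
    # Three missed updates followed by three updates that arrive on time.
    for ok in [False, False, False, True, True, True]:
        print(round(tracker.update(ok), 3))
```

The binary input could come from the check described in the general approach, e.g., an update received after 77 s against a registered frequency of 60 s would count as a negative input. Under these assumptions the value falls to 0 after repeated negative inputs and returns to 1 after a run of positive ones, and the state is limited to W − 1 stored inputs.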

D. Semantic Sensors Stream Discovery and Integration

In this section, the ontology used for describing event services and event requests is presented, and the discovery and integration mechanisms for the sensor data streams are discussed. Notably, the discovery task given in Section III-D1 should not be confused with the integration task described in Section III-D2: the former matches the most fine-grained and atomic service requests and responses, and the latter deals with service compositions.

1) Sensors Streams Discovery: The task of sensor stream discovery is to find candidate sensor services based on sensor service descriptions and request specifications. A sensor stream is an atomic unit in IoT stream discovery and integration. It is described both as a PrimitiveEventService in the CES ontology and as a Sensor device in the SSN ontology. The CES ontology is mainly used to describe the nonfunctional aspects of sensor service requests/descriptions, including sensor event types, quality parameters, and sensor service groundings. The SSN ontology is used to describe the functional aspects, including ObservedProperties and FeatureOfInterest.

A sensor service description is denoted as sd = (td, g, qd, Pd, FoId, fd), where td is the sensor event type, g is the service grounding, qd is a QoS vector describing the QoS values, Pd is the set of ObservedProperties, FoId is the set of FeatureOfInterests, and fd : Pd → FoId is a function correlating observed properties with their features-of-interest.


Listing 2. Traffic sensor service description

Similarly, a sensor service request is denoted as

sr = (tr, qr, Pr, FoIr, fr, pref, C).

Compared to sd, sr does not specify a service grounding; qr represents the constraints over QoS metrics, pref represents the QoS weight vector specifying users’ preferences on QoS metrics, and C is a set of functional constraints on the values of Pr. sd is considered a match for sr iff all of the following three conditions are true.

1) tr subsumes td.
2) qd satisfies qr.
3) ∀p1 ∈ Pr, ∃p2 ∈ Pd ⟹ T(p1) subsumes p2 ∧ fr(p1) = fd(p2), where T(p) gives the most specific type of p in a property taxonomy.
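The three matching conditions can be illustrated with a small Python sketch. It is not the CityPulse matchmaker: the class and function names are hypothetical, the taxonomy is a toy child-to-parent dictionary standing in for ontology reasoning, properties are identified directly by their type names, and QoS constraints are modelled as lower bounds, which is an assumption.

```python
from dataclasses import dataclass

# Toy taxonomy (child -> parent); a real system would query the ontology.
PARENT = {"ces:AverageSpeed": "ces:Speed"}


def subsumes(general, specific):
    """True if `general` equals `specific` or is one of its ancestors."""
    node = specific
    while node is not None:
        if node == general:
            return True
        node = PARENT.get(node)
    return False


@dataclass
class ServiceDescription:
    event_type: str  # td
    qos: dict        # qd: metric name -> offered value
    foi_of: dict     # fd: observed property type -> feature of interest


@dataclass
class ServiceRequest:
    event_type: str  # tr
    qos_min: dict    # qr: metric name -> minimum acceptable value (assumed form)
    foi_of: dict     # fr: requested property type -> feature of interest


def is_match(sd, sr):
    """Check the three matching conditions of a description against a request."""
    # 1) The requested event type subsumes the described event type.
    if not subsumes(sr.event_type, sd.event_type):
        return False
    # 2) Every QoS constraint is met (a missing metric counts as unmet).
    for metric, bound in sr.qos_min.items():
        if sd.qos.get(metric, float("-inf")) < bound:
            return False
    # 3) Each requested property is covered by a described property
    #    that refers to the same feature of interest.
    return all(
        any(subsumes(p1, p2) and foi1 == foi2
            for p2, foi2 in sd.foi_of.items())
        for p1, foi1 in sr.foi_of.items()
    )
```

With this sketch, a description offering ces:AverageSpeed on :FoI_3 matches a request for ces:Speed on :FoI_3, but not the same request on a different feature of interest.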

Listing 2 shows a snippet of a traffic sensor description in Turtle syntax. The traffic sensor monitors the estimated travel time, vehicle count, and average vehicle speed on a road segment (annotated as observed properties). As a sensor service, it presents a service profile (:sampleProfile) that describes the service category (ces:TrafficReportService) for discovery and supports a service grounding (:sampleGrounding) for invocation. Listing 3 shows a snippet of a sensor service request matched by the traffic sensor service. Compared to Listing 2, the request does not contain a grounding. The property ces:EstimatedTime is requested on the feature-of-interest :FoI_3. Also, the requested service type (ces:TrafficReportService) is an exact match for the service type in Listing 2, making the service described in Listing 2 a matching service candidate for the request. When the discovery component finds all candidate services suitable for the request, a simple-additive-weighting algorithm [55] is used to rank the service candidates based on qd, qr, and pref.
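As a rough illustration of the ranking step, the sketch below applies a generic simple-additive-weighting scheme: each QoS metric is min-max normalized across the candidates and the normalized values are combined using the preference weights. It is not the algorithm of [55]; the function name `saw_rank` is hypothetical, all metrics are assumed to be "higher is better", and the constraints qr are assumed to have been applied already during matching.

```python
def saw_rank(candidates, weights):
    """Rank candidate services by a weighted sum of normalized QoS values.

    candidates: dict mapping service name -> {metric: value}
    weights:    dict mapping metric -> preference weight (pref)
    """
    metrics = list(weights)
    lowest = {m: min(q[m] for q in candidates.values()) for m in metrics}
    highest = {m: max(q[m] for q in candidates.values()) for m in metrics}

    def score(qos):
        total = 0.0
        for m in metrics:
            span = highest[m] - lowest[m]
            # If all candidates tie on a metric, give each the full value.
            normalized = 1.0 if span == 0 else (qos[m] - lowest[m]) / span
            total += weights[m] * normalized
        return total

    return sorted(candidates, key=lambda name: score(candidates[name]),
                  reverse=True)


if __name__ == "__main__":
    services = {
        "s1": {"accuracy": 0.90, "availability": 0.99},
        "s2": {"accuracy": 0.95, "availability": 0.90},
    }
    print(saw_rank(services, {"accuracy": 0.7, "availability": 0.3}))
```

Metrics where lower values are preferable (e.g., latency or cost) would need to be inverted before normalization; that step is omitted here.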

As an alternative to the above discovery approach implemented by a middleware that parses the requests and service descriptions, a user can also perform some basic discovery functions using the simple protocol and RDF query language (SPARQL).9

Listing 5 shows a sample SPARQL query that identifies sensor services by querying the properties they measure. The query results shall contain the service in Listing 2,

9[Online]. Available: http://www.w3.org/TR/rdf-sparql-protocol

Listing 3. Traffic sensor service request.

Listing 4. Complex event service request.

Listing 5. Simple discovery via SPARQL.

Listing 6. Entail subpatterns via RDF containers.

if the ces:AverageSpeed is annotated as a subclass of ces:Speed.

2) Sensors Streams Integration: Sensor stream discovery deals only with primitive event service discovery. To discover and integrate (composite) sensor streams for complex event service requests, the event patterns specified in the complex event service requests/descriptions need to be considered.

We insert the rule in Listing 6 into the complex event pattern ontology, to allow reasoners to entail subpattern relationships. Notice that rdfs:member is the super-property for the container membership properties (i.e., rdf:_1, rdf:_2, ...) in RDF Schema version 1.1.


Listing 7. Tracking pattern provenance via SPARQL.

In the context of integrated sensor stream discovery and composition, the definition of sensor stream description is extended to denote composite sensor stream descriptions Sd = (epd, Qd, G), where epd consists of a set of (primitive or composite) sensor stream descriptions and a set of event operators including Sequence, Repetition, And, Or, Selection, Filter, and Window, Qd is the aggregated QoS metrics for Sd, and G is the grounding for the composite sensor stream. Similarly, a complex event service request is denoted as Sr = (epr, Qr, pref), where epr is a canonical event pattern consisting of a set of primitive sensor service requests and a set of event operators, Qr describes the QoS constraints for the requested complex event service, and pref specifies the weights on QoS metrics.

An Sd is a match for Sr iff epd is semantically equivalent to epr and Qd satisfies Qr. When no matches are found during the discovery process for Sr, it is necessary to compose Sr with a set of primitive or composite sensor streams which are reusable to Sr. Informally, these (composite) sensor streams describe part of the semantics of epr and can be reused to create a composition plan, which contains an event pattern with concrete service bindings. The composition plan can be used as a part of the event service description for the composed event service. The discovery or composition results can be ranked with respect to the QoS metrics and preferences in the same way as sensor stream discovery. We refer readers to [53] and [55] for detailed definitions of concepts related to event patterns as well as algorithms to perform an efficient pattern-based and QoS-aware event service discovery and composition. Listing 4 shows a snippet of a sample complex event service request with an event pattern and some NFP constraints. The requested pattern is a conjunctive (ces:And) correlation between four requested primitive events (e.g., :locationRequest, :seg1CongestionRequest, etc.). The QoS constraints in the request ask for the ces:Availability and ces:Accuracy to be above 90%.

Leveraging the subpattern property in the CES ontology, one can query the provenance relations (within the same EventProfile) specified in composition plans (as well as other event patterns) using the query specified in Listing 7, with OWL reasoners (augmented with the rule in Listing 6). In order to track provenance relations among different event profiles, additional rules in Listing 8 must be used.

IV. EVALUATIONS

To evaluate the system, we examine each component using the dataset that has been published by the City of Aarhus. It contains various sensor data including 449 pairs of traffic

Listing 8. Entail subpatterns among different event services.

sensors, and one weather sensor. We use these real sensors as the basis of our experiment dataset. We also simulate 449 air pollution sensors, each of which is hypothetically deployed next to a traffic sensor to report the air pollution index at that location. In addition to the ODAA datasets, we also use the Here Traffic10 dataset provided by Nokia11 for analysis of QoI. The evaluations are organized in three parts as follows.

1) Data Aggregation: We evaluate the performance of the SAX algorithm on time series traffic data obtained from the City of Aarhus. We examine the performance difference between the virtualized/annotated raw data observations and the results obtained with SAX using various segmentation sizes. We evaluate our system based on two main criteria: a) data reduction and b) data reconstruction error rate.

2) Quality Analysis: The evaluations involve two parts: a) single data stream and b) multiple dependent data streams. In the first part, we provide a comprehensive technical analysis of the integrity of a single data stream; in the second, we evaluate our system with a multiple dependent data approach, comparing ODAA data streams against the Here Traffic dataset to analyze the incidents that occurred in the City of Aarhus.

3) Semantic Discovery and Integration of Data Streams: We create different user requests about traffic monitoring on different routes in Aarhus city and use the discovery and composition algorithms to find the optimal combination of sensors to fulfil the requests. We test the performance of the sensor discovery and quality-aware composition algorithms in terms of execution time and the quality of composition results. After transforming these composition results into complex event processing queries, we test the performance of evaluating the queries based on raw and aggregated data streams.

A. Smart City Traffic Management Scenario

The City of Aarhus has deployed a set of traffic sensors on the streets. These sensors are paired as start nodes and end nodes. Each pair is capable of monitoring the average vehicle speed v and vehicle count n on a street segment (from the start node to the end node). Combined with the distance d between the two sensors, the estimated travel time t = d/v and congestion level c = n/d can be easily derived. In addition to traffic sensors, there are also sensors, such as weather sensors and air pollution sensors, deployed in the city, and a user can plan his/her travel taking different physical measurements into consideration. For example, some may be traveling on a tight

10[Online]. Available: https://www.here.com/
11[Online]. Available: http://nokia.com


Fig. 5. Summary of the evaluation results for the average reconstruction rate of SAX over different segment sizes depicted in 5(a), and the data size of the aggregated data with various segmentation sizes after RDF serialization in 5(b). The bars refer to the following evaluation metrics: Euclidean distance between original and reconstructed data and average data size in MB based on segmentation with every 30 min (S30m), segmentation with every hour (S1h), segmentation with every 2 h (S2h), and segmentation with every 4 h (S4h). (a) Data reconstruction error rate. (b) Data size.

schedule so they need the route with the lowest estimated travel time, while some others may be cycling and they need routes with fewer cars and better air quality. Besides the functional requirements, different users may have different nonfunctional requirements as well.
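The two derived quantities of the scenario, the estimated travel time t = d/v and the congestion level c = n/d, can be written as a short Python sketch. The units (kilometres, kilometres per hour) and the function names are assumptions for illustration; the text does not state the units used in the dataset.

```python
def estimated_travel_time(distance_km, avg_speed_kmh):
    """t = d / v, in hours under the assumed units."""
    if avg_speed_kmh <= 0:
        raise ValueError("average speed must be positive")
    return distance_km / avg_speed_kmh


def congestion_level(vehicle_count, distance_km):
    """c = n / d, in vehicles per kilometre under the assumed units."""
    if distance_km <= 0:
        raise ValueError("distance must be positive")
    return vehicle_count / distance_km


if __name__ == "__main__":
    # Hypothetical readings for one sensor pair: 1.5 km apart, 30 km/h, 12 vehicles.
    print(estimated_travel_time(1.5, 30.0) * 60, "minutes")
    print(congestion_level(12, 1.5), "vehicles per km")
```

A reported average speed of zero (for example, no vehicles during the interval) leaves the travel time undefined, so the sketch rejects it explicitly.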

B. Data Aggregation

Prior to presenting the evaluation of the data discovery and integration, we present the evaluation of our approach for data aggregation. To examine the performance of the aggregation method, we applied four different segmentation sizes, namely 30 min (S30m), 1 h (S1h), 2 h (S2h), and 4 h (S4h), to divide the dataset into frames with an alphabet size of 5. We measured the average reconstruction rate using the Euclidean distance between original and reconstructed data, and examined whether or not there was a significant difference by using an analysis of variance test. Afterward, we calculated the data size and analyzed the effects on the RDF representations of the raw and aggregated sensor observations. The evaluations were performed on a personal computer (PC) running the Ubuntu 14.04 operating system with an Intel Core i5-3470 3.2 GHz processor and 8 GB RAM. The performance of the system based on different segmentation sizes is reported in Fig. 5.
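One way to read the reconstruction measure is sketched below in Python. The text does not specify how the series is reconstructed or whether the distance is normalized, so the sketch assumes the simplest option: every point in a window is replaced by the window mean, and the plain Euclidean distance to the original series is reported. The function name is hypothetical.

```python
import math
from statistics import mean


def paa_reconstruction_error(series, window_size):
    """Euclidean distance between a series and its window-mean reconstruction."""
    if len(series) % window_size != 0:
        raise ValueError("series length must be a multiple of window_size")
    reconstructed = []
    for start in range(0, len(series), window_size):
        window = series[start:start + window_size]
        # Assumption: each point is reconstructed as the mean of its window.
        reconstructed.extend([mean(window)] * window_size)
    return math.dist(series, reconstructed)


if __name__ == "__main__":
    speeds = [42, 40, 38, 25, 20, 22, 35, 41]  # hypothetical readings
    for size in (1, 2, 4, 8):
        print(size, round(paa_reconstruction_error(speeds, size), 2))
```

With a window of one point the error is zero, and it grows as windows get longer, which is the tradeoff against data size that the evaluation quantifies.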

Results show that the lowest data reconstruction error was obtained for S30m, where the error for average speed was in the range 11.2 ≤ S30m ≤ 12.9, and for vehicle count in the range 2.4 ≤ S30m ≤ 3.4. For S1h, there was an increase in reconstruction error rate (17.7 ≤ S1h ≤ 19.5 for average speed, and 4.0 ≤ S1h ≤ 5.0 for vehicle count). It is worth pointing out that there was a (highly significant) difference between S30m and S1h (p ≤ 0.05). Subsequently, highly significant differences were found between S1h and S2h (p ≤ 0.001), and S2h and S4h (p ≤ 0.001) in terms of error rate. Moreover, the error rate continued to increase with the segment size. For S2h, the error rate was between 27.2 ≤ S2h ≤ 29.0 for average speed, and 6.6 ≤ S2h ≤ 7.6 for vehicle count, and was highest for S4h, 39.7 ≤ S4h ≤ 41.3 for average speed, and 9.9 ≤ S4h ≤ 10.9 for vehicle count.

In contrast, the increase in segmentation size had a different effect on the data size. Although the quality of the aggregated data decreased with increasing segmentation size, there was a (highly significant) data reduction for the annotated data streams. While the annotated raw data was initially between 60.6 MB ≤ RawData ≤ 60.9 MB, with the segmentation of S30m the data size (significantly) dropped to 3.3 MB ≤ S30m ≤ 3.5 MB (p ≤ 0.001). There were also (highly significant) differences for all other parameters (p ≤ 0.001). As a result, the data size was reduced to 1.8 MB ≤ S1h ≤ 2.1 MB for S1h, 0.9 MB ≤ S2h ≤ 1.2 MB for S2h, and 0.5 MB ≤ S4h ≤ 0.7 MB for S4h. Overall, data size was reduced in the range of 94.24%–99.8% depending on segmentation size.

Consequently, we found that segmentation size has a (highly significant) impact on the data size and the quality of the data reconstruction for time series data. However, since a change in segmentation size has opposite effects on the data size and the reconstruction error rate, the selection of the right parameters plays an important role as a tradeoff. It is also worth pointing out that our system has to operate in real time, and our aim is not only to obtain good performance in terms of minimum reconstruction error and data size, but also to execute complex event processing on top of the obtained outputs once the data streams are aggregated. Therefore, it was interesting to see that although there was a significant difference between segmentation sizes, there was no significant difference in the effect of the data aggregation results on the complex event processing, as we will detail in Fig. 13(b) in Section IV-D.

C. Quality of Information—Analysis

This section presents experimental results regarding the quality of sensor observations in the City of Aarhus (Denmark). The evaluation includes both single and multiple data stream sources describing the city’s traffic flow. In the evaluation, we took into account subcategories of the two main categories (accuracy and timeliness) in the QoI description. Rated values mentioned in this section refer to the output of the reward and punishment algorithm introduced in Section II. The following evaluations were performed on


Fig. 6. Distribution of QoI values for the QoI Frequency metric.

a PC with a Core2 Duo CPU 2.80 GHz and 4 GB memory running an Ubuntu 14.04 operating system.

1) Quality Evaluation of Single Data Streams: In the experiments, we used the following subset of QoI metrics, which are introduced in Table II and calculated as follows in the context of Aarhus traffic data.

a) Completeness (Accuracy): Each observation within the data stream is composed of a set of features. An observation is considered complete if all features are present and contain valid values. The validity of a feature has to be defined depending on the context. For example, is an empty string a valid attribute, or does it indicate the absence of one? The Aarhus traffic sensor network in our experiment used a Web service as a bridge to provide access to the sensor network. Direct access to the sensor nodes was not possible. Since the Web service always delivered the last observations, completeness in our experiment was always 100%, unless the Web service itself was not reachable. The omitted observations only affected the frequency metric.

b) Frequency (Timeliness): We calculated just over 4 million frequency QoI annotations for 1 month of traffic data streams. Fig. 6 shows the distribution of QoI values annotated during the experiment. As shown in the figure, about 75% have been rated with 0.95 or higher and almost 95% have been rated 0.8 or higher. Fig. 7 shows that about 50 of the 449 sensor nodes have an average frequency QoI value of 0.9 or less, but still over 0.85. About 290 sensor nodes, more than half of the sensor nodes observed during the experiment, could reach an average frequency QoI of 0.95 or higher.

c) Correctness (Accuracy): It is well known in the meteorology community that measuring real-world physical values is very challenging. Many error sources, such as noise, drift, and aging of the sensing equipment, influence the measurements in a random and unpredictable way. Depending on the available information, different strategies to compensate for those influences can be applied. Investigating a single sensor observation limits the validation to checking for possible and probable value ranges. A series of observations of a single sensor allows the system to check if the sensor got stuck at a certain value or to detect outliers (such as sudden spikes). In case of multiple available sensors, a common strategy is to use redundant sensors, measuring the same or a correlated value, and to perform validations between the sensors. In our experiment, the three nearest neighboring sensors are used to evaluate the correctness of each observation. Fig. 8 depicts the distribution of annotated QoI values for the correctness metric.

Fig. 7. Average for the QoI metric and how many sensor nodes could reach at least this average.

Fig. 8. Distribution of QoI values for the QoI metric correctness.
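The three validation strategies named above (range check on a single value, stuck-sensor check on a series, and comparison against neighboring sensors) can be sketched as follows. This is only an illustration: the text does not state how the three nearest neighbors are combined, so the median comparison, the `tolerance` parameter, and the `window` length are assumptions.

```python
from statistics import median
from typing import Sequence


def in_range(value: float, low: float, high: float) -> bool:
    """Single-observation check: is the value within a plausible range?"""
    return low <= value <= high


def is_stuck(series: Sequence[float], window: int = 5) -> bool:
    """Series check: True if the last `window` readings are all identical."""
    tail = series[-window:]
    return len(tail) == window and len(set(tail)) == 1


def agrees_with_neighbors(
    value: float, neighbor_values: Sequence[float], tolerance: float
) -> bool:
    """Multi-sensor check (assumed rule): compare against the neighbors' median."""
    return abs(value - median(neighbor_values)) <= tolerance
```

For example, a speed reading could be passed to `agrees_with_neighbors` together with the latest readings of its three nearest sensors; the median is used here because it is less sensitive to one faulty neighbor than the mean.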

2) Quality Evaluation of Multiple Dependent Data Streams: The evaluation of single data streams enables a comprehensive technical analysis of the integrity of the stream, whereas the individual data values cannot be verified if there are no directly comparable data streams available. The utilization of corresponding high-level information (events) published by another issuer through a distinct data stream enables discovery of consistent and contradictory information. To evaluate the average speed and vehicle count data streams, which are published via the ODAA platform, the traffic incident information¹² for congestion and closed roads has been investigated to get a confirmation of reported traffic events. The timespan of the incident is therefore considered as a period in which the traffic flow should change. The incident data set describes the start time $t_{Istart}$ and the end time $t_{Iend}$ of the congestion incident, which define the incident period $p_I = [t_{Istart}, t_{Iend}]$ with the duration $d_I$ of the incident. To evaluate the dependency between the streams, the time series of the average speed is investigated for $p_{Before} = [t_{Istart} - d_I, t_{Istart})$ before and for $p_{After} = (t_{Iend}, t_{Iend} + d_I)$ after the congestion period $p_I$. To analyze the trend of the time series independently of daily repeating traffic patterns, the seasonal components were removed. The quartiles Q1, Q2, and Q3 [the 25th, 50th (median), and 75th percentiles] of $p_I$ are then compared to those of $p_{Before}$ and $p_{After}$ (see Fig. 9). If the majority of quartiles shows an increase of the average speed from $p_{Before}$ to $p_I$ and a decrease from $p_I$ to $p_{After}$, we detect a correlation in the detected events of our data streams, which increases their reputation and consolidates their correctness level. The comparison of 25 distinct traffic incidents against the ODAA data shows only a 76% confirmation of the traffic incident reports using the raw data stream. Taking into account the results of the frequency QoI analysis (see Section IV-C1), missing data can be identified as a cause of this low confirmation rate. By ignoring ODAA data streams that had noticeable faults in the single stream quality analysis, the confirmation of reported events increases to 83%.

¹² Nokia Here traffic.

Fig. 9. Comparing average speed before, during, and after a traffic incident.
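The quartile comparison across the three periods can be sketched in Python as below. This is a minimal illustration under stated assumptions: the inputs are taken to be already deseasonalized speed samples for the periods before, during, and after the incident, "majority" is read as at least two of the three quartiles, and the expected direction of change is exposed as the hypothetical `direction` parameter (+1 reproduces the rule as worded in the text).

```python
from statistics import quantiles
from typing import Sequence, Tuple


def quartiles(values: Sequence[float]) -> Tuple[float, float, float]:
    """Return (Q1, Q2, Q3) of a sample with at least two values."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return q1, q2, q3


def majority_shift(a: Sequence[float], b: Sequence[float], direction: int) -> bool:
    """True if at least two quartiles of b moved in `direction` relative to a."""
    votes = sum(
        1 for qa, qb in zip(quartiles(a), quartiles(b)) if (qb - qa) * direction > 0
    )
    return votes >= 2


def incident_confirmed(
    before: Sequence[float],
    during: Sequence[float],
    after: Sequence[float],
    direction: int = 1,
) -> bool:
    """Confirm an incident if the shift into the period is reversed after it."""
    return majority_shift(before, during, direction) and majority_shift(
        during, after, -direction
    )
```

The confirmation rate reported in the text would then correspond to the share of incidents for which such a check returns `True`.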

D. Complex Event Service Composition and Query Processing

In this section, we evaluate the performance of the event service composition algorithms using service descriptions annotated with the CES ontology and the quality ontology used in the semantic annotation. We also test the complex event processing based on primitive sensor observations as well as the summarized observations produced by the data aggregation component described in Section III-B. In the following, we first present the use case scenario designed for the experiments, followed by a description of the datasets used and finally the results and a discussion of the results.

1) Performances of QoS-Aware Complex Event Service Composition: To test the performance of the system, we create three user requests on the average vehicle speed and vehicle counts of the routes in Aarhus, as shown in Fig. 10. Q1 is a short path of 1.5 km, Q2 is a medium path of 6.5 km, and Q3 is a long path of 12.1 km. In the following, we test the performance of the sensor stream discovery and integration algorithms with regard to these queries. All experiments are carried out on a MacBook Pro with a 2.53 GHz dual-core CPU and 4 GB of 1067 MHz memory; prototypes are developed using JDK 1.7.

TABLE III SIMULATED SENSOR REPOSITORIES

The discovery and composition algorithm finds 2, 5, and 10 traffic sensors relevant for Q1, Q2, and Q3, respectively. To test the algorithm on a larger scale, we further increase the size of the sensor repositories by creating N functionally equivalent dummy sensors with random NFPs for each sensor in R0 (i.e., the original sensor repository with 899 sensors), resulting in nine different sensor repositories, as listed in Table III.

The results in Fig. 11(a) indicate that the composition time of a brute-force (BF) enumeration grows exponentially with regard to the complexity of the query (number of sensors involved) and to the number of sensors in the repository. When we apply the GAs¹³ over Q3 on R2, we save approximately 80% of the execution time. However, for Q1 and Q2, the solution space is too small for the GA evolution. Fig. 11(b) shows the composition time of the GA approach for Q2 and Q3 over R3 to R9. The results indicate that the GA composition time grows almost linearly with regard to the size of the repository. To further analyze the performance of the GA, we calculate the aggregated QoS utility for the composition results derived by GA. The detailed calculation of the utility is described in [55]; a higher utility means a better composition plan according to user-defined QoS constraints and preferences. Fig. 12(a) shows the QoS utility derived for Q2 over R3 to R9. The QoS utility derived by GA is compared to the maximum and minimum utility derived from BF enumeration. To better understand the effectiveness of GA we calculate a q-score

\[
\text{q-score} = \frac{u - u_{\min}}{u_{\max} - u_{\min}} \tag{5}
\]

where $u$ is the QoS utility for the GA results, and $u_{\max}$ and $u_{\min}$ are the maximum and minimum QoS utility of the repository, respectively.
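As a small illustration of (5), the q-score is a min-max normalization of the GA utility between the worst and best utilities found by BF enumeration. The function below is a sketch; the name `q_score` and the handling of the degenerate case where both bounds coincide are assumptions, and the numbers in the usage note are made up.

```python
def q_score(u: float, u_min: float, u_max: float) -> float:
    """Normalize a QoS utility u into [0, 1] relative to the BF bounds, as in (5)."""
    if u_max == u_min:
        # Assumed behavior: the score is undefined when all plans have equal utility.
        raise ValueError("u_max and u_min must differ")
    return (u - u_min) / (u_max - u_min)
```

With hypothetical utilities `u = 0.8`, `u_min = 0.2`, and `u_max = 1.0`, the call `q_score(0.8, 0.2, 1.0)` gives approximately 0.75, i.e., the GA plan reaches about 75% of the achievable utility range.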

¹³ GA parameters: crossover rate = 95%, mutation rate = 5%, initial population size = 200, selection policy: Roulette Wheel + Elitism.

Fig. 10. Different route queries in Aarhus, captured from Google Maps. (a) Q1. (b) Q2. (c) Q3.

Fig. 11. Performance of GAs: execution time. (a) BF versus GA. (b) GA execution time over different repositories.

Fig. 12. Performance of GAs: quality of results. (a) QoS utility of GA results. (b) GA cost effectiveness.

Fig. 12(b) shows the q-score of Q2 over R3 to R9. The results indicate that GA (with the aforementioned parameters) can produce 66%–99% optimal QoS utility for Q2. The results also suggest that GA gives worse results over larger repositories, because the larger the repository, the less likely the best results will be generated. However, this does not mean GA is more suitable for small repositories. To explore whether using GA is cost-effective, we observe the execution time of GA and BF for Q2 over R3 to R9 and calculate a cost-effectiveness score, the c-score

\[
\text{c-score} = \text{q-score} \times \frac{t_{bf}}{t_{ga}} \tag{6}
\]

where $t_{ga}$ and $t_{bf}$ are the execution times of the GA and BF algorithms, respectively. In this equation, we take the q-score as the gain and the time ratio of $t_{ga}$ to $t_{bf}$ as the cost, and hence calculate the cost-effectiveness.
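A short sketch of (6) may help in reading the score: a c-score of 1 means GA and BF are equally cost-effective, and values above 1 favor GA. The function name `c_score` and the input validation are assumptions, and the example values in the usage note are hypothetical rather than measurements from the experiments.

```python
def c_score(q: float, t_bf: float, t_ga: float) -> float:
    """Cost-effectiveness of GA relative to BF, as in (6).

    q is the q-score (gain); t_bf and t_ga are execution times in the same unit.
    """
    if t_bf <= 0 or t_ga <= 0:
        raise ValueError("execution times must be positive")
    return q * t_bf / t_ga
```

For instance, with a hypothetical q-score of 0.7 and a GA run that is ten times faster than BF (`t_bf = 100.0`, `t_ga = 10.0`), the c-score is about 7; with equal run times it reduces to the q-score itself and therefore cannot exceed 1.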

Fig. 13. CEP performance over primitive and aggregated data. (a) Input and output sizes. (b) CEP processing delays.

Listing 9. C-SPARQL query for Q3.

The results in Fig. 12(b) show that the c-score increases when GA is applied to larger repositories, despite the fact that the q-score decreases. This is because the time saved by GA (i.e., the cost reduced) is greatly increased even though it is slightly more difficult to get good composition results in larger repositories using GA. For R3 and R4 the c-score is below 1, suggesting that GA is less cost-effective than BF over these two repositories. However, as the repository size gets bigger, the c-score increases to above 5, indicating that on R9 GA is five times as cost-effective as BF.

2) Performances of Complex Event Processing Over Primitive and Aggregated Observations: Once we have the event service composition plans, we transform them into stream queries and deploy the queries on stream engines to detect complex events using the algorithms described in [56]. In this section, we test the performance of complex event processing when using raw and aggregated data streams. The stream engine used is C-SPARQL [57]. Listing 9 shows a snippet of the C-SPARQL query transformed from Q3.

We collected the traffic data (i.e., raw and aggregated) for the ten sensors involved in Q3 over one day and replayed them. The replayed streams are fed to the C-SPARQL engine. The original traffic data streams from ODAA create one report, containing one observation for average speed and one observation for vehicle count, every 5 min. The aggregated traffic data are abstracted out of 6, 12, 24, and 48 ODAA traffic reports, i.e., updated every 30, 60, 120, and 240 min (the same segmentation sizes as in Section III-B). In our experiments, we test the number of messages consumed and produced by the C-SPARQL engine as well as the processing delay. To see how the C-SPARQL engine performs with higher throughput, we accelerate the update frequency 150 times. Under this accelerated rate, the raw and aggregated data report every 2 to 96 s, respectively. As a result, the input rate of the C-SPARQL engine varies from 40 to 0.83 triples per second in our experiments.¹⁴ Fig. 13(a) shows the amount of inputs and outputs and Fig. 13(b) shows the average processing delay over time.
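The replay intervals and input rates quoted above can be reproduced with a few lines of arithmetic. The sketch below assumes that each replayed report carries 10 sensors × 2 observations × 4 triples = 80 triples; this decomposition is inferred from the ten sensors in Q3, the two observations per report, and the four triples per observation mentioned in footnote 14, and is not stated as a single figure in the text.

```python
SPEEDUP = 150            # acceleration factor applied to the update frequency
SENSORS = 10             # sensors involved in Q3
OBS_PER_REPORT = 2       # average speed and vehicle count
TRIPLES_PER_OBS = 4      # triples per streamed observation (footnote 14)


def replay_interval_s(update_minutes: float) -> float:
    """Seconds between reports after accelerating the stream."""
    return update_minutes * 60 / SPEEDUP


def triples_per_second(update_minutes: float) -> float:
    """Input rate seen by the stream engine for a given update period."""
    triples_per_round = SENSORS * OBS_PER_REPORT * TRIPLES_PER_OBS
    return triples_per_round / replay_interval_s(update_minutes)


for minutes in (5, 30, 60, 120, 240):
    print(minutes, replay_interval_s(minutes), round(triples_per_second(minutes), 2))
```

The raw 5-min stream maps to a 2 s interval and 40 triples per second, and the 240-min aggregate maps to a 96 s interval and about 0.83 triples per second, matching the range given in the text.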

From the results in Fig. 13(a), we can see that using data aggregated over 30 min reduces the data consumed by 80% and the results produced by 63% compared to using raw data, and the I/O traffic decreases linearly with the segmentation size. The results in Fig. 13(b) show that the processing delay decreases over the experiment time (C-SPARQL engine warming up), and for both the raw and aggregated datasets, the delay decreased to less than 1000 ms within 125 s of experiment time. The results also show that the delay decreases more slowly when using aggregated data with a larger segmentation size. If we extend the experiment until all processing delays have stabilized, the differences between the delays are small, i.e., they all converge to about 600 ms (±50 ms). However, when we further accelerate the raw data stream frequency to 1 report per second (i.e., 80 triples per second, relatively high for the large query with many joins in Listing 9), the C-SPARQL engine becomes unstable, the processing delay increases to up to 4000 ms after a while, and then the engine stops giving query results.

V. DISCUSSION AND CONCLUSION

In this paper, we examined large-scale stream processing including data aggregation, quality analysis, as well as semantic data discovery, integration, and complex event processing issues in the domain of smart city applications. We showed how techniques such as SAX can be used to reduce the size of data and presented different techniques for accessing and processing semantically annotated data streams.

¹⁴ Every observation is annotated with 4 triples for streaming, fewer than those annotated in Listing 1 because some triples there are only useful for historical analysis.

A. Data Aggregation/Reduction Without Losing Essential Information

Many time-series analysis approaches have been introduced within the last decades. The key issues are to keep a good-quality synopsis of the data and to provide lower bounding support for the reduced data. The SAX algorithm provides a considerable data reduction with accurate correlation between the original time series and the symbolic representation, while supporting the lower bounding principle. Using the SAX algorithm in our experiments, we obtained significantly reduced data for each sensor’s observations, which have an average data size of 60.75 MB, reducing the data content by between 94.24% and 99.18% depending on the segmentation size. On the other hand, it was also interesting to see how a change in segmentation size significantly and unfavorably affects the quality of the time series data, causing an overall error in the range of 11.2 to 41.3 km/h for average speed and 4.0 to 10.9 vehicles for vehicle count. Overall, the results lead to the conclusion that S30m, which provided the minimum error and made no large difference to the complex event processing, is an ideal segmentation size for our framework.
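To make the reduction step concrete, the following is a minimal, textbook-style SAX sketch in Python: z-normalize the series, average it over equal-width segments (PAA), and map each segment mean to a symbol using breakpoints that split the standard normal distribution into equiprobable regions. It is not the framework's implementation; the four-letter alphabet, the segment count parameter, and the handling of a flat series are assumptions.

```python
from bisect import bisect_right
from statistics import NormalDist, fmean, pstdev
from typing import Sequence


def sax(series: Sequence[float], segments: int, alphabet: str = "abcd") -> str:
    """Convert a numeric series into a SAX word with one symbol per segment."""
    if segments <= 0 or len(series) % segments != 0:
        raise ValueError("series length must be a positive multiple of segments")
    mu, sigma = fmean(series), pstdev(series)
    # Assumed convention: a flat series is mapped to all-zero z-scores.
    z = [0.0] * len(series) if sigma == 0 else [(x - mu) / sigma for x in series]
    width = len(series) // segments
    # Piecewise aggregate approximation: mean of each equal-width segment.
    paa = [fmean(z[i:i + width]) for i in range(0, len(z), width)]
    # Breakpoints cut N(0, 1) into len(alphabet) equiprobable regions.
    a = len(alphabet)
    breakpoints = [NormalDist().inv_cdf(k / a) for k in range(1, a)]
    return "".join(alphabet[bisect_right(breakpoints, v)] for v in paa)


print(sax([1, 2, 3, 4, 5, 6, 7, 8], segments=4))  # prints "abcd"
```

Fewer segments per series correspond to the larger segmentation sizes discussed above: the word gets shorter, so the data shrink further, while each symbol summarizes more raw observations and the reconstruction error grows.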

B. Quality Data Leads to Intelligent Decisions

The QoI provided by the data streams is determined in an objective and application-independent way. A normalized QoI value enables comparability and eases the stream selection. Distinct measurable QoI metrics, such as the frequency, are evaluated for every observation and compared against the initially annotated metric. This allows continuous observation of the performance of the sensors over their lifetime and ensures reliable execution of the applications using them.

The usage of an event ontology that describes the effects of incidents will enable the modeling of the mutual impact of various data streams. This information can further be used to evaluate the correctness of sensor observations and events.

C. On-Demand Data Discovery, Federation, and Complex Event Processing

In the smart city environment, smart city applications allow their users to submit real-time queries that can only be answered by complex event processing over various data streams. Considering the huge number of data streams available within a smart city infrastructure, automated discovery and composition of relevant data streams for complex event processing is a daunting task. The CityPulse framework enables CEP engines to perform complex event processing while catering for nonfunctional properties (e.g., completeness, timeliness, and accuracy) of the data streams contributing to the composition plan.

Using BF enumeration over the available data streams for discovery and composition results in exponential growth of the composition time, which is not scalable for real-time smart city applications. However, applying GA over a large repository of data streams resulted in better cost-effectiveness without much compromise on optimality. The results shown in this paper were obtained under a specific setting of the GA parameters. However, the performance of GA can vary with different settings of a number of parameters, such as initial population size, selection policy, and mutation rate. There is always a tradeoff between the processing time and the quality of the results produced by GA, and finding the best settings for the GA parameters depends heavily on the application domain. In the context of smart city applications, real-time applications can decrease the initial population size to reduce the processing time with a small compromise on the quality of the results. Moreover, the effectiveness of the GA is better realized over a large data stream repository [as depicted in Fig. 12(b)], which meets the requirement of large-scale stream federation and processing for smart city applications.

In summary, the data abstraction/aggregation was found to have two effects on real-time stream processing: 1) data aggregation can significantly decrease the I/O traffic of the stream engine and 2) data aggregation can reduce the stream rate so that the stream engine can process more complicated queries without becoming unstable. Although a slower streaming rate results in a slower warming-up period, once the warming-up phase ends the stabilized delay is similar for the same query, regardless of the streaming rate.

D. Future Work

In future work, we will further investigate the data aggregation process to obtain an adaptive segmentation size, where the system will perform aggregation only when there is an event. In addition, because time series data may not follow a Gaussian distribution, which can worsen the efficiency of the reduction, we will investigate ways to compensate for this with different data distributions. We will also work on event discovery and analysis in large-scale distributed smart city data streams. Reusing the developed techniques and models in different use-case scenarios and applications will also be explored in our future work.

REFERENCES

[1] M. Naphade, G. Banavar, C. Harrison, J. Paraszczak, and R. Morris, “Smarter cities and their innovation challenges,” Computer, vol. 44, no. 6, pp. 32–39, Jun. 2011.

[2] S. Kolozali, D. Puschmann, M. Bermudez-Edo, and P. Barnaghi, “On the effect of adaptive and nonadaptive analysis of time-series sensory data,” IEEE Internet Things J., vol. 3, no. 6, pp. 1084–1098, Dec. 2016.

[3] M. Hashmi et al., “Preliminary results from the cope study using primary-care electronic health records and environmental modelling to examine COPD exacerbations,” Brit. J. Gen. Pract., vol. 68, no. S1, 2018.

[4] L. Yang, W. Li, M. Ghandehari, and G. Fortino, “People-centric cognitive Internet of Things for the quantitative analysis of environmental exposure,” IEEE Internet Things J., vol. 5, no. 4, pp. 2353–2366, Aug. 2018.

[5] P. Barnaghi, A. Sheth, and C. Henson, “From data to actionable knowledge: Big data challenges in the Web of Things,” IEEE Intell. Syst., vol. 28, no. 6, pp. 6–11, Nov./Dec. 2013.

[6] P. Bellini, M. Benigni, R. Billero, P. Nesi, and N. Rauch, “Km4City ontology building vs data harvesting and cleaning for smart-city services,” J. Vis. Lang. Comput., vol. 25, no. 6, pp. 827–839, Dec. 2014.

[7] L. Sanchez et al., “SmartSantander: IoT experimentation over a smart city testbed,” Comput. Netw., vol. 61, pp. 217–238, Mar. 2014.

[8] D. Pfisterer et al., “SPITFIRE: Toward a semantic Web of Things,” IEEE Commun. Mag., vol. 49, no. 11, pp. 40–48, Nov. 2011.

[9] J. Soldatos et al., “OpenIoT: Open source Internet-of-Things in the cloud,” in Interoperability and Open-Source Solutions for the Internet of Things. Cham, Switzerland: Springer, 2015, pp. 13–25.

[10] S. De, F. Carrez, E. Reetz, R. Tönjes, and W. Wang, Test-Enabled Architecture for IoT Service Creation and Provisioning. Heidelberg, Germany: Springer, 2013.

[11] E. Kalampokis et al., “Exploiting linked data cubes with open-cube toolkit,” in Proc. 13th Int. Semantic Web Conf. (ISWC), 2014, pp. 137–140.

[12] P. Vlacheas et al., “Enabling smart cities through a cognitive management framework for the Internet of Things,” IEEE Commun. Mag., vol. 51, no. 6, pp. 102–111, Jun. 2013.

[13] F. Lécué et al., “STAR-CITY: Semantic traffic analytics and reasoning for CITY,” in Proc. 19th Int. Conf. Intell. User Interfaces, 2014, pp. 179–188.

[14] A. Bar-Noy et al., “Quality-of-information aware networking for tactical military networks,” in Proc. IEEE Int. Conf. Pervasive Comput. Commun. Workshops (PERCOM Workshops), 2011, pp. 2–7.

[15] C. Bisdikian, L. M. Kaplan, and M. B. Srivastava, “On the quality and value of information in sensor networks,” ACM Trans. Sensor Netw. (TOSN), vol. 9, no. 4, p. 48, 2013.

[16] J. Benesty, J. Chen, and Y. Huang, “On the importance of the Pearson correlation coefficient in noise reduction,” IEEE Trans. Audio, Speech, Language Process., vol. 16, no. 4, pp. 757–765, May 2008.

[17] K. Aberer, M. Hauswirth, and A. Salehi, “The global sensor networks middleware for efficient and flexible deployment and interconnection of sensor networks,” EPFL, Lausanne, Switzerland, Rep. LSIR-REPORT-2006-006, 2006.

[18] A. Bassi et al., Enabling Things to Talk: Designing IoT Solutions With the IoT Architectural Reference Model. Heidelberg, Germany: Springer, 2013.

[19] F. den Hartog, L. Daniele, and J. Roes, Study on Semantic Assets for Smart Appliances Interoperability: D-s1: First Interim Report, Eur. Commission, 2014.

[20] F. Cicirelli, G. Fortino, A. Guerrieri, G. Spezzano, and A. Vinci, “Metamodeling of smart environments: From design to implementation,” Adv. Eng. Informat., vol. 33, pp. 274–284, Aug. 2017.

[21] G. Aloi et al., “Enabling IoT interoperability through opportunistic smartphone-based mobile gateways,” J. Netw. Comput. Appl., vol. 81, pp. 74–84, Mar. 2017.

[22] A. Altomare, E. Cesario, C. Comito, F. Marozzo, and D. Talia, “Trajectory pattern mining for urban computing in the cloud,” IEEE Trans. Parallel Distrib. Syst., vol. 28, no. 2, pp. 586–599, Feb. 2017.

[23] F. Cicirelli et al., “Edge computing and social Internet of Things for large-scale smart environments development,” IEEE Internet Things J., vol. 5, no. 4, pp. 2557–2571, Aug. 2018.

[24] R. Agrawal, C. Faloutsos, and A. Swami, “Efficient similarity search in sequence databases,” in Proc. 4th Conf. Found. Data Org. Algorithms, 1993, pp. 69–84.

[25] A. Haar, “‘Zur theorie der orthogonalen funktionensysteme,’ German,” Mathematische Annalen, vol. 69, nos. 331–371, pp. 38–53, 1910.

[26] Z. R. Struzik and A. Siebes, “The haar wavelet transform in the time series similarity paradigm,” in Principles of Data Mining and Knowledge Discovery. Heidelberg, Germany: Springer, 1999, pp. 12–22.

[27] K. Chakrabarti, E. Keogh, S. Mehrotra, and M. Pazzani, “Locally adaptive dimensionality reduction for indexing large time series databases,” ACM Trans. Database Syst., vol. 27, no. 2, pp. 188–228, 2002.

[28] K. Chakrabarti, E. Keogh, M. Pazzani, and S. Mehrotra, “Dimensionality reduction for fast similarity search in large time series databases,” J. Knowl. Inf. Syst., vol. 3, no. 3, pp. 263–286, 2001.

[29] J. W. Cooley and J. W. Tukey, “An algorithm for the machine calculation of complex Fourier series,” Math. Comput., vol. 19, no. 90, pp. 297–301, 1965.

[30] C. Cassisi, P. Montalto, M. Aliotta, A. Cannata, and A. Pulvirenti, Similarity Measures and Dimensionality Reduction Techniques for Time Series Data Mining, Magnum Publ. LLC, 2012, pp. 71–96, ch. 3.

[31] H. André-Jönsson and D. Z. Badal, “Using signature files for querying time-series data,” in Proc. 1st Eur. Symp. Princ. Data Min. Knowl. Disc., 1997, pp. 211–220.

[32] Y.-W. Huang and P. S. Yu, “Adaptive query processing for time-series data,” in Proc. 5th Int. Conf. Knowl. Disc. Data Min., 1999, pp. 282–286.

[33] J. Lin, E. Keogh, S. Lonardi, and B. Chiu, “A symbolic representation of time series, with implications for streaming algorithms,” in Proc. 8th ACM SIGMOD Workshop Res. Issues Data Min. Knowl. Disc., 2003, pp. 2–11.

[34] D. M. Strong, Y. W. Lee, and R. Y. Wang, “Data quality in context,” Commun. ACM, vol. 40, no. 5, pp. 103–110, 1997.

[35] B. Stvilia, L. Gasser, M. B. Twidale, and L. C. Smith, “A framework for information quality assessment,” J. Amer. Soc. Inf. Sci. Technol., vol. 58, no. 12, pp. 1720–1733, 2007.

[36] C. Bisdikian et al., “Building principles for a quality of information specification for sensor information,” in Proc. IEEE 12th Int. Conf. Inf. Fusion, 2009, pp. 1370–1377.

[37] M. K. Aguilera, R. E. Strom, D. C. Sturman, M. Astley, and T. D. Chandra, “Matching events in a content-based subscription system,” in Proc. 8th Annu. ACM Symp. Princ. Distrib. Comput., 1999, pp. 53–61.

[38] E. Wu, Y. Diao, and S. Rizvi, “High-performance complex event processing over streams,” in Proc. ACM SIGMOD Int. Conf. Manag. Data, 2006, pp. 407–418.

[39] G. Cugola and A. Margara, “Processing flows of information: From data stream to complex event processing,” ACM Comput. Surveys (CSUR), vol. 44, no. 3, p. 15, 2012.

[40] T. Moser, H. Roth, S. Rozsnyai, R. Mordinyi, and S. Biffl, “Semantic event correlation using ontologies,” in On the Move to Meaningful Internet Systems: OTM 2009. Heidelberg, Germany: Springer, 2009, pp. 1087–1094.

[41] K. Taylor and L. Leidinger, “Ontology-driven complex event processing in heterogeneous sensor networks,” in The Semanic Web: Research and Applications. Heidelberg, Germany: Springer, 2011, pp. 285–299.

[42] K. Teymourian and A. Paschke, “Semantic rule-based complex event processing,” in Rule Interchange and Applications. Heidelberg, Germany: Springer, 2009, pp. 82–92.

[43] M. Alrifai and T. Risse, “Combining global optimization with local selection for efficient QoS-aware service composition,” in Proc. 18th Int. Conf. World Wide Web, 2009, pp. 881–890.

[44] M. C. Jaeger, G. Rojec-Goldmann, and G. Muhl, “QoS aggregation for Web service composition using workflow patterns,” in Proc. 8th IEEE Int. Enterprise Distrib. Object Comput. Conf. (EDOC), 2004, pp. 149–159.

[45] Q. Wu, Q. Zhu, and X. Jian, “QoS-aware multi-granularity service composition based on generalized component services,” in Service-Oriented Computing (LNCS 8274), S. Basu, C. Pautasso, L. Zhang, and X. Fu, Eds. Heidelberg, Germany: Springer, 2013, pp. 446–455.

[46] L. Zeng et al., “QoS-aware middleware for Web services composition,” IEEE Trans. Softw. Eng., vol. 30, no. 5, pp. 311–327, May 2004.

[47] R. Berbner, M. Spahn, N. Repp, O. Heckmann, and R. Steinmetz, “Heuristics for QoS-aware Web service composition,” in Proc. IEEE Int. Conf. Web Services (ICWS), Chicago, IL, USA, 2006, pp. 72–82.

[48] L.-J. Zhang and B. Li, “Requirements driven dynamic services composition for Web services and grid solutions,” J. Grid Comput., vol. 2, no. 2, pp. 121–140, 2004.

[49] G. Canfora, M. D. Penta, R. Esposito, and M. L. Villani, “A lightweight approach for QoS-aware service composition,” in Proc. 2nd Int. Conf. Service Orient. Comput. (ICSOC), 2004.

[50] C. Zhang, S. Su, and J. Chen, “A novel genetic algorithm for QoS-aware Web services selection,” in Proc. 2nd Int. Conf. Data Eng. Issues E Commerce Services (DEECS), 2006, pp. 224–235.

[51] C. Gao, M. Cai, and H. Chen, “QoS-aware service composition based on tree-coded genetic algorithm,” in Proc. 31st Annu. Int. Comput. Softw. Appl. Conf. (COMPSAC), vol. 1. Beijing, China, 2007, pp. 361–367.

[52] F. Karatas and D. Kesdogan, “An approach for compliance-aware service selection with genetic algorithms,” in Service-Oriented Computing (LNCS 8274), S. Basu, C. Pautasso, L. Zhang, and X. Fu, Eds. Heidelberg, Germany: Springer, 2013, pp. 465–473.

[53] F. Gao, E. Curry, and S. Bhiri, “Complex event service provision and composition based on event pattern matchmaking,” in Proc. 8th ACM Int. Conf. Distrib. Event Based Syst., Mumbai, India, 2014, pp. 71–82.

[54] M. A. Hossain, P. K. Atrey, and A. E. Saddik, “Modeling and assessing quality of information in multisensor multimedia monitoring systems,” ACM Trans. Multimedia Comput. Commun. Appl. (TOMCCAP), vol. 7, no. 1, p. 3, 2011.

[55] F. Gao, E. Curry, M. I. Ali, S. Bhiri, and A. Mileo, “QoS-aware complex event service composition and optimization using genetic algorithms,” in Proc. 12th Int. Conf. Service Orient. Comput., 2014, pp. 386–393.

[56] F. Gao, M. I. Ali, and A. Mileo, “Semantic discovery and integration of urban data streams,” in Proc. 5th Workshop Semantic Smarter Cities, 2014, pp. 15–30.

[57] D. F. Barbieri, D. Braga, S. Ceri, E. D. Valle, and M. Grossniklaus, “C-SPARQL: SPARQL for continuous querying,” in Proc. 18th Int. Conf. World Wide Web, 2009, pp. 1061–1062.

Sefki Kolozali received the B.Sc. degree in computer engineering from Near East University, Nicosia, Cyprus, in 2005, the M.Sc. degree from the University of Essex, Colchester, U.K., and the Ph.D. degree from the Queen Mary University of London, London, U.K. His thesis was entitled “Automatic Ontology Generation Based On Semantic Audio Analysis.”

He is a Research Fellow with the University of Essex. He was the Workpackage Leader of the large scale data-analytics component of the EU FP7 CityPulse project. He was a Research Associate in applied big data analysis with King’s College London, London, and a Research Fellow in large scale data analytics for the Internet of Things with the University of Surrey, Guildford, U.K. His main research interests include signal processing, machine learning, semantic Web technologies, and Internet of Things.

Maria Bermudez-Edo is a Lecturer (Assistant Professor) with the University of Granada, Granada, Spain. Her research has appeared in the proceedings of various peer-reviewed international conferences and journals. Her current research interests include semantic technologies, Internet of Things, and machine learning.

Nazli FarajiDavar received the M.Sc. and Ph.D. degrees in medical imaging and transfer learning in computer vision from the University of Surrey, Guildford, U.K., in 2015.

She is currently a Senior Research Fellow and a Data Scientist with King’s College London, London, U.K., and a Visiting Fellow with the University of Oxford, Oxford, U.K. Her research work focuses on machine learning and NLP applications in developing predictive patient trajectory models using data from EHR. Her current research interests include artificial intelligence, cognitive machine learning, transfer learning, NLP, image processing, cognitive robotics, eHealth, and Internet of Things.

Payam Barnaghi (S’04–A’05–M’06–SM’12) is a Professor of machine intelligence with the Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, Guildford, U.K. He was the Coordinator of the EU FP7 CityPulse project. His current research interests include machine learning, Internet of Things, semantic technologies, adaptive algorithms, and information search and retrieval.

Feng Gao received the B.Sc. degree in softwareengineering from Wuhan University, Wuhan, China,in 2008, and the M.Eng. degree (Hons.) in telecom-munication from Dublin City University, Dublin,Ireland, in 2009. He is currently pursuing thePh.D. degree (Hons.) at the Insight Centre for DataAnalytics, National University of Ireland, Galway,Ireland.

His current research interests include Semantic Web, complex event processing, and service computing.

Muhammad Intizar Ali received the Ph.D. degree (with distinction) from the Vienna University of Technology, Vienna, Austria, in 2011.

He is an Adjunct Lecturer, a Research Fellow, and a Project Leader with the Unit for Reasoning and Querying with the Insight Centre for Data Analytics, National University of Ireland, Galway, Ireland. His current research interests include semantic Web, data integration, Internet of Things (IoT), linked data, federated query processing, stream query processing, and optimal query processing over large-scale distributed data sources. He is actively involved in various EU-funded and industry-funded projects aimed at providing IoT-enabled adaptive intelligence for smart city applications and smart enterprise communication systems.

Dr. Ali is serving as a PC member of various journals, international conferences, and workshops. He is also actively participating in W3C efforts for standardization with the RDF Stream Processing Community Group and Web of Things Interest Group.

Alessandra Mileo is a Tenured Assistant Professor with the School of Computing and a Funded Investigator with the Insight Centre for Data Analytics, Dublin City University, Dublin, Ireland. She has developed a research program in knowledge representation and stream reasoning. Her current research interests include leveraging expressive reasoning and statistical relational learning for scalable decision making in dynamic environments.

Marten Fischer received the Diplom Informatiker degree (First Class Hons.) in computer engineering and computer science from the University of Applied Sciences Osnabrück, Osnabrück, Germany, in 2009. His thesis concerned service creation environments and service deployment.

He is currently a Research Assistant with the Mobile Communication Research Group, University of Applied Sciences Osnabrück, where he is involved in a research project covering the topics of automated test generation and reliability in sensor networks.

Thorben Iggena received the M.Sc. degree in computer engineering from the University of Applied Sciences Osnabrück, Osnabrück, Germany, in 2013.

He is a Research Fellow with the University of Applied Sciences Osnabrück. His master's thesis dealt with reliable transport of data packages in ad hoc, delay-tolerant networks. His current research interests include data processing in the context of smart cities, evaluation and analysis of information quality, and delay-tolerant routing algorithms.

Daniel Kuemper is a Research Associate with the Mobile Communication Research Group, University of Applied Sciences Osnabrück, Osnabrück, Germany. He was involved in the FP7 ICT IoT.est project and is currently involved in research within the CityPulse project. His current research interests include geospatial analysis, mobile computing, auto-deployment, and configuration technologies.

Ralf Tonjes received the Dipl.-Ing. degree in electrical engineering from the University of Hannover, Hanover, Germany, the M.Phil. degree in biomedical engineering from the University of Strathclyde, Glasgow, U.K., and the Dr.-Ing. degree from the University of Hannover.

He is a Professor with the University of Applied Sciences Osnabrück, Osnabrück, Germany, where he is heading the Mobile Communications Group. He was with Ericsson Research, Kista, Sweden, for several years. He has authored or co-authored over 100 papers and five patents. His current research interests include mobile and wireless communication networks and Internet of Things, in particular for smart cities and agriculture.