
J Supercomput
DOI 10.1007/s11227-011-0697-y

Discover and visualize association rules from sensor observations on the web

Meng Zhang · Byeong Ho Kang · Quan Bai

© Springer Science+Business Media, LLC 2011

Abstract Nowadays, Web-based applications have become common practice in environmental monitoring. These applications provide open platforms for users to discover, access and integrate near real-time sensor data collected from distributed sensors and sensor networks. To make use of the shared sensor data on the Web, conceptual models in a particular domain are normally adopted. However, most conceptual models require high-quality data and high-level domain knowledge. Such requirements greatly limit the application of these models. To overcome some of these limitations, this paper proposes a data-mining approach to analyze patterns and relationships among different sensor data sets. This approach provides a flexible way for users to understand hidden relationships in shared sensor data, and can help them to make better use of Web-based sensor systems.

Keywords Sensors and sensor networks · Web-based environmental monitoring · Data mining · Association rules · Knowledge discovery · Data presentation

M. Zhang (✉) · B.H. Kang
School of Computing and Information Systems, University of Tasmania, Hobart, Australia
e-mail: [email protected]

B.H. Kang
e-mail: [email protected]

Q. Bai
School of Computing and Mathematical Sciences, Auckland University of Technology, Auckland, New Zealand
e-mail: [email protected]

1 Introduction

Nowadays, more and more Web-based systems have been developed to publish and share observations collected by sensors and sensor networks [11]. The deployment of these systems enables users from different organizations to share, exchange and access sensor observations on an open platform (see Fig. 1), and makes it possible to integrate web-enabled sensors and sensor systems into global-scale collaborative sensing.

Fig. 1 Web-based sensor systems

With a huge amount of sensor data available on the Web, there is a strong need for approaches that can exploit the value of shared data and interpret raw sensor data in a more meaningful form. In most domains, these approaches are normally based on conceptual models, which place high requirements on data quality and domain expertise. However, a common feature of Web-based systems is that users do not necessarily possess strong domain knowledge or expertise. In addition, to achieve cross-domain collaboration, it is important to have a common method for analyzing and presenting shared data. Data mining provides a more flexible way to achieve knowledge discovery and presentation [8]. It can analyze hidden patterns and relationships among data from a data-driven perspective; that is, it can operate on the available data without high requirements on data quality and domain expertise. In this paper, we propose the use of association rules for analyzing environmental observations. The proposed approach can translate sensor observations into a more meaningful and useful form without requiring a high level of domain expertise.

2 Association rule mining from environmental observations

The process flow of our approach is illustrated in Fig. 2. Sensor observations are transformed into a suitable format and sent to the association rule analysis component, which generates a number of association rules. These rules are pruned and stored in a rule base. Finally, the stored rules can be presented in response to users' queries.

Fig. 2 The process flow of mining association rules from sensor observations

There are many conceptions and definitions of data mining in different contexts. In this research, we treat data mining as a practical topic that involves learning in a practical, not a theoretical, sense. It is used to find and describe structural patterns in data, as a tool to help explain the relationships between data and make predictions from them [4]. Moreover, data mining is widely applied, for example in detecting fraud, assessing risk and product retailing. Data mining allows previously unknown, valid patterns and relationships to be discovered in large datasets by using data analysis tools [12].

In the hydrological domain, data mining has been applied extensively in the areas of clustering, association rules and classification. These methods can be applied in some data warehouse environments [3]. Data-mining technology has also been developed to focus on the spatio-temporal environment and related events that control the distribution of living organisms [12].

2.1 Association rules method description

Association rule mining finds potential relationships in data and expresses them as rules that can be used for prediction. Classification rules are similar to association rules in that they can be used to predict the class of instances in a dataset. However, the main difference between the two is that association rules can predict any attribute of the dataset, not only the class. Moreover, the number of association rules may be massive and many of them are related, so the rule set needs to be reduced as well [12].

In addition, association rule mining leads to the discovery of associations and correlations among different items in related data sets [1]. A classic example is the market-basket problem [2]. In the transactional data sets of supermarkets, a rule like {beer} → {diapers} means that most customers who buy beer may buy diapers as well. Such rules then suggest to supermarket managers that they place beer and diapers together, as this may improve their profits. This example shows that association rules provide a useful mechanism for discovering relationships among the underlying data.

In this research, a rule is defined in the form A = {L => R}, where A is an association rule, and L and R are two disjoint sets of events E (L ∩ R = ∅). L is called the antecedent of A, and R is called the consequent of A. There are two important constraints for selecting useful rules, i.e., the minimum thresholds on support and confidence [5]. Support is the ratio of the number of transactions that include all items in the antecedent to the total number of transactions (see (1)). Confidence is the probability of finding the consequent in transactions under the condition that these transactions also contain the antecedent (see (2)).

S(L, R) = P(L) (1)

C(L, R) = P(R | L) (2)

There are a number of learning algorithms which can find association rules based on predefined support and confidence thresholds. In this research, the Apriori algorithm [1] was adopted to build up rules.

Fig. 3 The South Esk hydrological sensor web
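As a minimal sketch of the two measures as they are described in words above (support as the fraction of transactions containing the antecedent, confidence as the probability of the consequent given the antecedent), the following Python computes both for a toy market-basket data set. The function names and the four example baskets are illustrative assumptions and are not taken from the paper's data or from WEKA.

```python
from typing import FrozenSet, Sequence

Transaction = FrozenSet[str]


def support(transactions: Sequence[Transaction], antecedent: Transaction) -> float:
    """Fraction of transactions that contain every item of the antecedent, P(L)."""
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if antecedent <= t)
    return hits / len(transactions)


def confidence(
    transactions: Sequence[Transaction],
    antecedent: Transaction,
    consequent: Transaction,
) -> float:
    """Probability of the consequent among transactions containing the antecedent, P(R | L)."""
    with_antecedent = [t for t in transactions if antecedent <= t]
    if not with_antecedent:
        return 0.0
    with_both = sum(1 for t in with_antecedent if consequent <= t)
    return with_both / len(with_antecedent)


# Hypothetical baskets, invented purely for illustration.
baskets = [
    frozenset({"beer", "diapers"}),
    frozenset({"beer", "diapers", "bread"}),
    frozenset({"beer", "bread"}),
    frozenset({"milk", "bread"}),
]

rule_support = support(baskets, frozenset({"beer"}))
rule_confidence = confidence(baskets, frozenset({"beer"}), frozenset({"diapers"}))
print(rule_support, rule_confidence)
```

Here three of the four baskets contain the antecedent, and two of those three also contain the consequent. An algorithm such as Apriori searches for all rules whose measures exceed the chosen thresholds rather than scoring a single hand-picked rule.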

2.2 Data selection, collection and normalization

In this research, we chose the South Esk Hydrological Sensor Web (SEHSW, see footnote 1) as a test-bed for our approach [9]. SEHSW publishes sensor observations from the South Esk catchment, which covers an area of approximately 3350 square kilometers in north-eastern Tasmania, Australia. Figure 3 shows the distribution of sensors on the SEHSW. As SEHSW mainly focuses on collecting and publishing hydrological sensor observations, in this research we focus particularly on analyzing the relationships between rainfall events and other phenomena, e.g., humidity, air temperature, etc.

To this end, we first record the maximum or minimum values of rainfall events from the sensor web and extract them into a new database. Rainfall events are recorded by two different types of sensor, as shown in Fig. 4. The sensor on the left records rainfall like a large bucket: its value is updated and increased whenever rain falls, so the reading keeps increasing until the sensor is reset. To find the maximum rainfall for this sensor, we calculate the time gap between successive changes in value; the shorter the time gap, the higher the rainfall. On this basis, we can identify the time of maximum rainfall.
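The time-gap idea for the cumulative (bucket-style) sensor can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: readings are assumed to arrive as time-ordered (timestamp, cumulative value) pairs, each increase is assumed to be of comparable size, and the function name and sample readings are hypothetical.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple


def peak_rainfall_time(
    readings: List[Tuple[datetime, float]]
) -> Optional[datetime]:
    """Return the time at which the shortest gap between two increases ends.

    A short gap between successive increases of the cumulative value is
    taken as a sign of heavy rainfall. Decreases (sensor resets) and
    unchanged values are ignored.
    """
    change_times = []
    previous_value = None
    for timestamp, value in readings:
        if previous_value is not None and value > previous_value:
            change_times.append(timestamp)
        previous_value = value

    if len(change_times) < 2:
        return None  # not enough increases to measure a gap

    best_time, best_gap = None, None
    for earlier, later in zip(change_times, change_times[1:]):
        gap = later - earlier
        if best_gap is None or gap < best_gap:
            best_time, best_gap = later, gap
    return best_time


# Hypothetical readings, invented for illustration.
start = datetime(2010, 1, 1, 6, 0)
sample = [
    (start, 10.0),
    (start + timedelta(minutes=60), 10.2),
    (start + timedelta(minutes=90), 10.4),
    (start + timedelta(minutes=100), 10.6),
    (start + timedelta(minutes=180), 10.8),
]
peak = peak_rainfall_time(sample)
print(peak)
```

In the sample, the increases at 7:30 and 7:40 are only ten minutes apart, so 7:40 is reported as the time of maximum rainfall.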

The sensor on the right of Fig. 4 is another type of rainfall sensor. It records values like a small test tube with a small capacity: when the test tube catches rainfall, it updates at a fixed interval and records the rainfall value at that time. To capture the maximum value from this sensor, we can record the maximum recorded value and the maximum update frequency.

Fig. 4 The two different types of sensor record for rainfall events

Table 1 The description of location and phenomenon

| Location 1 | Index (J) | Phenomenon | Index (K) | Location 2 | Index (L) |
|---|---|---|---|---|---|
| Ben Lomond | 1 | Humidity | 1 | English Town Road | 1 |
| Story Creek | 2 | Air-Temperature | 2 | Valley Road | 2 |
| Ben Ridge Road | 3 | Evaporation | 3 | Hogans Road | 3 |
| Avoca | 4 | Transpiration | 4 | Mathinna Plains | 4 |
| Tower Hill | 5 | Wind-Run | 5 | Tower Hill Road | 5 |

1 http://www.csiro.au/sensorweb/au.csiro.OgcThinClient/OgcThinClient.html.

After recording the times of maximum rainfall, we assign an index to each location and phenomenon type to simplify the implementation. Table 1 lists the index values for the different locations and phenomenon types. In this table, Location 1 represents the locations where a corresponding phenomenon (e.g. humidity) may occur, and Location 2 represents the locations of rainfall events.

From Table 1, we can calculate the time gap between a rainfall event reaching its maximum value and another event reaching its maximum or minimum value within the same day by using (3). This gives two sets of time gaps, which indicate the time differences between the rainfall event reaching its maximum value and another event reaching its maximum and its minimum values. The time gaps are continuous values. However, association rule mining can only take nominal or ordinal data as input. To satisfy this requirement, the continuous values need to be converted to nominal form. Here, we use a simple clustering technique to perform the conversion. Figure 5 shows the clustering method we used. The method converts the continuous values into nominal items by generating different clusters, each of which covers a range of continuous time gaps. For instance, the cluster Max_Gap(0_4) covers time gaps between 0 and 4 hours.

Max_Gap_JK = Max_JK − Max_L,  Min_Gap_JK = Min_JK − Max_L (3)

where J, K and L are the indices defined in Table 1.
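The conversion from a continuous time gap to a nominal item can be sketched as below. The paper's actual cluster boundaries (Fig. 5) are not reproduced in the text, so this sketch assumes fixed-width clusters of 4 hours, chosen only because it reproduces the Max_Gap(0_4) example; the function names and the bin width are assumptions.

```python
import math
from datetime import datetime


def time_gap_hours(event_peak: datetime, rainfall_peak: datetime) -> float:
    """Time gap in hours between another event's peak and the rainfall peak, as in (3)."""
    return (event_peak - rainfall_peak).total_seconds() / 3600.0


def to_nominal(gap_hours: float, width: float = 4.0, prefix: str = "Max_Gap") -> str:
    """Map a continuous gap to a label such as 'Max_Gap(0_4)'.

    Assumption: equal-width bins [k*width, (k+1)*width); the lower bound is
    inclusive. The real clustering in the paper may use different ranges.
    """
    lower = math.floor(gap_hours / width) * width
    upper = lower + width
    return f"{prefix}({lower:g}_{upper:g})"


# Hypothetical peaks on the same day, invented for illustration.
rainfall_peak = datetime(2010, 1, 1, 9, 0)
humidity_peak = datetime(2010, 1, 1, 12, 0)

gap = time_gap_hours(humidity_peak, rainfall_peak)
label = to_nominal(gap)
print(gap, label)
```

A minimum-value gap can be labelled the same way by passing `prefix="Min_Gap"`.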


Fig. 5 Transfer data into nominal style

2.3 WEKA workbench

We described the data preprocessing steps in the previous subsections. Association rule mining can now be run on a platform or workbench. There are a number of data-mining tools which can conduct association rule mining. In this research, we selected the WEKA workbench to analyze hydrological data. WEKA, which was developed by the University of Waikato in New Zealand, provides a uniform interface to many algorithms for pre- and post-processing, and for evaluating the results of learning schemes on any dataset. The WEKA workbench not only includes a library containing a number of data analysis algorithms but, more importantly, also allows users to modify the algorithms in the library or add their own plug-ins based on their needs [10].

WEKA requires a specific input file format named ARFF. Figure 6 shows the data format of ARFF files. The attributes refer to the titles of items. The values in braces after each attribute are the possible values of that attribute. The data after the label “@data” are the values of items in transactions.
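To make the file layout concrete, the following sketch writes nominal transactions in ARFF form: an `@relation` line, one `@attribute` line per attribute with its possible values in braces, and the rows after `@data`. The relation name, attribute names and values are hypothetical and are not copied from Fig. 6; the sketch also assumes that no value contains spaces or commas, which would require quoting.

```python
from typing import Dict, List


def to_arff(
    relation: str,
    attributes: Dict[str, List[str]],
    rows: List[Dict[str, str]],
) -> str:
    """Render nominal data as ARFF text.

    `attributes` maps each attribute name to its allowed nominal values;
    `rows` holds one dictionary per transaction.
    """
    lines = [f"@relation {relation}", ""]
    for name, values in attributes.items():
        lines.append(f"@attribute {name} {{{','.join(values)}}}")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(row[name] for name in attributes))
    return "\n".join(lines) + "\n"


# Hypothetical attributes and transactions, invented for illustration.
attributes = {
    "item": ["rainfall", "humidity"],
    "location": ["BenLomond", "Avoca"],
    "max_gap": ["0_4", "4_8"],
}
rows = [
    {"item": "humidity", "location": "Avoca", "max_gap": "4_8"},
    {"item": "humidity", "location": "BenLomond", "max_gap": "0_4"},
]

arff_text = to_arff("hydrology", attributes, rows)
print(arff_text)
```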

2.4 Generate association rules

We built a dataset of 443 instances to generate rules. We used 400 instances to generate rules and set aside 43 instances (10%) to evaluate them. We set the support threshold range from 10% to 100% and the confidence threshold to 0.5. Because of the nature of the hydrological data, the confidence threshold was not set very high. After mining the hydrological data, we obtained many rules describing the relationships among hydrological events. Different rules have different support and confidence values. Figure 7 shows the distribution of confidence and support for 17 rules. The X axis shows the confidence value and the Y axis shows the support value as a number of instances out of the total of 443; each point marks the support of a rule at its confidence value. For instance, a support value of 125 means that 125 instances satisfy the rule. We filtered out the rules which have a big gap between their support and confidence values.

Fig. 6 ARFF data file

Fig. 7 Distribution of support and confidence

Rules are of little use if the gap between their support and confidence values is large. For example, the rule {(item = rainfall, location = StorysCreek) => (compardlocation = Avoca, max_gap = [3-7])} has a confidence of 100% but a support of 1, which means that only 1 of the 400 instances supports this rule. The phenomenon is therefore a rare event, and the related rule is useless.

After filtering all of the rules, we obtained 10 rules, each supported by more than 1/3 of the total instances. We then used another 44 instances to validate the rules. Table 2 displays the results and evaluation of the association rules.
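The filtering step can be sketched as a simple threshold test on each rule's support count and confidence. The `Rule` record, the `prune` function and the three example rules are hypothetical; only the thresholds (support above 1/3 of the instances, confidence of at least 0.5) follow the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Rule:
    antecedent: str
    consequent: str
    support_count: int  # number of instances supporting the rule
    confidence: float   # value between 0 and 1


def prune(
    rules: List[Rule],
    total_instances: int,
    min_support_fraction: float = 1 / 3,
    min_confidence: float = 0.5,
) -> List[Rule]:
    """Keep rules supported by more than the given fraction of instances
    and meeting the confidence threshold."""
    return [
        rule
        for rule in rules
        if rule.support_count / total_instances > min_support_fraction
        and rule.confidence >= min_confidence
    ]


# Hypothetical candidate rules, invented for illustration.
candidates = [
    Rule("max_gap=[3-7]", "item=humidity", 160, 0.83),
    Rule("item=rainfall,location=StorysCreek", "max_gap=[3-7]", 1, 1.0),
    Rule("item=evaporation", "max_gap=[0-4]", 200, 0.40),
]

kept = prune(candidates, total_instances=400)
print(kept)
```

The second candidate mirrors the rare-event case discussed above: full confidence, but supported by a single instance, so it is discarded.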

Table 2 The result of association rules

| Rules | Number | Accuracy (evaluated on 40 instances) |
|---|---|---|
| Total | 10 | 80% (average) |
| With confidence > 80% | 4 | 85% |
| With confidence < 80% | 6 | 75% |

After the analysis, we obtained rules for the hydrological data. The following are some examples:

Rule = {(max_gap:[3-7]) => (item: humidity)} conf 0.83

This rule means that, regardless of the location, humidity should attain its maximum value 3 to 7 hours after rainfall attains its maximum value. For instance, if a rainfall event peaks at 9:00 am, the humidity will attain its maximum value between 12:00 pm and 4:00 pm. The support of this rule is 20% for max_gap and 40% for humidity. The rule has a high confidence of 83%. Based on this rule, we can identify the phenomenon type and the time gap between humidity and rainfall. In addition, we can predict the time of maximum humidity from the time of maximum rainfall.
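Applying such a rule for prediction amounts to adding the bounds of the time gap to the rainfall peak time. The sketch below reproduces the 9:00 am example; the function name is an assumption, and the date is arbitrary.

```python
from datetime import datetime, timedelta
from typing import Tuple


def predict_peak_window(
    rainfall_peak: datetime, min_gap_hours: float, max_gap_hours: float
) -> Tuple[datetime, datetime]:
    """Return the earliest and latest expected peak time of the other phenomenon."""
    earliest = rainfall_peak + timedelta(hours=min_gap_hours)
    latest = rainfall_peak + timedelta(hours=max_gap_hours)
    return earliest, latest


# Rainfall peaks at 9:00 am; the rule gives a gap of 3 to 7 hours.
rain_peak = datetime(2010, 1, 1, 9, 0)
window = predict_peak_window(rain_peak, 3, 7)
print(window[0].strftime("%H:%M"), window[1].strftime("%H:%M"))
```

The rule's confidence (0.83 here) indicates how often the actual peak is expected to fall inside this window.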

Rule: {(Item = airtemperature), (location = BenRidgeRoad) => (max_gap[1.5-9])} conf 0.62

This rule relates most of the attributes: location, item and value type. In detail, the rule states that at Ben Ridge Road the gap between the maximum air temperature and the maximum rainfall is 1.5 to 9 hours. The range of the time gap is larger than in the previous rule because this rule involves more items. The confidence is also lower, at only 62%. In addition, this time range (1.5–9) covers almost one third of a day, which means that if we want a more precise value from this rule, for example by making the time range smaller, the confidence will decrease sharply, even below 40%. This kind of rule is therefore not as good as the previous one, even though it includes more items.

The rules are basically related to phenomenon types and time gaps. Based on the identified rules, we can find relationships between specific phenomena (e.g. rainfall and humidity) and, furthermore, make some predictions of hydrological events.

In general, traditional hydrological models can generate more accurate results, but at very high cost. For example, a hydrological model can give predictions about a specific location and phenomenon, e.g., the water level at location A will increase by 10 cm in the next 10 hours. However, these results depend on high-quality input data. In addition, these models may require over 10 years of data for calibration and training purposes. Such requirements lead to very high data preparation costs and greatly limit the applications of the models. On the other hand, data-mining methods have much lower requirements on data quality. Most data-mining methods are purely data-driven; that is, the discovered information or knowledge is based on the available data. Hence, even without a long period of high-quality data, we can still find useful patterns in the data. In addition, these patterns are easy for general users with minimal domain knowledge to understand.

3 Rule presentation

In order to help users understand the discovered knowledge, data presentation becomes an important issue for the success of the approach. Data presentation is a suitable way to turn obscure data into perceptible information [5, 7]. A good presentation makes data easy to understand and conveys clear information from the data to the potential audience. Data presentation can also prune redundant data and provide a friendly interface.

Fig. 8 Information visualization techniques

According to Keim [6], different types of data call for different visualization methods. Figure 8 describes the general techniques for data visualization for different data types. Data visualization methods can handle almost all kinds of data, so they can not only present the raw sensor data in the sensor web, but also visualize the relationships between data and hydrological events.

The aim of data visualization is to simplify complex or large volumes of data and present them through a friendly interface. In addition, the success of a visualization is judged against a user preference model. Building a user preference model generally requires some AI technique to provide its input, which does not work well for complex domains. A reasonable approach is therefore to build interactions that allow users to support the decisions and filter the information themselves. Building a user preference model can also help both domain experts and ordinary users to understand complex processes or information easily.

In this research, we focused on the presentation of discovered association rules and tried to embed the presentation within the current interface of the South Esk Hydrological Sensor Web. In order to describe the rules clearly, we built a tree view structure to describe the relationships between locations, sensors and phenomenon types (see Fig. 9). This helps users to understand the relationships between observed phenomenon types and locations.

Fig. 9 Tree view structure of the sensor web

Fig. 10 Application of rules from data mining

To present the rules, we provide a set of selection boxes that allow users to choose their preferred rules (see Fig. 10). Based on the “maximum value time” of a rainfall event, there are four selection boxes that allow users to select locations, phenomenon types, value types and time gaps. When a user sets attributes in any three boxes, the value of the remaining box can be generated from the association rules in the rule base. For example, if the rainfall attains its maximum value at 10:00 am, and we select the location as BenLomond, the phenomenon type as Humidity and the value type as Maximum Value, then the rules are invoked and show that the maximum value of humidity will occur at 1:00 pm on the same day. In addition, the application gives the support and confidence values of the association rule. This application of data presentation extends the rules obtained from the data-mining approach in order to visualize them and make them easy to understand. It also provides a friendly interface that lets users manipulate rules and learn how the rules work with the sensor data.
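The selection-box behaviour can be sketched as a lookup over the rule base: the user fixes three attributes and the matching rules supply the fourth, together with their support and confidence. The storage layout, the `lookup` function and every value in `RULE_BASE` are assumptions made for illustration; the paper does not describe how its rule base is stored.

```python
from typing import Dict, List, Tuple

ATTRIBUTES = ("location", "phenomenon", "value_type", "time_gap_hours")

# Hypothetical rule base; all values are invented for illustration.
RULE_BASE: List[Dict[str, object]] = [
    {"location": "BenLomond", "phenomenon": "Humidity", "value_type": "Maximum",
     "time_gap_hours": 3, "support": 0.20, "confidence": 0.83},
    {"location": "BenRidgeRoad", "phenomenon": "Air-Temperature", "value_type": "Maximum",
     "time_gap_hours": 5, "support": 0.15, "confidence": 0.62},
]


def lookup(
    rule_base: List[Dict[str, object]], selected: Dict[str, object]
) -> List[Tuple[Dict[str, object], float, float]]:
    """Return (missing attribute values, support, confidence) for each matching rule."""
    missing = [name for name in ATTRIBUTES if name not in selected]
    matches = []
    for rule in rule_base:
        if all(rule[name] == value for name, value in selected.items()):
            filled = {name: rule[name] for name in missing}
            matches.append((filled, rule["support"], rule["confidence"]))
    return matches


selection = {"location": "BenLomond", "phenomenon": "Humidity", "value_type": "Maximum"}
answers = lookup(RULE_BASE, selection)
print(answers)
```

With a rainfall peak at 10:00 am, a returned gap of 3 hours corresponds to the 1:00 pm humidity peak in the example above.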

Figure 11 shows the integrated interface for visualizing data-mining rules together with the concept of the hydrological sensor web. It provides functions that let people view and search for related locations and phenomena in the SE sensor web. It also allows users to manipulate the interface to understand the rules.


Fig. 11 The integrated interface in sensor web

4 Conclusion and future works

Data analysis and presentation play an important role in Web-based environmental monitoring. In this research, we focused on using data-mining approaches to achieve knowledge discovery. We selected association rule analysis as the approach to discover relationships and patterns among different sensor observations. Compared with traditional conceptual models, the proposed approach has lower requirements on data quality and domain expertise. In this paper, we chose SEHSW as a test-bed, and introduced the processes of data collection, data preprocessing and data analysis in detail. In addition, we introduced our data presentation method, which can visualize discovered patterns and rules in a meaningful way. Future work will investigate other data-mining techniques that can be applied in Web-based sensor systems. We will also work on the modeling of user preferences, in order to develop a more user-friendly interface for data and knowledge presentation.

References

1. Agrawal R, Imielinski T, Swami A (1993) Mining association rules between sets of items in large databases. In: Proceedings of the ACM SIGMOD international conference on management of data, pp 207–216

2. Agrawal R, Srikant R (1994) Fast algorithms for mining association rules in large databases. In: Proceedings of the 20th international conference on very large data bases, San Francisco, pp 487–499

3. Han J (1998) Toward on-line analytical mining in large databases. SIGMOD Rec 27(1):97–107

4. Ian H (2005) Data mining: practical machine learning tools and techniques. Morgan Kaufmann, San Mateo

5. Jiawei H, Micheline K (2006) Data mining: concepts and techniques. Morgan Kaufmann, San Mateo

6. Keim D (2002) Information visualization and visual data mining. IEEE Trans Vis Comput Graph 7(1):100–107

7. Klein A, Lehner W (2009) Representing data quality in sensor data streaming environments. Proc ACM J Data Inform 1(2)

8. Liang X, Liang Y (2001) Applications of data mining in hydrology. In: Proceedings of the IEEE international conference on data mining, pp 617–620

9. Liu Q, Bai Q, Terhorst A (2010) Provenance-aware hydrological sensor web. In: The proceedings of hydroinformatics conference, Tianjin, China, pp 1307–1315

10. Mark H, Eibe F, Geoffrey H, Bernhard P, Peter R, Ian HW (2009) The WEKA data mining software: an update. ACM SIGKDD Explor Newsl 11(1):10–18

11. Open Geospatial Consortium (2007) OGC sensor web enablement: overview and high level architecture. Technical Report OGC 07-165

12. Su F, Zhou C, Lyne V, Du Y, Shi W (2004) A data-mining approach to determine the spatio-temporal relationship between environmental factors and fish distribution. Ecol Model 174(4):421–431