situation awareness in web-based environmental monitoring ...interface demo to visualize the data...
TRANSCRIPT
Situation Awareness in Web-based
Environmental Monitoring Systems
By
Meng Zhang, MComp
A dissertation submitted to the
School of Computing and Information System
In partial fulfilment of the requirements for the degree of
Master of Computing
October, 2010
University of Tasmania
I
Declaration
I, Meng, Zhang, declare that this thesis contains no material which has been accepted
for the award of any other degree or diploma in any tertiary institution. To my
knowledge and belief, this thesis contains no material previously published or written
by another person except where due reference is made in the text of the thesis.
Meng, Zhang
II
Abstract
The Tasmanian ICT of CSIRO developed a Sensor Web test-bed system for the
Australian water domain in south-east of Tasmania. This system provides an open
platform to access and integrate near real time water information from distributed
sensor networks.
Traditional hydrological models can be adopted to analyse the data on the Sensor Web
system. However, the requirements on high data quality and high level domain
knowledge may greatly limit the application of these models. To overcome some these
limitations, this project proposes a data mining approach to analyse patterns and
relationships among different hydrological events. This approach provides a flexible
way to make use of data on the Hydrological Sensor Web.
To make data mining rules easy to understand, this project developed the user friendly
interface demo to visualize the data mining rules and describe the application of sorts
of rules.
III
Acknowledgement
I would like to thank everyone that has supported and helped on completed this thesis, especially my supervisors, Professor Byeong Ho-Kang and Dr Quan Bai (Tasmania ICT Centre, CSIRO). I am deeply indebted to your valuable, welcome, timely and plentiful feedback. Without you I would still be getting big headache to figure out my research work. Great thanks to you for sharing your life experience and brighten up my future.
I also would like to thank Mr Andrew Pratt (Tasmania ICT Centre, CSIRO) for providing the resource about the Sensor Web and supporting about the basic system about the operation of the South-Esk Sensor web.
Also, I would like to thank to Mr. Chris Macgeorge and Mr. Simon McCulloch from the Bureau of Metrology. They give me perfect feedback about the data mining rules and explain the domain knowledge in water phenomenon patiently.
Most importantly, I am forever indebted to my family members for their physical and mental support and encouragement throughout this year.
IV
Table of Content
ABSTRACT ......................................................................................................................................... III
1. INTRODUCTION .......................................................................................................................... 1
2. LITERATURE REVIEW .............................................................................................................. 3
2.1 SITUATIONAWARENESS ............................................................................................................. 3
2.1.1 SA in hydrological sensor system ....................................................................................... 3
2.2 SENSORNETWORKDATA ........................................................................................................... 4
2.2.1 The features of Sensor data .................................................................................................. 4
2.2.2 Relevance of the research .................................................................................................... 5
2.3 DATAMININGMETHODS .......................................................................................................... 5
2.3.l Clustering algorithms .......................................................................................................... 6
Related technologies for clustermg. .............. . .. . ........... .... ..... . .................... . .............................. 7
Aims for hydrological sensor web by using clustering methods.. . ..................................................... 10
2.3.2 Association rules ................................................................................................................ 10
Examples of association rules.......... . ................. . . .. ........... .... . . ................................................. I 0
Related technology ... ....... . .. .. .... .............. .. ............... .. . . .. . ....... . ... .... . ................. . . .. ........ 11
Relevant of the research..... . ... . ........ . .. ..... .... ........ . ...... ............ . . . . . ..... .. ................................. 12
2.4 DATA PRESENTATION AND VISUALIZATION ............................................................................. 13
Data type m visuahzation ................ .. .................................................................................. 13
3. METHODOLOGY ....................................................................................................................... 15
3.1 INTRODUCTION ....................................................................................................................... 15
3 .2 BUILDING THE DATA MINING MODEL ....................................................................................... 16
3.3 IMPLEMENTATION ................................................................................................................... 17
3.3. l Data collect10n and normalization ..................................................................................... 17
3.3.1.1 Data Collection .............................................................................................................. 17
3.3.1.2 Data normalization ........................................................................................................ 19
3.3.2 WEKA workbench based analysis tools ............................................................................ 20
3.3.2.1 Data preparation for WEKA .......................................................................................... 21
3.3.2.2 Using Explore to input dataset.. ..................................................................................... 22
3.3.3 Generate Association Rules ............................................................................................... 23
3.3.3.l Rules filtering ................................................................................................................ 23
3 .3 .3 .2 Rules analysis ................................................................................................................ 25
3 .3 .4 Rebuild the dataset and find more useful rules .................................................................. 26
3.3.4.1 Build up new dataset ..................................................................................................... 27
3.3.4.2 New rules analysis ......................................................................................................... 28
4. DISCUSSION ............................................................................................................................... 30
4.1 Introduction ....................................................................................................................... 30
4.2 Discussion with domain experts ........................................................................................ 31
4.2.1 Meeting with Chris Macgeorge ......................................................................................... 31
4.2.2 Meeting with Simon McCulloch ....................................................................................... 32
v
4.3 Implication for comparing different models ...................................................................... 33
4.3.1 Traditional hydrological model. ......................................................................................... 34
4.3 .2 Data mining approaches vs. hydrological model ............................................................... 34
4.4 Rule Presentation ............................................................................................................... 35
4.4.1 Interface design .................................................................................................................. 35
4.4.2 Implementation .................................................................................................................. 39
5. CONCLUSION & FURTHER WORK ...................................................................................... 41
5.1 CONCLUSION .......................................................................................................................... 41
5.2 CONTRIBUTIONS ..................................................................................................................... 42
5.3 FuTUREWoRK ....................................................................................................................... 42
6. REFERENCE ............................................................................................................................... 44
7. APPENDIX ................................................................................................................................... 48
VI
List of Figures and Tables
Figure 1-1: Conceptual Architecture of Sensor Web ................................................................. 1
Figure 1-2: Sensor distribution on the South Esk Hydrological Sensor Web ............................ 2
Figure 2-1: Data Clustering ....................................................................................................... 6
Figure 2-2: stages in clustenng ................................................................................................. 6
Figure 2-3: Merge and Split operation ...................................................................................... 9
Figure 2-4: the definition of the Cobweb algorithm .................................................................. 9
Figure 2-5: Information Visualization Techniques .................................................................. 14
Figure 3-1: The process flow of mining data .......................................................................... 16
Figure 3-2: Data mining process flow ..................................................................................... 17
Figure 3-3: The two different types of sensor record about rainfall event .............................. 18
Figure 3-4: Formula of calculating value gap ......................................................................... 19
Figure 3-5: Transfer data into nominal style ........................................................................... 20
Figure 3-6: The imtial ARFF file in hydrological data ........................................................... 22
Figure 3-7: Interface ofWEKAExplorer ................................................................................ 23
Figure 3-8: Distribution of support and confidence ................................................................ 24
Figure 3-9: The advanced version of the input dataset... ......................................................... 28
Figure 4-1: Simple process of traditional data model ............................................................. 32
Figure 4-2: Tree view structure of the sensor web .................................................................. 36
Figure 4-3: Related location and status in specific phenomenon ............................................ 37
Figure 4-4: Application of rules from data mining ................................................................. 38
Figure 4-5: The integrated interface in sensor web ................................................................. 38
Figure 4-6: Rules selection process flow ............................................................................... .40
Table 3-1: The description oflocation and phenomenon ........................................................ 19
Table 3-2: The result of association rules ................................................................................ 24
Table 3-3: The new result of association rules ........................................................................ 29
VII
Introduction
1. Introduction
The Sensor Web is a distributed information system that connects sensors, observation
archives, processing systems and simulation models using open web service interfaces
and standard information models (Open Geospatial Consortium, 2007). As shown in
Figure 1-1 , the Sensor Web is targeting at providing an open platform for users in
different domains and organizations to share their data, information and knowledge. It
can enable the integration of web-enabled sensors and sensor systems for a global
scale collaborative sensing.
The Tasmanian ICT Centre, as part of CSIRO Water for Healthy Country Flagship
project, has developed a Sensor Web system called the South Esk Hydrological
Sensor Web (SEHSW) (Liu et al. 2010). It collects sensor data from five water
agencies, and real-timely monitors the South Esk catchment, which covers an area of
approximately 3350 square kilometres in north-east of Tasmania. SEHSW makes the
South Esk catchment a sensor-rich environment with an opportunity to study the
hydrology phenomenon in a catchment and test-bed to study the development of
Sensor Web systems. Figure 1-2 shows sensor distribution on the SEHSW.
(
till'. -~-;l!l~'!lllh~·i*r~
.:;r
1,--··' • l-
Figure 1-1: Conceptual Architecture of Sensor Web
Introduction
Figure 1-2: Sensor distribution on the South Esk Hydrological Sensor Web
With the deployment of the SEHSW, huge amount of real time data are collected
from the physical environment and shared on an open platform. Meanwhile the value
of data needs to be exploited by some data analysis and knowledge discovery
approaches. In the water domain, hydrological data are normally manipulated and
interpreted by traditional hydrological models. These models are very powerful in
stream forecasting, water quality monitoring, etc. However, most hydrological models
have some limitations which can block their applications:
• Most of hydrological models ask for high level domain knowledge to generate the
structure and analyse data.
• Most of hydrological models require high quality data. Due to the particularity of
water related data, the real water data ask for the integrity and veracity. The data
also need to manipulate accuracy during the process.
• Most of hydrological models have the specific region.
Compared with hydrological models, data mining provides a more flexible way to
achieve knowledge discovery (Liang et al. 2001). It can analyse hidden patterns and
relationships among data from a data-driven perspective. Namely, it can be operated
based on available data. In this paper, we propose the use of association rule based
approach to analyse hydrological data. The proposed approach can interpret data on
the SEHSW to useful information without high domain expertise.
2
Literature Review
2. Literature Review
T his chapter will describe some issues review which focused on the understanding of
situation awareness and sensor network data. In addition, data mining methods and
data visualization related information also take into the consideration. It discussed the
related knowledge and selected suitable methods to become the points of this
research.
2.1 Situation awareness
According to Endsley's definition (1995a), Situation Awareness (SA) is "the
perception of the elements in the environment within a volume of time and space, the
comprehension of their meaning, and the projection of their status in the near future".
For environmental monitoring systems, SA means that to find ways to make
improvements for the explanation of the environmental systems. SA can be described
as a theory, an activity, a product or an understanding process (Durso &
Sethumadhavan 2008). Some people have different views about the definition of the
SA. For example, according to Garsoffky, Schwan and Hesse (2002), the video of
soccer goals can give the participants different impression to recognize the memory of
the details in that radio. Hence, researchers in different domains may hold different
opinions and interpretations about SA. In this project, SA is the specific awareness
related to events or phenomena in the nature environment.
2.1.1 SA in hydrological sensor system
Different people may hold different opinions about situations in the water domain. For
example, bush walking amateur may interest on extreme weathers such as snow and
storm. Water managers may interest on related phenomena related with catchment
statuses, e.g., humidity, soil moisture, etc. In this research, we focus on the
interpretation of sensor data, so that more useful information can be provided to users
3
Literature Review
to identify situations based on their requirements.
2.2 Sensor network data
With the development of sensor techniques, various kinds of sensors involving video
cameras, thermometers, obstacle sensors have been used in different domains (Ikuhisa
et al. 2009). Sensors can be installed by different companies or organizations. The
data collected by sensors can not only used for the initial purposes of the sensor
owners exclusively, but also many other people if the sensor data can be shared and
published to the public. For an instance, observations of a weather station can not only
provide useful meteorological information like wind speed and air temperature, but
also road safety related information such as the melting speed of road surface ice
(Anthony & Vinny 2004). Take another example, a calculagraph can be used to
identify and respond the changes in spatial context in mobile computing devices
(Anthony & Vinny 2004). Moreover, data collected by sensors can be processed,
interpreted in different ways, and as a result, different interpretations may fit for
purposes of different users (Peter et al. 2005).
2.2.1 The features of Sensor data
Observations from sensors can be generated and updated very frequently. As a result,
it is a big task to navigate and manipulate sensor data (Kiyoharu et al. 2004).
Especially for large scale sensor networks, a wide range of sensors, which can deliver
numerical, discredited and digitized data streams, are used to monitor and manage the
nature systems (Klein & Lehner 2009). In a sensor network, a sensor node may only
possess very limited sensing and computing capabilities. Sensor nodes in a sensor
network can organize and communicate themselves with others to transmit power
control and antenna in order to adjust the transmission. Namely, sensor nodes can
work individually for simple tasks like collecting audio, seismic, and other types of
data, and then higher level tasks can be achieved through collaborations within a
sensor network (Stephanie 2001) .
4
Literature Review
Sensor networks can be used to observe complex phenomena and collect high level
information. Environmental monitoring and event detection are two basic applications
of sensor networks. Environmental monitoring can collect and utilize the data in
scientific or other applications. On the other hand, event detections can only focus on
detecting specific events. The data in sensor network is expected to be correlated in
both spatial and temporal domain (Kevin et al. 2009). Sensor networks normally can
collect huge amount of data which makes data gathering an important issue. In a
sensor network, the proposed compressive data gathering can reduce the
communication cost of the global scale without including intensive computation or
complex transmission control (Chong et al. 2009).
2.2.2 Relevance of the research
Sensor networks can provide huge amount of sensor data, but all of them are raw data.
The purpose of sensor network is to provide useful information for users from
collected data which means the ways to get information from data are important. In
addition, it is also necessary to find suitable ways to disposal raw data and find the
relationship between every sensor from the data.
2.3 Data Mining Methods
There are a number of definitions about data mining. In this research, we adopt the
following data mining definition: data mining is a practical topic and concludes
learning in a practical to find and describe structural patterns of data like tools to help
explain the relationships between data and make some predictions from it (Ian 2005).
Data mining applications are widely applied for detecting fraud, assessing risk, and
product retailing. Data mining allows discovering previously unknown, valid patterns
and relationships in large dataset by using of data analysis tools (Su et al. 2004).
For many domains, data mining methods play important roles especially in the aspect
of data classification, quality assurance and pattern discovery. Data mmmg
5
Literature Review
approaches can be included in data warehouse environments (Han 1998). These
methods provide a generic way for data analysis (Su et al. 2004).
2.3.1 Clustering algorithms
Clustering techniques (see Figure 2-1) can be used without predefined classes (Ian
2005). According to JAIN, MURTY & FLYNN (1999), clustering is an unsupervised
classification of information or data items based on their features. The outputs of
clustering can be expressed different ways. Namely, one instance can belong to a
unique cluster or a number of different clusters (Ian 2005).
>'
' > '
"
' ,,, >
Hl
' '
x 1'. '
' j
J i
'
~
j
6 ,,
(i>I
Figure 2-1: Data Clustering
~
Data clustering normally can be achieved in three steps, 1.e., feature selection,
inter-pattern similarity and grouping (see Figure 2-2). Feature selection is to identify
the effectiveness when clustering data and feature extraction to generate the salient
features. Pattern presentations are meant of the quality and quantity of the patterns.
Grouping is to divide the data into suitable patterns (Jain 1999).
Pattt..'rns Feature Paticm Intt:rpalh:m Clui;tc Sekctfon! SlmUarity Grouping Ext1adion Repren11t11ti<mi~
t t l
Figure 2-2: Stages in Clustering
6
Literature Review
In general, the clustering methods can be used to normalize data and transfer
continuous data into nominal data type. It is obvious that the clustering can improve
the quality of data and divide the data into reasonable groups.
Related technologies for clustering
To cluster the data, it needs to set some mechanism to manipulate the data. One of the
classic clustering methods is called k-means clustering. k-means clustering is
normally operated in the following steps (Ian 2005):
1. Define a parameter k which indicates the number of clusters;
2. Based on the ordinary Euclidean distance metric, all distance can be assigned
to their closet group centre;
3. The centroid of instances is calculated which called the process of "means".
4. New cluster centres can be generated.
However, the problem of k-means is exist which is about the number of clusters and
the location problem. One of them is to minimize of the sum of distance of the nearest
centre called Euclidean k-medians, another one is to minimize the maximum distance
from every point to its closets centre called k-centre problem. To solve these problems,
the k-mean algorithm is generated which is based on a simple iterative scheme for
finding a locally minimal solution (Tapas 2002).
To compare with k-mean algorithm, another effective method called EM
(expectation-maximization) algorithm can have a better performance in statistical
estimation problems which involve incomplete data or the problems like mixture
estimation (Borman 2004). In that algorithm, "expectation" means to calculation the
cluster probabilities which can be the "expected" class value and the "maximization"
is meant that to calculate of the distribution parameters which is the data given by the
likelihood (Ian 2005). So EM algorithm is the good way to calculate the Maximum
likelihood estimate (Borman 2004).
7
Literature Review
There are a number of applications of the EM algorithm. In some applications, the
EM algorithm is used to estimate data sequences transmitted instead by continuous
phase modulation in a multi-path channel which concludes a random phase, amplitude
and the time delay. In this situation, the simplified EM algorithm makes the
maximization of desired likelihood function about the adequate amount of training
data by some general formulation to do explicit calculation to find the estimation and
provide the likelihood (Linda & Hisashi 2002).
There are several variants can be happened when use the k-means algorithm. Firstly,
the resulting clusters can be splitting and merging. Secondly, the clusters can be split
in a pre-specified threshold and the two clusters may be merged when the distance
between these two clusters is lower than another pre-specified threshold (Su et al.
2004).
Both k-mean and EM algorithms has specific requirement for data. On the other hand,
there is another specific algorithm is to build a probabilistic hierarchical tree by the
CU (category utility) (Mi 2004). Based on the computation of the probabilities, when
the category utility is maximal, the every instance can be read per iteration from the
dataset and the algorithm can incorporates instance in to the trees by descending the
tree with four possible operations at each node (Mi 2004).
~ Put the instance into a exist cluster
~ Create a new cluster
~ Merge the best two clusters with respect to the values of CU.
Split one clusters into several clusters by lifting its children one level in the tree to
replace it.
8
Literature Review
1
/\ 2 5
1
,/\ 2 l
II\
Figure 2-3: Merge and Split Operation
The node split and merges for the higher CU situation is displayed by Figure
2-3(Nachiketa et al. 2006).
,.~ ..... -~{ //,.fnternal Node and''
l using CU(see the ', \., followmgfour }
-:,,::...'"-' .... ""« operations) ,.,_,.,...#"""" -.,,M"-'F"'"«'~~
' '
NO '
...,.....-~;_1$'.-,.,,y_,._..., ... ~
~Create New leaf-"",, / node and add both ~)
old and new to new .. · ,~·
n~~;~,~-"'{""
The condition is satisfied? 1
YES
Figure 2-4: The Definition of the Cobweb Algorithm
Figure 2-4 shows the common steps of the process of the Cobweb algorithm. The tree
in this figure can be constructed above those steps for every instance. So this
algorithm may output a massive hierarchy for large datasets (Mi 2004).
9
Literature Review
Aims for hydrological sensor web by using clustering methods
Normally, the clustering methods are used for initial implantation during the whole
process. Clustering methods can normalize or discrete continuous data. Considering
that most of complex mining methods have data requirement with nominal type,
clustering methods become the basis technique to disposal data as for the first step. In
this research, we are going to take the first step in the clustering data which is
collected by the sensors. After collecting the raw data from sensors, we prefer to use
some clustering method to help us normalize the data and build up the suitable dataset
as for the input and get ready to use more complex process or methods to manipulate
the data and find more potential relationships or patterns.
Based on the introduction of some clustering algorithms, at the first step of using data
mining methods, it is important to select suitable clustering methods to normalize data.
However, this research does not focus on comparing algorithms or creating new
algorithms so that the target is to get the nominal type of data rather than spending
time on selecting algorithms or improving specific algorithms.
2.3.2 Association rules
An association rule is to find the potential relationship between data and make the
rules to predict the class or instances of the dataset (Ian 2005). The classification rules
is similar with association rules which can be used to make prediction for class of the
dataset. However, the biggest difference between association rule and classification
rule is that the association rules can make prediction for the attributes of the dataset.
What is more, the amount of the association rules may be massive and some of them
are related so that the rules need to reduce as well (Ian 2005).
Examples of association rules
A classic example 1s the market-basket problem (Arawal & Srikant 1994). In
10
Literature Review
transactional data-sets of supermarkets, rules like {bear }-7 { diapers} means that most
of customers buy bears, meanwhile, they may buy diapers as well. Such rules will
then suggest the supermarket managers to put bear and diapers together which may
improve their profits. From this example, it can be seen that association rules provide
a useful mechanism for discovering relationships among the underlying data. Also
according to the Toivonen (1996), this kind of example can help manager to manage
the situation of related goods and help customers to find their interesting products
easily. In addition, those rules or relationships can improve the profits for the
supermarket.
From this example in supermarket, it is obvious that the association rules can help
firms to find the potential relationship in related goods and managers can make good
planning and decision to distribute goods to improve profitability. Also for the sensor
web, association rule mining can generate related relationships to help domain experts
or normal user find more information from sensor data. The input of association rule
mining is the data which requires lower quality than hydrological model and the
output is rules. Those rules describe the relationship between data and hydrological
events.
Related technology
In this research, we define a rule is defined as a form A = {L=> R}, where A is an
association rule, L and R are two disjointed item sets of event E (L n R = </J ). L is
called the antecedent of A; and R is called the consequent of A. There are two
important constraints for selecting useful rules, i.e., the minimum thresholds on
support and confidence (Jiawei & Micheline 2006). Support is the ratio of the number
of transactions that include all items the antecedent (S(L, R) = P(L)). Confidence is
the probability of finding the consequence in transactions under the condition that
these transactions also contain the antecedent (C(L, R) = P(L I R)). There is another
threshold called frequent which is calculated by support and confidence. This is the
11
Literature Review
third significant metric for association rules and named lift in association rule mining.
Lift value can be calculate based on confidence and support: lift (X => Y) = confidence
(X=>Y) I Support (Y). Only rules with lift value greater than 1 can be considered into
good rules. There are a number of learning algorithms which can find association
rules based predefined support and confidence thresholds.
The support and confidence is the key standard to justify the quality of rules. The
rules with both high support and confidence can be called useful roles. For example,
in the supermarket, the rules: bread=> rose has confidence value over 90%, however,
with less than 0.01 % of support. This means that about one of 10000 customers may
buy both bread and rose together. It is obvious that this rule is useless for its frequency.
If managers put bread and rose together, to compare with separating them in different
locations, there will be nothing happen. Also the rules with high support and low
confidence cannot be used into applications. So the key point to select useful rules is
to set up the support and confidence value at initial time before utilizing the
association rule mining and select the high level rules with both high confidence and
support values.
Relevant of the research
In this research, we do not focus on the algorithms comparison and discovering new
algorithms. Based on previous description of situation awareness, we focus on
building models to make sensor web and sensor data meaningful, thus the association
rules can become the aim to help us to find the relationship or patterns in sensor data.
Because association rules mining is focus on finding potential relationships in data
which is the target for the sensor web. So we set this method as a component of the
model. If the result or relationships are not useful for sensor web, we can find other
methods instead of association rules which can be the future work for this research.
12
Literature Review
2.4 Data Presentation and Visualization
In order to facilitate users to understand discovered knowledge, data presentation
becomes to an important issue for the success of the approach. Data visualization
tools or technologies can allow people easy to identify the interesting information or
patterns by creating dimensional pictures or views. It is the keys to help people to
make decisions about the quantities data (Soukup & Davidson 2002). According to
U sama, Georges and Andres (2001 ), visualization can be used to explore different
kinds of data, allow users to become a viewer and make complex data reification.
Also, data visualization technology can prove to explore data analysis in large
database especially when little is unknown about the data and related meaning. (Keim
2002)
It is a suitable way to combine data mining method and data visualization technology
together to make application in hydrological sensor web in this research. Data
visualization provides a flexible way to present the analysis and the result of data
mining methods. Visualized data also can make communication between users,
hydrological events and domain experts.
Data type in visualization
According to Keim (2002), different types of data have different methods to visualize.
Figure 2.5 describes the general techniques for data visualization in different data
types. It is easy to find that the data visualization methods can fulfil almost amounts
of data so that it can not only present the sensor raw data in sensor web, but also
visualize the relationship between data and hydrological events.
13
Literature Review
· \ 'isullllization Tecllnique
!rorut: Dispkw /
/ ~·'t}eomeufa;:'l!!y-tr1icn~fom1~i Di1:ifila} ,. ,.
6 a!~·z•n1i:n1/sofw.411#- ?'&anciord JD!3D Display
~drutl Projootton F 1h~ring Zoom Dlstort!ilfl Lmk&Brush
lll111.'fttttfo1111nd lli~to1till«1 ·1ed1nitlut'
Figure 2-5: Information Visualization Techniques
The target of data visualization is to simplify the complex or huge data and transfer
into friendly interface. In addition, the visualization achievement is justified the user
preference model. User preference model is always required some AI technique for
input in general (Linden & Hanks 1997), but it is not good for complex domains. So a
reasonable approach is to build communications to allow user to support the decision
and filter the information (Thomas & Fischer 1996). To build up the user preference
model can also help both domain experts and normal users to understand complex
process or information easily.
For this research, to build up the visualization interface and user preference model is
the advanced awareness in the hydrological sensor web which can help people to
make fully understand for hydrological events and the relationships. It is also the
improvement for the data mining methods and allows data mining results to be
visualized.
14
Methodology
3. Methodology
This chapter describes the methodology of the research and related approach to
implement this research. The methods of the methodology and the visualization of the
result are based on the related literature as shown in last chapter (Chapter 2). The
main highlight of this research is to use mathematic mechanism to find the potential
relationships or patterns in hydrological related sensor data without the domain
knowledge.
3.1 Introduction
The main idea of this project is to find the potential patterns in hydrological related
data which is collected by the sensors. In order to calculate and analyse the data, data
mining approaches provide the flexible ways and can be used. Data mining focus on
the data and it does not require the data with high quality which means that the data
can be not continuous which cannot make any influences to the result.
The approach of the data mining for this research was to use some of the methods to
analyse data and give the result about the potential relationships or patterns of the data
which can help people to understand the hydrological phenomenon used to facilitate
water resource management, event predictions and catchment monitoring.
The objective of this research is to develop an effective and generic approach about
the hydrological system and the approach can be used in any monitoring systems and
generated by non-domain experts or people with a little domain knowledge.
Based on the miming methods and sensor data, the process flow of this research is
illustrated in Figure 3-1. In the approach, hydrological data are transformed into a
proper format and sent to the association rule analysis component. Then a number of
15
Methodology
association rules will be generated from the rule analysis component. These rules will
be pruned and stored in a rule base. Finally, the stored rules can be presented based on
users' queries.
Data Ool lection & Normah;;:aiii:m
A'i>-'"fldatkITT Ruf<! Mining
Figure 3-1: The Process Flow of Mining Data
The scope of using data mining methods and finding useful results was constrained in
the following ways:
>- Focus on the collecting and normalizing sensor data from the sensor web and
generating the suitable format of the data before using mining methods.
>- Focus on the result of the mining data methods and analyzing the result
rather than comparing different algorithms
>- Focus on the data visualization of describing the result of mining methods
and which way the result can be used.
3.2 Building the data mining model
To compare with the other methods, data mining methods provide the flexible way to
disposal hydrological data. Figure 3-2 shows the process flow of the data mining
application. In order to build up the initial dataset, we need to set up the purpose to
select different types of data from sensor web. The initial dataset concludes the
sp.ecific spatio-temporal hydrological events or phenomenon. Then, because the
association rule mining requires input data with nominal type, we need to discrete
data from initial dataset and transfer the continuous data into nominal type and build
16
Methodology
up the nominal dataset as for the input. After that, we can use association rule to
generate related rules. Next we use the standard value of support and confidence to
justify rules. Finally we filter the useless rules and put the useful rules into the rule
base.
f ~/
Rule . Base
Hydrological Data
collection
Rules filtering
Initial \ dataset
'-~---
~-=·'"", Association I rule m ining '
Figure 3-2: Data Mining Process Flow
3.3 Implementation
3.3.1 Data collection and normalization
Nominal dataset
Based on the features about the sensor data (lkuhisa et al. 2009), the sensor data are
huge and update frequent so that it needs to find ways to collect data and build up the
suitable dataset before doing the data analysis. Therefore the data collection and
normalization are important component during the process of the mining data.
3.3.1.1 Data Collection
We are expecting to discovery spatio-temporal patterns in different hydrological
events. In this research, we are particularly focus on analysing the relationships
between rainfall events and other phenomena, e.g. , humidity, air-temperature, etc.
To achieve the purpose, firstly, we record the maximum or minimum values of rainfall
events from the sensor web, and extract them into a new database. To record the
rainfall event, there are two different sensor records as shown on Figure 3-3 . The left
17
Methodology
one sensor records the rainfall as a big bucket and the value will be recorded and
improved when rainfall happens so that the value of this sensor is always increased
until it reset. To catch the maximum value of rainfall in this sensor, we just calculate
the time gap between each changed value and the lower the time gap is the higher the
rainfall value is. Based on that phenomenon, we can catch the maximum rainfall value
time.
The right status is another type of sensor to record the rainfall value. At this moment,
this sensor records the value like a small test-tube which has small capacity, when this
test-tube catch the rainfall inside, it will update every settled time and record the
rainfall value at that time. In order to catch the maximum value in this sensor, we can
record the maximum record value and the maximum update frequency.
_ R!M7499020TBU(mm}
,r- ::.,~
A <<~ 'i
" '~,:.¥.:!L' {,;
Figure 3-3: The Two Different Types of Sensor Record About Rainfall Event
After recording the maximum time of the rainfall, in addition, to simplify the
implementation, we also assign an index to each location and phenomenon type. Table
3-1 describes index values for different location and phenomenon types. In this table,
location 1 represents the locations that a corresponding phenomenon may happen (e.g.
Humidity). Location 2 represents the locations of rainfall events.
Location 1 Index (J) Phenomenon Index (K) Location 2 Index (L) English
BenLomond 1 Humidity 1 Town 1 Road
Story Creek 2 Air-Temperat
2 Valley
2 ure Road
18
Methodology
Ben Ridge 3 Evaporation 3
Hogans 3
Road Road
Avoca 4 Transpiration 4 Mathinna
4 Plains
Tower Hill 5 Wind-Run 5 Tower Hill
5 Road
Table 3-1: The Description of Location and Phenomenon
3.3.1.2 Data normalization
From Table 3-1, we can calculate the time gap between the maximum value or
minimum value of a rainfall event and another event within the same day by using
Equation below.
Figure 3-4: Formula of Calculating Value Gap
Then, we can get two sets of time gaps which indicate the time differences between
the rainfall event reaches the maximum value and another event reaches the maximum
and the minimum values. The time gaps are described in a continuous data type.
However, association rules can only take nominal or ordinary data types as inputs. To
satisfy this requirement, the continuous values need to be transferred to a nominal
style. Here, we use a simple clustering technique to achieve the conversion as
described in chapter 2. Figure 3-5 shows the clustering method we used. The method
transfers the continuous values into nominal items by generating different clusters.
Each cluster contains a range of continuous time gaps. For instance, the cluster
(Max_Gap(0_ 4)) covers the time gaps between 0 to 4 hours.
19
Methodology
c. Normilizing
~ c. .. ..
1:1, !;>, ... ... E E i= i=
Items Items
Figure 3-5: Transfer Data into Nominal Style
After that, we finish collecting and normalizing hydrological data and build a new
!database to store the nominal data. This primary database concludes the gaps between
rainfall and other phenomenon, the locations and the phenomenon. The next step is to
select a suitable platform or workbench to analyse data by using association rule
method and find the rules about the relationships or patterns in hydrological event.
3.3.2 WEKA workbench based analysis tools
We described the processes for data pre-processing in the previous subsections. After
that, we need to operate association rules mining on a platform or workbench. There
are a number of data mining tools which can conduct association rule mining. In this
project, we select WEKA workbench to analyse hydrological data.
WEKA(stand for Waikato Environment for Knowledge Analysis), was developed at
the University of Waikato in New Zealand, provides a uniform interface to lots of
algorithms for pre- and post processing and for evaluating the result of the learning
schemes on any dataset(Ian 2005). This work bench can not only provide algorithms
to analyze data, but also allow the researcher to advance the existing algorithms or
implement the new algorithms without considering any support infrastructure(Mark et
al. 2009).
To compare other platform, WEKA provides an open interface which allows users to
manipulate it easily and flexible support algorithm enhancement. We can achieve our
20
Methodology
target via WEKA workbench by using existing functional interface. In addition,
WEKA provides opportunities to make enhancement of existing algorithms which can
improve the continuing work of this project by using data mining methods in
hydrological domain.
3.3.2.1 Data preparation for WEKA
WEKA requires a specific input file format named ARFF which is the best format. It
also allows other formats of data file like .csv or C4.5 (Hall 2009). To analyze the
hydrological data, we select to convert the normal data from database that we created
in previous section into ARFF format file. There are three major components in ARFF
files. The label "@relation" is used to make a simple description about the dataset;
"@attribute" refers to the titles of each item; "@data" means that after this label all
data are values of items in transactions.
In order to using WEKA workbench, we need to transfer the normal data format into
ARFF file. As shown in Figure 3-6, the data in brackets (after each attribute) indicate
the possible values of that attribute. The data after the label "@data" are values of
items in transactions. At this moment, we set the initial dataset with five attribute, the
location, item, compare _location refer to Location 1, phenomenon, Location2 that
shown in table 3.1. The max_gap and min_gap refer to the time gap which is
calculated by previous formula. Then each instance describes the values of each
attribute and every time gap value of instance is based on same day. For example, the
first instance means in Ben Lomond, the maximum value time of wind speed is 6 and
half hours earlier than the maximum value time of rainfall in Rabbit Marsh. Also, the
minimum value time for wind speed is 4 and half hours earlier than the maximum
value time of rainfall in Rabbit Marsh. In addition, both gap happened in the same
day.
21
Methodology
/ test_data.a<ff X L
~ I I I I I I I I 1,0, I I I I I I I 2,0, I I I I I I I 3,0, I I I I I I I 4P, I I I I I I I s,o, I I I I I I I 6,0, I I I I I I I 7,0, I I I I
!. \thi!' de.t:..a.!!e-C i~ to. rcco·rd 'the tir!e: -gap between ~om.c item.;!!; ~d xa.ind:all. 2 :\all data is based en :C: 4 hours !Series 3 @relation ' wate~'
4 @aetr.ibute: lo.cation {3enI..c:n.cnd,St::orysCreek,BenRidgeRoa:d,Avoca ,IowerHill, Co x 5 @attribute i.cem {Wind3peed., ii.Ul'!'.idity, airterr:peratare 1 evapcrat.iO!l.1 trar.:spiratio ~ @actri.Oute compare_1ccation {3enLamcnd,Story~Creek,BenRidqeRoad,Avoca , Tower
7 @attribut.e m?S X_Qap real a @attribute lti.n_9ap real g @data
10 BenLcn;.cnd, ~indspee:d , Rahbi tMarsh, - 6 . 5, -43 • S 11 Ben.L;::mcnd,.Hu.m.id.iey, Rabbitilar~h, 6. 5, 10. S 12 BenLcs;.ond, airterr.perature, Rabbit.Ma.:!"sh, 2. 5 , l 13 Be~Lc~cnd,Hu."1ll.dit.y,Rabbiti-Iarsh,l1 . 5,-3
14 St:ory~Creek,. airt~peratu.re: , Rabb-ict4arsh, 2 . .5, 11.:. 15 Sto::ysC::eei<:. evapcr·aticn, Rabbit.Marsh , 0 . 5, 0 lt St:~ry!J-Cre.e.k,.t.=:an~piraticn.RabbitMarsh,-3 . .5,4 . .S 17 BenRidgeRcad,HUJT~id1ty,Rabb1t~~rsh,4 . 5,D
18 BenP.idoe.Road, airt.empe.=:atu=:e, Rabb.:..t1!1arsh, O, 1 19 Ben:RidgeRcad,evapcracion,Rabbit.Ma:reh, - 0.5,D 20 Be nRidge.Road, tran.3pira<::ion, Rabbit.Ma:reh, 4 . . s, -12: . 5 21 ;:..voc.a,H;.i'nidity, RabbitMa::sh, -5. 5, 3
Figure 3-6: The Initial ARFF File in Hydrological Data
3.3.2.2 Using Explore to input dataset
There are four components in the initial interface of WE.KA, based on the description
of chapter 2, for this research, we prefer to use Explorer to generate the association
rules. Figure 3-7 displays the initial interface of setting input value in WE.KA explorer.
The WEKA Explorer provides a friendly interface to realize different kinds of data
mining methods (e.g. Classify, Association rule. etc). To manipulate data mining
methods in Explorer, we need to select suitable dataset and select target methods and
set up the target options of method. Then we can get the result of analysis on that
interface directly.
(tuT"u1.t rslst·io11
:RtlatiO:ll . ••hr 1r.:n.-.:M.i : t.tJ
.\H:-ilratn · 5 Smt of u i U,ts . ·~3
Stltch.i slln\n.1.lt
lrlll'lt : COl!!J&rt_louli oti. !1huin,: 0 (Cl) Dufrnct : 1'.
Iypt : !f°"'in.al Uni-;ut : {} ((:'%)
22
Methodology
Figure 3-7: Interface of WEKA Explorer
We had made the ARFF format dataset file so that in this step, we just open the ARFF
file we had made and select the target method to make analysis. For this research, we
select the association rule method to analyse data and find the relationship or patterns
in hydrological data. WEKA Explorer provides several algorithms to support and
analyse the association rules. Based on the discussion of Chapter 2, we select the
classic algorithms called Apriori algorithms. Because we are not focusing on the
algorithms analysis which can be another big issue to discuss and analyse in this
project, we just select one of the normal methods which are suitable for hydrological
data analysis. Before we generate rules, we need to set up related options by using
Apriori algorithms. For this experiment, we set the support threshold as 10% to 100%,
and the confidence threshold as 0.5. Due to the features of hydrological data, the
confidence threshold was not set to very high.
3.3.3 Generate Association Rules
For this implementation, in this research, we build up the dataset that concludes 443
instances to generate related rules. During this process, we input the 400 instances to
generate rules and set up 43 instances (10% of total) to evaluate rules.
3.3.3.1 Rules filtering
As we mentioned in chapter 2, there are two important attributes, which named
support and confidence, used to justify the quality of rules. After generating the
hydrological data, we had got lots of rules to describe the relationship of the
hydrological events. Different rules have different support and confidence value.
Figure 3.8 shows the distribution of confidence and support for 17 rules. The X axis is
the confidence value and the point is located into the related support value. The total
number of instances is 443 and Y axis shows the support value. For instance, the
23
Methodology
support value is 125 means that there are 125 instances satisfying the rules. We
filtered the rules which have big gap between support and confidence value.
We had known that the rules will be useless if the gap between support and
confidence value is large. For example, the rule: {(item = rainfall, location =
StorysCreek) =>(compardlocation = Avoca, max_gap = [3-7] )} have 100%
confidence value but with 1 of support value which means in 400 instances, only 1 of
them support this rules. It means that this phenomenon is the rare event and the
related rule is useless.
Distribution
250
150 -i--~~~~~~~~--c,--~~~~~~~~
'*' '*' • 100 -i--~~~~--~~~-~-~--•,,...-___,,~-~~~~-•••
+-Support ,
50 --------- --- -- ---------- ----- ----- ------------- -~----- ---- - --- --------- - - --
"*' '* dL. 0 ..+--~~~~~~~~---r~~~...---~__;:;iw-~.........-,
0% 20% 40% 60% 80% 10'0% 120%
Figure 3-8: Distribution of Support and Confidence
After the filtering all of rules, we have got 10 rules as for the result which have more
than 1/3 of total instances value support. Then we use another 44 instance to justify
the rules. Table 3.2 shows the result and evaluation of association rules generation.
Rules Accuracy ( 40 instance evaluate) Total number 10 Average 80% With confidence (> 80%) 4 85% With confidence ( <80%) 6 75%
Table 3-2: The Result of Association Rules
24
Methodology
3.3.3.2 Rules analysis
After the generation, we get some rules for hydrological data. The rules are
basically related with phenomenon types and time gaps. Based on identified rules, we
can find the relationships between some specific phenomena (e.g. rainfall and
humidity), and furthermore, make some predictions on hydrological events. Here is
the following are some typical rules examples:
Rule= {(max_gap:[3-7]) =>(item: humidity)} conf 0.83
This rule means that regardless the location, the humidity should get the maximum
value 3 to 7 hours later than rainfall get the maximum value. For instance, if a rainfall
event gets the maximum value at 9:00 am, the humidity will get the maximum value
between 12 :OOpm and 16: OOpm. The support of this rule is the 20% of max _gap and
40% of humidity. This rule has the high confidence with 83%. So based on this rule,
we can find that the phenomenon types and time gaps between humidity and rainfall.
In addition, we can make prediction about the maximum time of humidity via the
maximum time of humidity.
Rule= {(min_gap:[(-5)-(-2)]) =>(item: airtemperature)} conf 0.675
This rule shows the relationship between minimum value time of air temperature and
the maximum value time of rainfall. It means that regardless the location, the air
temperature can get the minimum 2 to 5 hours earlier than rainfall get the maximum
value. The related confidence is 67 .5% which is lower than previous rule.
Compare with those two rules, they have similar content and different confidence
value. Both of them are regardless the locations and focus on specific phenomenon
with related values (max or min). Then, the different confidence means the different
probability. It is obvious that the relationship between humidity and rainfall 1s
25
Methodology
stronger than relationship between air temperature and rainfall. Besides, there are
some other related rules that conclude relationship with locations.
Rule: {(Item= airtemperature), (location= BenRidgeRoad) => (max_gap[l.5-9]) conf
0.62
This rule has built up the relationship in most of attributes: location, item and value
type. !he detail information of this rule is that in Ben Ridge Road, the gap between
air temperature of maximum value and rainfall of maximum value is 1.5 to 9 hours. It
is easy to find that the range of time gap is large than previous rules because the
related items in this rule has expanded. Also, the confidence value becomes lower
with only 62%. In addition, this time range (1.5-9) almost covers 1/3 daytime which
means the if we want to get more accurate value from this rule such as make the time
range smaller, the confidence will decrease sharply even lower than 40%. So this kind
of rule is not as good as previous even it concludes more items.
In conclusion, based on that dataset, most of rules can build the relationship between
phenomenon and value type with different confidence. However, if the rules contain
more items, the confidence will become lower and the time range will become larger.
It is obvious that those rules with more items and high confidence are more useful and
meaningful. To improve the quality of result, there are several ways: improving the
algorithms or change the input dataset. In this research, we are not focus on the
comparing related algorithms, thus we prefer to improve the input data set.
3.3.4 Rebuild the dataset and find more useful rules
In order to generate more useful rules, we prefer to rebuild the input dataset. There are
five attributes in the initial dataset: two parts of locations, one item and two value type
(max and min). It is hard to build the stronger relationships between locations and
items. Then we reconsidered the component of the initial dataset and we found that
26
Methodology
the locations are distributed separately in this dataset. For example, the same
phenomenon in the same terrain location can get the similar maximum or minimum
value time. Besides, the different location in different topography shows the different
status in same phenomenon.
Figure 1-2 describes the distribution of locations in hydrological sensor web. In that
Google map, different colours mean different topography. It is easy to find that
locations are distributed in different topography. So we can combine the same
topography locations into same group and generate several groups of locations to
build up new dataset. Then we can generate new rules to find whether the new dataset
can show stronger relationships with locations.
3.3.4.1 Build up new dataset
According to the distribution of locations, we can build up new dataset with additional
attributes. In this project, we divide those locations into four different groups due to
the topography. Based on table 3.2, we generate two new titles named locationlarea
and location2 area to represent the meaning with groups of locationl and groups with
location 2. The four level group of location are generated by the altitude of the
topography. The highest locations are named as area 1 and lowest locations are named
as area4. Figure 3.9 shows the new version dataset which concludes 7 attributes.
27
Methodology
1 ti:ti.s, data~et ~~,to repor~ t~e-~~ gap becwee~~~~.e,,items :i. %a1.1 data is based on 24 houra 9e.::-ies 3 @re:2a~ian 'water' 4 @:attribute lcca-::ian {BenLoir.ond,. S...::c=ysCreek,. BenRidgeRoad,Avoca,. !'owe=H:ill, CcyJ 6 @attrib~te item {W1ndspeed,Humidity,airtempe=atuxetevapa=aticn,transpiratio1 1S @at tribcte compare .location tBen~omcnd, StorysCreek, BenRidqeRoad,AYoca, J'OT;.;er~ 7 @attribute max_gap-real a @attrib:>te mir. _ga>0 real
i 9 @attribute: lcca-=:ianlarea {1, 2,. 3, 4} 10 @attribute lcca-::icn2area {1,2,3,4} l1 @data :2 BenLamand,~indspeea,RabbitMarsh,-6.5,.-4.5,3,3 ~s B~~LQmo~d,Humidi~y,RabbitMars~,6.5t10.5t3,3
:..--1 Be:::::!.o:mcnd, a.irtern.perature., Rabbi tHarah.I' 2 .. 5, 1, 3, 3 :..s Ben:.cmond,!itud.dity,RanbitMarsh,.11.5, -3, 3, 3 ~6 5tcrysC=eek,air~emperature,Rabb1tMarsb,2.S,1:.5,3,3
17 SccrysCreek, evap-orat1on, Rabbi tl.,.1arah,. O. 5, C, 3, 3 ::e StcrysCreek,. trar . .3p1..ra~l.o'!2, Rabl:-it!4a:sh.1 -3 .5, 4. S, 3, 3 l9.EenRidgeR~ad,Hurr.idityrRatb1~Marshr4.5,0,:,3
20 Ber:.R1dgeRoa.d, airtempera.t..ire, Rab.nitM.arsh, C, lr 1, 3.
21 Be~RidgeRoad,evapcra~ion,Rabbi~Marsh,-0.5,D,1,3 22 Be!"' .. RidgeRoad, transpiration, RabbitMa:rsh, 4 .5r -12. =:, .1, 3 23 Avoca,Hum.idity,RabbitMazs~,-5.5,3,4,3 24 Avcca,airtempera~~=e,Rabhi~Marsh,2,-11a5,4,3
Figure 3-9: The Advanced Version of the Input Dataset
As shown in Figure 3.9, location areal and location area 2 are nominal data type so
that they do not need to normalize before using association rule mining method. After
building the new dataset, we can use the WEKA to implement new rules.
3.3.4.2 New rules analysis
At this moment, we still set up the support range from 1 % to 100% and the confident
threshold as 0.5. The amount of the instances is still 443 and 43 (10%) of them are
used to evaluate the result.
After the generation, we got several rules. Some of them are same with previous
because the initial attributes and instances are not changed. Besides, we got some new
rules to describe the relationships between location areas and other items. Here are
some successful rules:
Rule: {(location2area = 4), (max_gap
airtemperature)} conf 0.71
[3-6]) => (locationlarea 3), (item
28
Methodology
In this rule, it shows the stronger relationships between locations and other items than
previous rules. In addition, the time range is smaller than previous as well. This rules
means if one phenomenon in one location gets the maximum value 3 to 6 hours later
than maximum value of rainfall in location area 4, the location will belong to location
area 3 and the phenomenon will be the air-temperature. The confidence is 71 % which
shows that the probability is higher than previous rules.
Compare with two rules in different datasets, we can find that the second version rules
concludes the first version rules, also the second version rules are better than first. The
new version rules can make the connection in location, item and value type with high
confidence. Then we also summarize ten reasonable rules to evaluate (we pruned the
same rules in previous result). Table 3.3 shows the result of evaluation.
Rules Accuracy ( 40 instance evaluate) Total number 10 Average 70% With confidence (> 80%) 3 73% With confidence ( <80%) 7 67%
Table 3-3: The New Result of Association Rules
It is easy to find that the accuracy of this rules are lower than previous. However,
those rules can show the stronger relationships in hydrological data and they are more
meaningful than previous rules. It proves that if we want to get detailed rules, we
cannot get the rules with high confidence.
The association rules mining are related the input dataset and it has some limitations
about the size and content of the dataset. Also the accuracy of the rules are related the
quality of datasets (Carlos 2006). So the dataset can make the influence for the quality
of the rules.
29
Discussion
4. Discussion
This chapter discusses the applications of patterns which are generated by the
association rule mining approach introduced in Chapter 3. In this chapter, we will also
introduce the feedbacks of our approach from some domain experts in the Bureau of
Meteorology. Finally, we will introduce how the discovered patterns are visualized in
our approach.
4.1 Introduction
After generating the rules by using the association rule mining, we invited some
domain experts to provide some feedbacks. Facilitated by the CSIRO Tasmanian ICT
Centre, we contacted and made an appointment with some experts in the Bureau of
Meteorology to discuss about our approach and the rules we discovered. These
discussions were motivated from three aspects:
• To find the objective ways about the association rules in hydrological events
by discussing with domain experts.
• To find the objective ways about the association rules by comparing the
difference between traditional hydrological model and association rules
mining method.
• To find a suitable ways about visualizing rules in order to make rules easy to
understand.
All the discussion of the result must be understood in the related of sample way, thus
it also is indicative rather than conclusive. The conducted evaluations of this project
have several limitations. If we want to justify the usefulness of the rules, we will
find some potential users to test and verify the rules and give the feedback. However
it is too hard to form a user group for this project due to time and resource
limitations. To overcome this limitation, we conduct several discussions with some
30
Discussion
experts in the water domain, and ask for their opinion about the approach we
developed, and their preferred way to present the discovered patterns.
4.2 Discussion with domain experts
Based on the findings in Chapter 3, we know that the rules can be used to predict
some hydrological events based on the occurrences of rainfall events. We need to
evaluate these rules and find whether they can be used into the real world. Therefore,
we ask for feedbacks from two domain experts in the field of hydrology, i.e., Mr Chris
Macgeorge and Mr Simon McCulloch from the Bureau of Meteorology.
4.2.1 Meeting with Chris Macgeorge
Mr. Chris Macgeorge is an expert in the Bureau of Meteorology who is in charge of
flood predictions and warnings. During the meeting with him, firstly, he introduced
some traditional models to approach forecasting. Generally if he focuses on the
forecast of a particular event, large amount of data that contains related data of that
event in a long term (more than 10 years) are required to be collected to build and
calibrate the forecast model. The model is normally based on complex mathematic
formulas.
Figure 4.1 describes the processes of his forecasting approach. In his approach, he
normally has a specific purpose at from the very beginning. Then, toward the purpose,
he starts to design and develop a model. After that, large amount of related data are
required to be collected from some particular locations. Obviously, the development
of such models is complex and requires long term periods of data with high quality. In
addition, such models are hard to be maintained and operated as they are normally
based on complex mathematic formulas or methods. These models can generate very
accurate forecasting result.
31
Discussion
Figure 4-1: Simple Process of Traditional Data Model
In the meeting, we also introduced the rules that were generated by association rule
mining. Comparing with the domain model, our rules are not powerful enough to
generate forecasting in high accurate levels. However, Chris mentioned that data
mining methods could be another suitable way to analyse data because this model is
only focus on data itself. It means that if we can provide a wide range of data as input,
data mining method can generate stronger and accurate result.
Finally, he suggested us to have discussion with another expert, i.e. , Simon
McCulloch who is in charge of modelling application and data analysis. He is also
familiar with mathematical modelling, thus he can give us more feedbacks about
related rules.
4.2.2 Meeting with Simon McCulloch
Mr. Simon McCulloch works in the Bureau of Meteorology and he takes charge in
data analysis and observation. He was interested in our findings and gave some useful
feedbacks and suggestions.
His work is normally to trace meteorological events and analyse data by using some
domain methods. The system can trace and monitor meteorological events via images
transferred from radars. During the discussion, he gave some suggestions about the
32
Discussion
extension work for using data mining model. He listed some potential relationship
between events and phenomenon. For example, weather situations are related with
rainfall and temperature. They may indicate us to improve the input data quality for
mining in future work. Besides, he mentioned that there are two important
components during the analysis process: analysis_ result and the visualization. It is
obvious that the results will mean nothing if they cannot be understood by domain
experts or users.
He also suggested us to describe rules in a user friendly way rather than mathematic
statistics. Then he showed us some models and applications in the radar monitoring
system and the data collected from radar. In their system, users can easily find the
details for every specific phenomenon. And it provided an open platform to allow
users manipulating specific data and analysing data.
For our findings of that rules, he said it could be a good way to build a suitable
interface to show how it works, then users can find the value of that rules easily.
These comments indicate us that we should focuses on combining data mining models
and data visualization together in the approach.
4.3 Implication for comparing different models
After the discussions, we have got some useful feedbacks from domain experts. Those
feedbacks can be summarized as follows:
+ Our findings cannot reach to the high accuracy by comparing the prediction or
forecasting in hydrological model.
+ Data mining is a good way to analyze hydrological data with its low data quality
requirements. If we can improve the quality of dataset as for input, the mining
model can generate more accurate results.
+ It is important to visualize rules to improve the level of understanding which can
prove the usefulness of rules
33
Discussion
4.3.1 Traditional hydrological model
In the water domain, hydrological data are normally manipulated and interpreted by
traditional hydrological models. These models are very powerful in stream forecasting,
water quality monitoring, etc. However, most hydrological models have some
limitations which can block their applications.
Firstly, Most of hydrological models ask for high level domain knowledge to generate
the structure and analyse data. It means that only the domain experts can use this
method to analyse data. However, from now on, more users prefer to use hydrological
prediction service without domain knowledge. According to National Research
Council (2006), there are many users groups requiring the hydrological prediction
events such as farmers, schools, governments. In addition, most of users have less
domain knowledge. If there is no domain support, they cannot analyse the data and
get the result.
Then most of hydrological models require high quality data. Due to the particularity
of water related data, the real water data ask for the integrity and veracity. The data
also need to manipulate accuracy during the process.
Finally, Most of hydrological models have the specific region like different locations
and topography. Also different climatic zone can influence the hydrological events.
4.3.2 Data mining approaches vs. hydrological model
In general, traditional hydrological model can generate more accurate results but with
very high costs. For example, the hydrological model can give predictions about a
specific location and phenomenon, e.g., water level in location A will increase by
1 Ocms in the next 10 hours. However, these results are generated based on high
quality input data. In addition, these models may require over 10 years' data for
34
Discussion
calibration and training purposes. Such requirements will lead to very high cost on
data preparation, and greatly limit the applications of the models. On the other hand,
data mining methods have much lower requirements on data quality. Most data
mmmg methods are purely data driven. Namely, the discovered
information/knowledge is based on the data availability. Hence, even without long
period and high quality data, we still can find some useful patterns from the data.
Meanwhile, these patterns are easier to understand for general users without too much
domain knowledge.
Also our findings in data mining model are shown by mining software which is the
simple characters and statistics, thus this research can indicate combining data mining
results and visualization together to describe rules and allow user to understand rules
easily.
4.4 Rule Presentation
In order to facilitate users to understand discovered knowledge, data presentation
becomes to an important issue for the success of the approach. Data presentation is a
suitable way to transfer the obscure data information into perceptible information
(Pittelkow & Wilson 2009). A good presentation can let data easy to understand and
provide clear information from data to potential audience. Data presentation also can
prune the redundant data and provide the friendly interface.
In this research, we focused on the presentation of discovered association rules and try
to embed the presentation with the current interface of the South Esk Hydrological
Sensor Web. Also that can indicate the combination model about the data mining
methods and data visualization function.
4.4.1 Interface design
According to Chapter 2, data visualization can help users to understand information.
35
Discussion
In this research, in order to describe the rules clearly, we built up a tree view structure
to describe the relationships between locations, sensors and phenomenon types. As
shown in Figure 4.2, in this structure, locations, sensor types and phenomenon types
are shown clearly. So it can also help the users to understand the relationships
between observed phenomenon types and locations.
~i~~4 Sa~'m Es~ ~~t;~~l,w!~: ~!~t~:~t~~r¥f;1z,;~t?> W""lHfii#~ft~nfi~1CJl~{¥vl~~j; ,~Pr~%\\:?~;:{",~~ --~~ 'w ~:, 3~);':'Y\\~i~ ~ - <
automatic V\'ater sensor
Ben Lornond , Stor-ys Cn=-E::::k
Ben Ridge Rc•ad Tc·"'""'- Hill CoxRoad St Patrlckshead
English Towf) Rp'ad
~' ,MathlnnaPlalnsRoad
Toi lochgon...JmFann
·Vall<>y Roa 0.::l ·
'HogansRoad
seietr& >>
Figure 4-2: Tree View Structure of the Sensor Web
· r·a1nfall batt~r-yv~
1·a1nfall batteryv<
rainfall
•
b ::ittervv·c ;
rainfall batter\.-vr
r·ci1nfall ' battar-yvr ,
1-alnfall · batte1~y vc ~
rainfall batteryvc '
Comparing with the initial sensor web (See Figure 1-2), this structure is more clearly
to show the relationship between locations and phenomenon. Also people can see
details information by clicking every node of this tree view interface. To connect with
the initial sensor web, we created the channel to show the details of locations and
status in related phenomenon as shown in Figure 4.3. The red-signed location is
signed by clicking related location and the status belongs to selected phenomenon (e.g.
humidity). Also that can help people to understand the situation information in
specific phenomenon and location.
36
Discussion
$i)
m:
'1'~
" $() '
3
Figure 4-3: Related Location and Status in Specific Phenomenon
To present the rules, we provide a set of selection boxes to allow users to choose their
preferred rules followed by Figure 4.4. Based on the "maximum value time" of a
rainfall event, there are four selection boxes to allow users to select locations,
phenomenon types, value types and time gaps. When a user sets three attributes from
any three boxes, the value of the other box can be generated based on the association
rules in the rule base. For example, if the rainfall gets the maximum value at 2: 00 am,
and we select the location as BenLomond, phenomenon type as Humidity and the
value type as Maximum Value, then the rules will be invoked and it can be found that
the maximum value time will at 21 :00 pm in previous day. In addition, the application
will give the confidence values of the association rule as well. This application of data
presentation is the extension of the rules from data mining approaches which can
visualize the rules and make rules easy to understand. In addition, it also provides a
friendly interface to let user can manipulate rules and know how rules work in sensor
data.
37
,z-,;----------------The StorysCreek Rainfatl have max value at
Pl~ase select at least One ~ptions:
Humidity .s.houtd get the Maxtmum value at:
21 O'Clock at previous day
Figure 4-4: Application of Rules from Data Mining
Discussion
Figure 4.5 is the integrated interface for visualize data mmmg rules and the
conception of the hydrological sensor web. It has functions that indicate people to see
and search related locations and phenomenon in SE sensor web. Also that allows users
to manipulate the interface to understand rules.
.automatic water sen&.."Y
Ben Lamond Storv:;Cre€k Ben Ridge Road Tower Hill Co>Road St.Patrlckshead
Engli5h Town Road .RaLi:>it Marsh Matl1ir·naPlainsRoad
rain-related -.·.iatBr sensor -...& Tower Hill Road TollochgorumFi'lm
· Valley Road · K~'1ansRoad
.Detia!
ltie: .Storys:Crear. R&ntall M¥"emax \'alue at
:§L! P.fi&$JS&~1 tlftl:m~i('<f:-i'l':<JQtl~a~
[~~~~~~~~=.:.1EJ E~~~~i.~=~]~J I~~~~::~J:!J
'H--.1.mumv sl!W~ >;et ee Maxtm>..lm "-"loo. ;,t:
21 O'"Clocfl. et Pltt'rioUS day
Figure 4-5: The Integrated Interface in Sensor Web
38
Discussion
4.4.2 Implementation
This visualization interface was developed us mg the Java language and XML
formatted file as for related database. Java is used to develop the interface and parse
XML formatted file to get the data from XML-based database. XML file is used to
store the sensor information and rules details.
To show the tree view of the sensor web, we prefer using the tree visualization toolkit
for the frame and invoke data from XML file to build the structure. Then we built the
connection between location map and status phenomenon with the tree structure. After
that, we parse related rules which stored in XML file and show the descriptions in that
interface.
To select and generate rules, we designed a simple algorithm to invoke rules based on
users' selection in the interface. Figure 4.6 shows the rules selection. Users can select
their interested target via selecting different items in that interface (e.g. select location,
item and leave value type and time unknown). Then server part can receive the action
for users and match related rules in database. If the rules match perfectly, server will
obtain related rules and describe rules in that interface by prediction function and give
the confidence to show the probability.
To match related rules, we design a simple algorithm to search and invoke. We set up
a rules filter to divide different rules in different groups which conclude location-base,
item-base, valuetype-base and valuetime-base. For example, rule
{(min_gap:[(-5)-(-2)]) => (item:airtemperature) } belongs to item-base,
valuetype-base and valuetime-base. If the system receive the related information
which concludes one of them from users, this rule will be invoked and displayed in
the related interface.
39
Discussion
M~tch Matched Rules \ !
\ ~--,"~·,-~ i
"'', f XML formatted ' / '';.-::. ,.
i file ":-,.,w_,.,~_..,,
Figure 4-6: Rules Selection Process Flow
Unfortunately, we currently do not have a user group to evaluate the interface in the
approach. However, we had built a combination model to show the result of analysis,
and this interface provides a suitable way to describe our findings. In addition,
comparing with initial hydrological web, this interface allows users to manipulate
related data and events and they can get the good feedback via select related items.
Such interactions can filter out useless rules based on users' preferences.
40
Conclusion & Future Work
5. Conclusion & Further Work
This chapter shows the goals of this research and the review of the contributions in
related work. It displays the future work for enhancement of the research as well.
5.1 Conclusion
Event prediction plays an important role in the water domain. In this research, we
focused on using data mining approaches to achieve knowledge discovery. We select
the association rules to analyse the relationships and patterns among different sensor
observations on the Hydrological Sensor Web. Compare with traditional hydrological
models, data mining methods have lower requirements on data quality and domain
expertise. In this paper, we described how to collect and prepare data from the Sensor
Web, and use a specific data mining workbench to achieve knowledge discovery by
using association rules. In addition, we also introduced the presentation of discovered
rules.
As mentioned in chapter 2 and chapter 3, this research focuses on the mode ling work
and finding relationships between hydrological data and events. The following data
mining model can provides the flexible way to find the potential relationship in
South-Bek sensor web. The generated rules can make simple prediction in specific
hydrological phenomenon. Although those sorts of rules are not as stronger as
traditional hydrological model, data mining methods require lower data requirement
and less domain knowledge. In addition, it is easier to improve data mining rules by
considering the dataset or related algorithms.
Then in chapter 4, we discussed the application of the rules and provided the interface
to visualize rules. We combine the data visualization and data mining model together
to present rules and help users to understand related rules easily.
41
Conclusion & Future Work
In conclusion, to compare the traditional hydrological model, based on the features of
South-Esk sensor web, we provide the specific model to disposal hydrological data.
To improve the application of sensor web and indicate the usefulness of hydrological
data.
5.2 Contributions
The greatest contribution of this research work is to build the specific model to
disposal sensor data and find relationships between hydrological data and events for
South-Esk sensor web. This research use data mining methods to find relationship and
data visualization technology to present the rules and enhance the usefulness in the
sensor web. In addition, this research had built up the user friendly interface to
visualize rules and South-Esk sensor web which can allow users to manipulate demo
to understand rules and phenomenon easily.
5.3 Future Work
Based on the model we had built in this research, the future work of this research will
be the investigation of the application other data mining in environment monitoring.
In addition, we will improve the integration of data mining model and data
visualization technology to present data more clearly. Also we will combine the
monitory system and web technology to present sensor data more comfortable.
For the improvement of the data mining methods, we will pay more attention to
selecting more data mining methods and build different dataset. The potential
relationship purpose could between wind and water related events. We will take time
to compare existing algorithms and select or improve algorithms to achieve more
useful relationship or patterns.
For the data visualization, we will focus on developing user friendly interface on web
42
Conclusion & Future Work
environment and allow users manipulate online. We will embed functional parts such
as WE.KA Explore or other data mining platforms to allow users using data mining
methods to realize their interesting information.
We will improve the modeling work for sensor web data and set this kind of model
into widely use such as in meteorological and hydrological domain.
43
Reference
6. Ref ere nee
AK. JAIN, MNM, P.J. FLYNN 1999, 'Data Clustering: A Review', ACM Computing Surveys, vol.
31, no. 3, pp. 265-323.
Anthony, H & Vmny, C 2004, 'Route profiling: puttmg context to work', Proceedings of the 2004
ACM symposium on Applied computing, ACM, Nicosia, Cyprus.
Arawal, R. and Srikant, R., (1994), "Fast algorithms for mining association rules in large databases",
Proceedings of the 201h International Conference on Very Large Data Bases, SanFrancisco, pp 487-499.
Borman, S 2004, 'The Expectation Maximization Algorithm A short tutorial'.
Carlos, 0 2006, 'Comparing association rules and decision trees for disease prediction', Proceedings
of the international workshop on Healthcare informat10n and knowledge management, ACM,
Arlington, Virgmia, USA.
Chong, L, Feng, W, Jun, S & Chang Wen, C 2009, 'Compressive data gathering for large-scale wireless
sensor networks', Proceedings of the l 5th annual internatzonal conference on Mobile
computing and networking, ACM, Beijmg, China.
Committee to Assess the National Weather Service Advanced Hydrologic Prediction Service Initiative,
National Research Council, Toward a new advanced hydrological prediction service, Nat10nal
Academy of Science, 2006
Durso,F & Sethumadhavan,A., 2008, Situation Awareness: Understanding Dynamic Environments,
Human Factors: The Journal of the Human Factors and Ergonomics Society, pp. 443-448.
Endsley, M.R., 1995a, Towards a theory of situation awareness m dynamic systems,
Human Factors, Vol. 37, pp. 32-64.
44
Reference
Garsofiky, B., Schwan S., & Hesse, F. W. (2002). Viewpoint dependency in the recognition of dynamic
scenes. Journal of Experimental Psy-chology: Learning, Memory and Cognition, 28,
1035-1050.
Hall M., 2009, The WEKA Data Mining Software: An update, 'SIGKDD Explorations', Vo! 11, Issue 1,
ppl0-18
Han, J. (1998). "Toward on-lzne analytical mining in large databases," ACM Sigmod Record, 27(1)
97-107
Ian H, EF 2005, Data Mining Practical Machine Learning Tools and Techmques, Second Edition edn,
MORGAN KAUFMANN.
Ikuhisa, M, Michihiko, M, Tsuneo, A & Noboru, B 2009, 'Sensing web: to globally share sensory data
avoidmg privacy mvasion', Proceedings of the 3rd International Universal Communication
Symposium, ACM, Tokyo, Japan.
Jiawei, H. and Micheline, K., 2006, Data Mining: Concepts and Techmques, MORGAN
KAUFMANN PUBLISHER.
Kevin, N, Nithya, R, Mohamed Nabil Hajj, C, Laura, B, Sheela, N, Sadaf, Z, Eddie, K, Greg, P, Mark,
H & Mani, S 2009, 'Sensor network data fault types', ACM Trans. Sen. Netw., vol. 5, no. 3, pp.
1-29.
Keim,D., 2002, "Information Visualization and Visual Data Mimng", IEEE TRANSACTIONS ON
VISUALIZATION AND COMPUTER GRAPHICS, VOL. 7, NO. 1, JANUARY-MARCH
2002, pp.100-107.
Kiyoharu, A, Datchakom, T, Shinya, K & Toshihiko, Y 2004, 'Efficient retrieval of life log based on
45
Reference
context and content', Proceedings of the the 1 st A CM workshop on Continuous archival and
retrieval of personal experiences, ACM, New York, New York, USA.
Klein, A & Lehner, W 2009, 'Representing Data Quality m Sensor Data Streaming Environments', J
Data and Information Quality, vol. 1, no. 2, pp. 1-28.
Liang, X. and Liang, Y., 2001, "Applications of data mining in hydrology". Proceedings of the
IEEE international Conference on Data Mining, pp 617-620.
Linda, MZ & Hisashi, K 2002, 'A simplified EM algonthm for detection of CPM signals m a fading
multipath channel', Wzrel. Netw., vol. 8, no. 6, pp. 649-658.
Linden,G & Hanks,S., 1997, "Interactive Assessment of User Preference Models: The Automated
Travel Assistant", Proceeding of the Sixth International Conference, pp. 67 -79.
Liu, Q., Bai, Q. and Terhorst, A., 2010, "Provenance-Aware Hydrological Sensor Web", the Proceedings of Hydro informatics Conference, pp. 1307-1315, Tianjin, China.
Mark, H, Eibe, F, Geoffrey, H, Bernhard, P, Peter, R & Ian, HW 2009, 'The WEKA data mining
software: an update', SIGKDD Exp/or. News/., vol. 11, no. 1, pp. 10-18.
Mi Li, GH, Bernhard Pfahnnger 2004, 'Clustering Large Datasets Using Cobweb and K-Means in
Tandem', Lecture Notes in Computer Science, vol. 3339, pp. 368-379.
Nachiketa, S, Jamie, C, Ramayya, K, George, D & Rema, P 2006, 'Incremental hierarchical clustering
of text documents', Proceedings of the 15th ACM international conference on Information
and knowledge management, ACM, Arlmgton, Virginia, USA.
Open Geospatial Consortmm, 2007, "OGC Sensor Web Enablement: Overview and High Level Architecture". Technical Report OGC 07-165.
Peter, D, Deepak, G & Prashant, S 2005, 'TSAR: a two tier sensor storage architecture using interval
46
Reference
skip graphs', Proceedings of the 3rd international conference on Embedded networked
sensor systems, ACM, San Diego, California, USA.
Pittelkow, Y E. and Wilson, S R., 2009, "Visualization of Gene Expression Data", The Berkeley
Electronic Press.
Jeffrey, SW, 2004, Data Mining: An Overvie11~ Congress Research Service,.
Stephanie Lindsey, CR, Knshna Sivalingam 2001, Data Gathering in Sensor Networks usmg the
Energy Delay Metric, Los Angeles.
Soukup,T & Davidson,!., 2002, "Visual Data Mining: Techniques and Tools for Data Visualization and
Mining", John Wiley & Sons, Inc, first ed1t10n.
Su, F., C. Zhou, V. Lyne, Y. Du and W. Shi. (2004). "A data-mining approach to determine the
spatio-temporal relationship between environmental factors and fish distribution." Ecological
Modelmg 174.4 421-431.
Thomas, C and Fischer, G, 1996. "Using agents to improve the usability and usefitlness of the
World-Wide Web", In Proceedings of the Fifth International Conference on User Modeling,
5-12.
Tapas Kanungo, DMM, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman, Angela Y. Wu
2002, 'An Efficient k-Means Clustering Algorithm: Analysis and Implementation', IEEE
TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, vol. 24, no. 7,
pp. 881-892.
47
Reference
7. Appendix
A. Related materials of data mining model
The initial input dataset arff file named "test_ data.arff'' 1s in the "Data Mining
Materials" folder of the included CD-ROM.
The advanced input dataset arff file named "test_ data_ advacned.arff'' is in the "Data
Mining Materials" folder of the included CD-ROM.
B. Data visualization materials
The demo of the visualization is named "demo.wmv" in the "Data Visualization
Materials/Demo" folder of the included CD-ROM.
The source code and the XML data base of the demo in the "Data Visualization
Materials/Source" folder of the included CD-ROM.
48