INTELLIGENT PRE-PROCESSING FOR DATA MINING
by
LUDWIG DE BRUIN
DISSERTATION
submitted in fulfilment of the requirements for the degree of
MASTER OF SCIENCE
in
INFORMATION TECHNOLOGY
at the
UNIVERSITY OF JOHANNESBURG
SUPERVISOR: PROF. E.M. EHLERS
CO-SUPERVISOR: DR. G.J. VAN NIEKERK
2013
Keywords
Data mining, data pre-processing, classification, clustering, computational intelligence,
intelligent agents
Abstract
Data is generated at an ever-increasing rate and it has become difficult to process or
analyse it in its raw form. Most data is generated by processes or measuring
equipment, resulting in very large volumes of data per time unit.
Companies and corporations rely on their Management Information Systems (MIS)
teams to perform daily Extract, Transform and Load (ETL) operations to data
warehouses in order to provide them with reports. Data mining is a Business
Intelligence (BI) tool and can be defined as the process of discovering hidden
information in existing data repositories. The successful operation of data mining
algorithms requires data to be pre-processed so that algorithms can derive IF-THEN rules.
This dissertation presents a data pre-processing model that transforms data in an
intelligent manner to enhance its suitability for data mining operations. The Extract,
Pre-Process and Save for Data Mining (EPS4DM) model is proposed. This model
performs the required pre-processing tasks on a chosen dataset and transforms the
dataset into the formats required. The transformed data can then be accessed by data
mining algorithms from a data mining mart whenever needed.
The proof-of-concept prototype features agent-based Computational Intelligence (CI)
algorithms, which allow the pre-processing tasks of classification and clustering
to be performed as a means of dimensionality reduction. The task of clustering requires
the denormalisation of relational structures and is automated using a feature-vector
approach. A Particle Swarm Optimisation (PSO) algorithm is run on the patterns to find
cluster centres based on Euclidean distances. The task of classification takes a
feature vector as input and makes use of a Genetic Algorithm (GA) to produce a
transformation matrix that reduces the number of significant features in the dataset. The
results of both the classification and clustering processes are stored in the data mart.
Table of Contents
Keywords 2
Abstract 3
List of Figures 8
List of Tables 10
List of Algorithms 11
List of Abbreviations 12
Chapter 1: Introduction 14
1.1 Background 14
1.2 Purpose 15
1.3 Research questions 15
1.4 Methodology 16
1.5 Dissertation structure 16
1.6 Conclusion 20
Chapter 2: Data storage evolution 22
2.1 Introduction 22
2.2 Background and context 22
2.2.1 The Relational Data Model 26
2.2.2 Segregated environment 27
2.2.3 Viewing of Data 30
2.3 Conclusion 31
Chapter 3: Database models 33
3.1 Introduction 33
3.2 Database models 33
3.2.1 The file system models 33
3.2.2 Disadvantages of Traditional File System Models 37
3.2.3 The Relational Model 38
3.2.4 Object-oriented model 43
3.2.5 Object-relational model 46
3.2.6 XML data model 47
3.2.7 Data manipulation 48
3.3 ANSI/SPARC architecture 49
3.4 Important database concepts 49
3.4.1 Schema 50
3.4.2 Virtual Table (View) 50
3.4.3 Metadata 51
3.4.4 Data dictionary 51
3.4.5 System catalog 52
3.4.6 Indexing 52
3.4.7 Transactions and concurrency control 53
3.5 Specialised database models 55
3.5.1 Document-oriented model 55
3.5.2 Temporal model 56
3.5.3 Spatial model 55
3.5.4 Fuzzy object model 56
3.5.5 Dimensional model 57
3.5.6 NoSQL databases 57
3.6 Conclusion 57
Chapter 4: Data warehousing 60
4.1 Introduction 60
4.2 Business Intelligence 60
4.3 Data warehouse fundamentals 61
4.3.1 Subject oriented 61
4.3.2 Integrated 62
4.3.3 Non-volatility 65
4.3.4 Time variant 66
4.4 Creating analytical data 66
4.4.1 Selecting data sources 66
4.4.2 Integrating the data 68
4.4.3 Loading operational data 69
4.5 Data warehouse and data models 71
4.5.1 Inmon’s top-down approach 71
4.5.2 Kimball’s bottom-up approach 72
4.6 OLAP and data warehouse 75
4.6.1 Data analysis example 75
4.6.2 OLAP overview 78
4.7 Data Warehouse structures and NoSQL databases 79
4.8 Conclusion 79
Chapter 5: Data Mining 82
5.1 Introduction 82
5.2 Knowledge Discovery in Databases (KDD) 83
5.3 Overview of the data mining process 84
5.4 Data pre-processing tasks 90
5.4.1 Data denormalisation 90
5.4.2 Clustering 91
5.4.3 Classification 94
5.4.4 Dimensionality Reduction 95
5.5 Dimensionality Reduction techniques 99
5.6 Rule Extraction 101
5.7 Conclusion 102
Chapter 6: Intelligent Agents 105
6.1 Introduction 105
6.2 Artificial Intelligence 105
6.3 Intelligent Agent 106
6.4 Multi-agent systems 110
6.5 Conclusion 112
Chapter 7: Computational Intelligence 115
7.1 Introduction 115
7.2 Computational Swarm Intelligence 115
7.3 Evolutionary Computation 121
7.4 Conclusion 124
Chapter 8: The EPS4DM model 126
8.1 Introduction 126
8.2 Traditional ETL compared to the EPS4DM model 126
8.3 Clustering 128
8.4 Classification 132
8.5 Overview of the data mining mart 135
8.6 Conclusion 136
Chapter 9: The EPS4DM’s data extraction process 139
9.1 Introduction 139
9.2 Mapping relational structures 139
9.3 Process overview 143
9.4 Conclusion 144
Chapter 10: EPS4DM's Clustering process 146
10.1 Introduction 146
10.2 Determining cluster centres 146
10.3 PSO clustering algorithm 148
10.4 Measuring cluster quality 153
10.5 Process overview 154
10.6 Conclusion 154
Chapter 11: EPS4DM's Classification Process 157
11.1 Introduction 157
11.2 The GA Feature extractor 157
11.3 Chromosome mapping 160
11.4 Crossover operator 160
11.5 Mutation operator 161
11.6 Selection strategy 162
11.7 KNN rule and training data 163
11.8 Fitness function 163
11.9 GA parameters 165
11.10 Conclusion 165
Chapter 12: Implementation and Results 168
12.1 Introduction 168
12.2 Java based implementation 168
12.3 The Test setup 172
12.4 PSO-based clustering results 173
12.5 Clustering conclusion 180
12.6 GA-based classification results 180
12.7 GA algorithm parameters 180
12.8 Randomised datasets 180
12.9 UCI-standard datasets 183
12.10 High-dimensional datasets 184
12.11 Classification conclusion 184
12.12 The data mining mart 185
12.13 Conclusion 185
Chapter 13: Conclusion and Future work 188
13.1 Introduction 188
13.2 Evaluation 188
13.3 Research-driven questions that had to be answered by the dissertation 189
13.4 Future work 192
13.5 Conclusion 192
References 194
Addendum A 207
List of figures
Figure 2-1 Visualisation of extract programs creating child databases 25
Figure 2-2 Visualisation of a shared nothing cluster 28
Figure 2-3 Visualisation of a shared disk cluster 29
Figure 2-4 The segregated database environment 29
Figure 3-1 Example hierarchy of organisation 35
Figure 3-2 Example record structure of an employee 36
Figure 3-3 Example graph of cities showing distances 36
Figure 3-4 Example use of foreign key 44
Figure 3-5 Example of a normalised database structure 45
Figure 3-6 Example OO data model 46
Figure 3-7 ANSI/SPARC architecture 50
Figure 3-8 Example of an indexing tree 53
Figure 4-1 Subject orientation 62
Figure 4-2 Encoding of gender data 63
Figure 4-3 Semantic translation of data 63
Figure 4-4 Conversion of data format 64
Figure 4-5 Time-key hierarchical differences 65
Figure 4-6 The non-volatility of data warehouse data 66
Figure 4-7 Inmon’s data warehouse architecture 72
Figure 4-8 Dimensional modelling – star schema 74
Figure 4-9 Star schema showing derived dimension tables 74
Figure 4-10 Kimball’s data warehouse architecture 75
Figure 5-1 KDP in the production environment 83
Figure 5-2 The data mining process 84
Figure 5-3 Data selection 86
Figure 5-4 Visual representation of outliers 88
Figure 5-5 Linear task-oriented overview of the data mining process 89
Figure 5-6 Clustering example 92
Figure 5-7 Visual representation of hill-climbing clustering 93
Figure 5-8 Divisive clustering 94
Figure 5-9 Visual depiction of K-Nearest Neighbour classification 95
Figure 5-10 Class identification in one dimension 97
Figure 5-11 Class identification in two dimensions 97
Figure 5-12 Class identification in three dimensions 98
Figure 6-1 Architecture for a simple reflex agent 107
Figure 7-1 Particles moving towards the new global best 118
Figure 7-2 Charged particles being repelled 120
Figure 7-3 Crossover operation of two Chromosomes 122
Figure 8-1 Traditional ETL compared to the EPS4DM model 127
Figure 8-2 The clustering process as used in the EPS4DM model 129
Figure 8-3 Associativity between target records [Alf07, Ray10] 130
Figure 8-4 Deriving feature vectors from relational data 130
Figure 8-5 The data clustering process 131
Figure 8-6 Feature Vector normalisation 133
Figure 8-7 Building the normalised feature vector 133
Figure 8-8 Building the transformation vectors [Far07] 134
Figure 8-9 The classification process 135
Figure 8-10 The data mining mart 136
Figure 9-1 Associativity between target records combined as feature vectors [Alf07] 140
Figure 9-2 The effect of feature weights to expose outliers [Ray00] 141
Figure 9-3 Example feature vector with indexes to complex types 142
Figure 9-4 Example of a feature pattern being built 142
Figure 9-5 Data extraction process 144
Figure 10-1 PSO particle representing a solution in two dimensions with four clusters 150
Figure 11-1 GA/KNN feature extractor 159
Figure 11-2 Bit mask used for chromosome mapping [Ray00] 160
Figure 11-3 Feature weight used as chromosome mapping 160
Figure 11-4 Crossover operation of mapped feature vectors [Eng07] 161
Figure 11-5 Mutation operator of a single feature vector [Eng07] 162
Figure 12-1 Helper classes within the EPS4DM prototype 170
Figure 12-2 PSO clustering classes within the EPS4DM prototype 171
Figure 12-3 KNN classifier and GA chromosome classes used within the EPS4DM prototype 173
Figure 12-4 GA algorithm and result classes used within the EPS4DM prototype 174
Figure 12-5 K-Means and PSO cluster centre analysis 175
Figure 12-6 K-Means and PSO cluster centre analysis with error rate feedback 176
Figure A.1 Simple Iris relational structure 211
Figure A.2 Complex Iris relational structure 211
List of tables
Table 3-1 Example flat file showing login times of employees 34
Table 3-2 Naming conventions 41
Table 4-1 OLTP Invoice Table 76
Table 4-2 OLTP Invoice Line Table 77
Table 4-3 OLAP Dimensional Table 77
Table 11.1 GA Parameters 165
Table 12.1 Clustering data sample from Iris dataset 169
Table 12.2 Classification data sample from Iris dataset 169
Table 12.3 Intra-cluster distances on the Iris dataset 177
Table 12.4 Error rates on the Iris dataset 177
Table 12.5 Intra-cluster distances on the Iris dataset for simple relational mapping 178
Table 12.6 Error rates on the Iris dataset for simple relational mapping 178
Table 12.7 Intra-cluster distances on the Iris dataset for complex relational mapping 178
Table 12.8 Error rates on the Iris dataset for complex relational mapping 178
Table 12.9 Fitness values and cluster centres for three sets of random data 179
Table 12.10 GA-based algorithm parameters 181
Table 12.11 Generated Feature matrix 182
Table 12.12 Transformation matrix w* values 182
Table 12.13 Iris and Glass datasets 184
Table 12.14 Error rate of 2, 4-D dataset 185
Table A.1 The Iris dataset 207
List of algorithms
General K-Means clustering 91
Hill climbing clustering 93
K-Nearest neighbour classification 95
Principal component analysis 99
Particle Swarm Optimisation (PSO) 116
Genetic algorithm 123
Traditional K-Means to find cluster centres 146
PSO clustering 148
PSO algorithm to calculate cluster centres 151
List of abbreviations
ACID Atomicity Consistency Isolation Durability
AI Artificial Intelligence
ANSI American National Standards Institute
ASCII American Standard Code for Information Interchange
BI Business Intelligence
CI Computational Intelligence
CPSO Charged Particle Swarm Optimisation
CRUD Create Read Update Delete
DBMS Database Management System
DDL Data Definition Language
DML Data Manipulation Language
EBCDIC Extended Binary Coded Decimal Interchange Code
ERD Entity Relationship Diagram
ETL Extract Transform Load
GA Genetic Algorithm
KDD Knowledge Discovery in Databases
KDP Knowledge Discovery Process
KIF Knowledge Interchange Format
KNN K-Nearest Neighbour
KQML Knowledge Query and Manipulation Language
MAS Multi-Agent System
MIS Management Information System
OLAP Online Analytical Processing
OLTP Online Transactional Processing
OO Object Orientation
OODBMS Object-Oriented Database Management System
O/RDBMS Object-Relational Database Management System
ORM Object Relational Mapping
PSO Particle Swarm Optimisation
RDBMS Relational Database Management System
RSS Rich Site Summary
SPARC Standards Planning and Requirements Committee
SQL Structured Query Language
TSP Travelling Salesperson Problem
XML eXtensible Mark-up Language
Chapter 1
Introduction
1.1 Background
With the advent of inexpensive multi-gigabyte storage devices and broadband network
connectivity options, organisations, businesses and individuals are collecting data at an
alarming rate. The sheer volume of data has become so large that it is practically
impossible to visualise the available data, let alone analyse it.
Human-generated data (i.e. documents, letters and records) constitutes only a small
fraction of the overall body of data. Electronic systems such as online searches and
banking transactions are responsible for the bulk of the data. This data is accumulated
in the hope that it might be of use at a later stage, but according to Witten there is an
ever-growing gap between data and the understanding thereof [Wit05].
The underlying principle is that collated data is verbose in nature, as it needs to
describe transactions and maintain state so that Online Transaction Processing (OLTP)
systems can meet business needs. The normal data mining process dictates that an
Extract, Transform and Load (ETL) process from either a relational database or a data
warehouse provides the input for the rule extraction algorithms. This implies that all
data from a relational database or data warehouse is subject to pre-processing.
Organisational rules dictate the frequency of ETL processes; they usually run once
nightly, which means that data mining pre-processing tasks operate on data that is
already dated.
The objective of this dissertation is to propose a data mining pre-processing system that
manages the data mining ETL pre-processing tasks and stores the pre-processed data
in a data mining mart. The system maintains the data in this mart, and the data mining
algorithms can use it as their input. Because the mart always holds the most up-to-date
pre-processed data, the algorithms can effectively be run whenever the need arises,
without costly time being spent on pre-processing.
1.2 Purpose
The purpose of this dissertation is to provide an overview of the current state of
production OLTP databases and data warehousing structures as the input to the data
mining process. It also investigates the current data mining pre-processing tasks and
how Computational Intelligence (CI) based algorithms, coupled with multi-agent
technology, can manage these tasks.
A data pre-processing model, the EPS4DM model, is proposed which will collate the
required relational data as decided upon by a user, perform the pre-processing tasks
and store the processed data in a data mining mart. This enables the data mining rule
extraction algorithms to use this mart as input source for their process.
The dissertation investigates the viability of implementing agents, embedded with CI, to
orchestrate the pre-processing tasks in order to maintain a data mining mart. To direct
the contributions of this dissertation, four research questions have been devised.
1.3 Research Questions
To enhance the success of this dissertation and to support the goal of designing a pre-
processing system, the following questions are posed to guide the research:
Question 1: Why are OLTP system databases structured the way they are, how is this
data stored and in what way does data warehousing differ from data
mining?
Question 2: What are the current data mining pre-processing tasks and how are they
performed?
Question 3: How can multi-agent technology embedded with CI orchestrate and
execute the pre-processing tasks?
Question 4: How viable is it to maintain a data mining mart with pre-processed
relational data?
The following section provides an overview of the research methodology.
1.4 Research Methodology
The research process is derived from the motivation to model a system to maintain a
data mining mart of pre-processed data. To reach this goal, background to the problem
is given and the motivation formulated. Research questions aimed at answering the key
aspects of the goal of the dissertation are also established.
The second step is to conduct an in-depth literature study to provide insight into
current production database structures, data warehousing, the data mining process and
the pre-processing steps. The literature study addresses research question 1. A process
methodology is required when there needs to be an understanding of the processes
needed to accomplish tasks in a computer science application. An investigation into and
explanation of the data mining processes and the techniques used to perform the pre-
processing tasks addresses research question 2. The study also incorporates the
mathematical foundation of the processes involved [Ama07].
The third step is to propose a model for orchestrating the tasks required by questions 3
and 4. This model encompasses the mathematical, algorithmic and architectural aspects
related to each sub-task of the global process. The experimental methodology is utilised
when solutions to problems within computer science need to be evaluated. This will
help to answer the mentioned questions. The proposed model is then implemented as a
prototype system [Ama07]. The fourth and final step is to provide the empirical results
of the prototype, the viability of maintaining a data mart and any future improvements
of the work.
The next section provides an overview of the structure of the dissertation.
1.5 Dissertation structure
The structure of the dissertation is summarised in Figure 1.1 and each chapter’s content
is discussed in this section.
Chapter 1 provides the background to the problem and introduces the research
questions and dissertation structure.
1.5.1 Background
Chapter 2 gives a brief historic overview of how data storage has evolved since the early
days of computing. The purpose of the chapter is to introduce legacy data storage
infrastructures, the way in which they evolved into the now widely adopted relational
model of data, and why the need arose for a segregated data environment.
Chapter 3 provides an overview of the different database models (past and present) and
the shortcomings associated with them; this is done to justify why production
environments use the models they do. The purpose of this chapter is to describe the
constraints and rules associated with a production transaction management database
(an OLTP system), which must address the data needs of an organisation. Some of the
inherent problems of the relational database can be addressed by making use of a
specialised database model.
Chapter 4 gives an overview of data warehousing: why it is needed, how it relates to the
relational database and in what way data is stored in a warehouse. The purpose of the
chapter is to explain why data warehouses exist, how they support Online Analytical
Processing (OLAP) and to introduce the concepts behind the inner workings of the
multi-dimensional data structure.
Chapter 5 provides an overview of the data mining process and the way in which it
forms part of the larger process of Knowledge Discovery in Databases (KDD) followed
by an introduction to the important pre-processing tasks and the core mathematical
fundamentals behind them. The purpose of this chapter is to provide the reader with an
understanding of the differences between data mining and data warehousing, introduce
the problems associated with pre-processing and the current techniques used to
accomplish the tasks of clustering and classification as dimensionality reduction
techniques.
Chapter 6 gives an academic overview of intelligent agents, artificial intelligence and
multi-agent technology. The aim of the chapter is to acquaint the reader with the
concepts of agents and the way in which they are employed in systems to accomplish
tasks. This is followed by an introduction to Artificial Intelligence (AI) as a precursor to
Computational Intelligence (CI).
Chapter 7 provides an overview of computational intelligence and identifies it as a
branch of AI. The purpose of the chapter is to introduce two methods within CI, namely
swarm intelligence and genetic algorithms, and to provide the reader with the core
concepts behind each algorithm. The proposed model draws upon these mathematical
fundamentals.
1.5.2 Model and Implementation
Chapters 6 and 7 introduce the tools required to construct the pre-processing system,
which is defined in the chapters discussed in this section.
Chapter 8 gives an architectural overview of the EPS4DM model describing how multi-
agent technology is used to orchestrate the processes of extracting the relational data
into formats required by the classification and clustering processes and in what way the
results of these sub-processes are used to maintain the data mart.
Chapter 9 provides the proposed model for data extraction and denormalisation. The
purpose of the chapter is to introduce the concept of feature vectors and the way in
which the extraction sub-process will construct the vectors as required in the clustering
process.
Chapter 10 provides the proposed model for the clustering process and the necessary
architectural, algorithmic and mathematical models. The purpose of the chapter is to
give a holistic overview of the sub-processes required, namely: a PSO-based cluster
centre calculation, the method in which cluster quality is measured and the persistence
of the clustered data.
Figure 1.1: Dissertation structure
[Diagram: Chapters 1–13, with Chapters 2–7 providing the background (data storage evolution, database models, data warehousing, data mining, intelligent agents, computational intelligence), Chapters 8–11 forming the EPS4DM model (data pre-processing system, data extraction and denormalisation, clustering, classification) and Chapters 12–13 presenting the implementation, results and conclusion.]
Chapter 11 provides the proposed model for the classification process which does the
dimensionality reduction and provides the architectural, algorithmic and mathematical
models required. The purpose of the chapter is to give a holistic overview of the sub-
processes required, namely: feature vector creation, chromosome mapping, feature
extraction via a GA, the calculation of the reduced set via a classifier and the persistence
of the reduced dataset.
1.5.3 Results and Conclusion
Chapter 12 discusses the results of the implementation described in Chapters 8 to 11.
The purpose of the chapter is to provide insight into the performance and accuracy of
the CI-based methods used in the prototype, as well as to assess the overall feasibility of
maintaining an up-to-date mart of pre-processed data for the datasets. By way of
conclusion, Chapter 13 summarises the dissertation, answers the research questions
posed and discusses future work.
1.6 Conclusion
Chapter 1 introduced the purpose of the dissertation by providing a summarised
background and motivation for the study. The chapter also formulated the key research
questions that will direct the research in the dissertation.
The chapter provided an overview of the structure of the dissertation by discussing the
relevance of each chapter to follow. The goal of this dissertation is to propose a data
mining pre-processing system which will maintain a mart of pre-processed data for the
selected dataset.
Chapter 2 provides an insight into what data organisations require in order to function
as well as a brief overview of the history of database technology as it has evolved over
the past half-century.
Chapter 2
Data storage evolution
2.1 Introduction
With the advent of inexpensive multi-gigabyte storage devices and broadband network
connectivity options, organisations, businesses and individuals are collecting data at an
alarming rate. The sheer volume of data has become so large that it is practically
impossible to visualise the available data, let alone analyse it.
Human-generated data (i.e. documents, letters and records) constitutes only a small
fraction of the overall body of data. Electronic systems such as online searches and
banking transactions are responsible for most of the data. This data is accumulated in
the hope that it might be of use at a later stage, but according to Witten there is an
ever-growing gap between data and the understanding thereof [Wit05].
There is an estimated 473 billion gigabytes of data on the Internet (correct at time of
publishing) [Wra09]. If this data were bound into books and the books stacked one
upon another, the stack would stretch from Earth to Pluto ten times. To put this into
perspective, the distance between Earth and Pluto (depending on where Earth finds
itself on its elliptical orbit) is anywhere between 4.2 and 7.5 billion km [Kai08]. Clearly
the industry is faced with astronomical amounts of data, with no tangible means of
analysing or using it all.
2.2 Background and context
Looking at any profit-based business, one can conclude that the key driving factor is to
increase operating revenue and profit. The only viable method of achieving this goal is
to have information relating to the operation of the business readily available.
Organisational data constitutes raw facts and is often uncorrelated and difficult to
understand or read. Information is the result of processing these raw facts, thus
producing data that has meaning and can be understood.
Most companies and organisations are well equipped to collect and store vast quantities
of data relating to the daily operation of their businesses. The difficulty lies in the
analysis of this data and the conversion of the masses of data into useful information,
because this is pivotal to the decision support structures of the entity.
The first question to ask in this regard is: “Where and how is the data stored?” In order
to understand the Information Systems of today, it is necessary to examine the
Information Systems of yesteryear in more detail. The origins of Decision Support
Systems (DSS) – the name given to systems that aid business decisions within an
organisation – and data warehousing date back to the early 1960s, when Information
Technology was still in its infancy [Inm05].
Gupta states that a remarkable amount of organisations’ business data still resides on
legacy mainframe infrastructure. The main reason for this is that these systems provide
the core day-to-day operation of the business and that the risk of moving the data over
to modern infrastructure simply outweighs the perceived gain [Gup10].
In the 1960s the computing environment consisted primarily of standalone programs
that made use of a so-called master file. This master file was a flat file stored on
magnetic tape and accessed by a program created using punch cards. The data on the
tape-based storage medium had to be accessed sequentially, meaning it had to be read
from beginning to end to access data [Inm05].
Around 1965, the sheer number of master files in use by a typical organisation, and the
amount of tape storage required to store these files, became impractical. Access times of
20 to 30 minutes per tape were not uncommon [Inm05].
The hardware requirements that were needed to house and access the tape media, the
slow access speed of the medium, the inherent complexities of creating and maintaining
programs and the creation of redundant data became significant problems of this era
and alternative solutions were sought to improve the manageability of data [Inm05].
The systems of this era and those following it were mainly dubbed Management
Information Systems (MIS), as these were systems that provided managers with reports
relating to the accounting and transactional data of the organisation [Pow03, Lau10,
Uma13].
The 1970s saw the advent of magnetic disk storage, on which files could be accessed by
record number – meaning that it was no longer necessary to access data sequentially.
The user could access a specific record of data quickly and easily by “jumping” to the
address of the record on the disk. This reduced the access time of data from several
minutes to a few seconds. Programmers also created the first Relational Database
Management System (RDBMS) during this time. The essence of this piece of software
was that it could manage the storage of data on the disk medium via indexes [Inm05,
Rob12, Gra12].
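The shift from sequential tape access to direct record addressing can be sketched in Java (the language later used for the prototype). This is an illustrative sketch, not code from any system of the era; the fixed record size and file layout are assumptions made for the example:

```java
import java.io.IOException;
import java.io.RandomAccessFile;

public class DirectRecordAccess {
    // Illustrative fixed-width record size in bytes (an assumption for this sketch).
    static final int RECORD_SIZE = 32;

    // Reads record n by "jumping" straight to its byte address on disk,
    // instead of scanning from the beginning as tape storage required.
    static byte[] readRecord(RandomAccessFile file, long n) throws IOException {
        byte[] record = new byte[RECORD_SIZE];
        file.seek(n * RECORD_SIZE); // direct positioning: address = record number * record size
        file.readFully(record);
        return record;
    }
}
```

An index, in this view, is essentially a structure that maps key values to such record addresses, which is what allowed the early RDBMS software to manage disk storage efficiently.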
The majority of Information Systems of this time made use of IBM's mainframe
architecture. Most programs were written in the Common Business-Oriented Language
(COBOL) programming language, a widely used programming language at the time,
while IBM’s DB2 was used to maintain data [Gup10]. The name given to IBM
mainframes from that era was the System/370™. This system built on the successes of
the System/360™ architecture, which was used during the 1960s by several
organisations; and this was also the architecture used by NASA’s Mission Control during
the 1969 Apollo 11 landing [IBM07].
The advent of the Database Management System (DBMS) resulted in the first real
database. The DBMS was able to provide a single, consolidated source of data for all
processing by applications – which no longer required the use of master files [Inm05].
Another move forward was the creation of the Online Transaction Processing (OLTP)
system. This system provided advanced database features such as concurrency control
(multiple access to the same data by various sources) and multi-threading
(independently controlled sections of instructions that can potentially be executed in
parallel). The OLTP system’s main function was to provide enhanced transaction
processing capabilities for organisations [Har08].
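The two features mentioned above can be illustrated with a minimal Java sketch. The class and method names are hypothetical and demonstrate only the concepts, not the workings of any particular OLTP product:

```java
public class Account {
    private int balance = 0;

    // Concurrency control: 'synchronized' serialises access to the shared
    // balance, so two simultaneous transactions cannot interleave their
    // read-modify-write steps and lose an update.
    synchronized void deposit(int amount) {
        balance += amount;
    }

    synchronized int balance() {
        return balance;
    }
}
```

Running two threads that each deposit into the same Account demonstrates multi-threading: both instruction streams execute independently (potentially in parallel), while the lock guarantees that every deposit is recorded.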
The 1980s saw the introduction of the first Personal Computer (PC), which in terms of
data management allowed the end user to manage and own data electronically. By this
time, organisations had created massive OLTP systems that managed large quantities of
data. The need quickly arose for some means of extracting specific information from
these systems, and so programmers created the so-called extract programs. These
programs would search the database for data that matched criteria provided by the
user. This data would then be moved to another source, such as a client-side database
or file, allowing users to retrieve the relevant data they required [Inm05].
As shown in Figure 2.1, the child databases represent copies of relevant data from the
parent databases, while the arrows show the flow of data through the extraction
programs.
Figure 2.1: Visualisation of extract programs creating child databases [Inm05]
This duplication of data to other sources is fundamental to a data warehouse, and will
be discussed later. It is worth mentioning that these systems work on the principle of
data residing in a flat file or table format. This implies that no dimensional relation
exists between data. The extraction programs have to search for relevant links in order
to obtain useful information from the system.
The most significant progress relating to databases and data warehousing was however
realised in the last decade of the previous century. It was during this time that the
foundation of modern database systems was laid, the most prominent being:
the underlying models of data,
the structural environment of databases,
the way data was viewed, and
metadata (data about data) [Inm05].
The next section will discuss the relational database model that revolutionised the way
in which data was modelled, viewed and processed.
2.2.1 The Relational Data Model
The most defining change in the modelling, processing and viewing of data was the
move to the relational data model. A cornerstone concept of the relational model was to
view data as entities. In 1976, Peter Chen introduced the concept of an Entity-
Relationship Model (ER Model) which models entities, their attributes and the
relationships they participate in. Chen was describing the concepts formulated by Codd
in an earlier publication (which will be discussed later) regarding the relational model
of data [Hal08].
Since its inception, many different notations have been used to depict the relationships
that exist between entities. However, the basic semantics remain the same and a simple
ER Model will be used to illustrate how data is stored in databases.
The classical example of students who attend classes offered by lecturers will be used.
Students attend individual classes that are taught by lecturers, who may have one or
more classes to present. Likewise, each student has one or more courses consisting of
classes, which that student has chosen. This is a typical definition of a university
domain.
It can be deduced that the following entities will be required: student, lecturer, class
and course. They have specific attributes and are related in some way. An example of
these types of relationships will be depicted in Chapter 3 by means of an Entity-
Relationship Diagram (ERD).
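By way of a sketch, the entities of this university domain might be declared as relational tables (the table and column names here are illustrative only; the actual ERD appears in Chapter 3):

```python
import sqlite3

# In-memory database holding the four entities and their relationships.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE lecturer (lecturer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE course   (course_id   INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE class    (class_id    INTEGER PRIMARY KEY,
                       course_id   INTEGER REFERENCES course(course_id),
                       lecturer_id INTEGER REFERENCES lecturer(lecturer_id));
CREATE TABLE student  (student_id  INTEGER PRIMARY KEY, name TEXT);
-- a student attends many classes and a class has many students:
CREATE TABLE attends  (student_id  INTEGER REFERENCES student(student_id),
                       class_id    INTEGER REFERENCES class(class_id));
""")
tables = [r[0] for r in con.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(tables)  # ['attends', 'class', 'course', 'lecturer', 'student']
```

Note how the many-to-many relationship between students and classes already requires a fifth, linking table; this point is taken up again under cardinality in Section 3.2.3.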
2.2.2 Segregated environment
The data storage mechanisms mentioned thus far primarily contained the operational
data of organisations (transactional data). One refers to them as operational databases
and performance is fundamental to their design.
They are however not well suited for providing useful and correlated information to
end users. Programmers developed On-Line Analytical Processing (OLAP) tools to
address this limitation. OLAP tools are specifically designed to analyse data in ways that
running SQL queries on relational models cannot. OLAP is a tool suite that allows data to
be modelled and analysed in the same way the user would think about it. Data is not
broken up into different tables; but dimensions of data are grouped [Har08]. However,
these tools tend to be performance intensive and it is therefore not always a viable
option to analyse the operational data of the organisation. Chapter 4 presents an in-
depth discussion of this.
In terms of database architecture, there was also a shift from mainframe driven
architectures to client-server based approaches, which made the use of multiple PCs for
data management within the organisation more attractive [Har08].
2.2.2.1 Database clustering
A database cluster is a collection of hardware and software, collectively referred to
as servers, that works together as a single system. Although the physical hardware
may reside in various locations, the system serves client requests for data as if it
were a single physical hardware-based system. The building
blocks of a cluster are nodes, an interconnected medium and a shared storage medium.
Each node is a standalone system, with its own processor, memory and operating
system. All the nodes connect to and share the data provided by the shared storage
[Doh03].
There are two distinct types of database clusters, namely:
i. Shared nothing clusters
This architecture allows database files to be segregated and distributed on each node.
Thus, the database as a whole is broken down and each node is responsible for only a
small sub-section of the data. The obvious disadvantage of this is that if one or more
nodes fail, only a sub-set of the entire database will be available [Doh03].
Figure 2.2: Visualisation of a shared nothing cluster [Doh03]
ii. Shared disk clusters
This architecture allows database files to be logically shared among a collection of
interconnected nodes, where each node has access to all data. As seen in Figure 2.3 this
can be accomplished by physically linking the database hardware in a network and
allowing servers access to all these data sources. Alternatively an operating system
layer can be used to abstract the viewing of all data sources. The obvious advantage is
high fault tolerance, as each node has access to all data [Doh03].
The modern database environment has shifted from a DBMS (which managed single
records on a disk) to a segregated environment. Figure 2.4 shows the segregated
database environment consisting of four main components: The operational database,
the data warehouse, the data mart and the individual layer (or temporary database as
depicted in Figure 2.3).
Figure 2.3: Visualisation of a shared disk cluster [Doh03]
Figure 2.4: The segregated database environment [Inm05]
As seen in Figure 2.4, there is a linear order to the flow of data between the
components, owing to the nature of the data being stored after it has been processed.
The arrows indicate that data is derived from one type of database in order to store it
in one more suited to its purpose. The data mart and the temporary database are
therefore smaller subsets of the operational database and the data warehouse. Each of
these components has specific characteristics governing what type of data can be stored
in order to address a specific need.
The operational level, as mentioned, contains the operational database, which has to be
a high performance system that allows transaction processing for many users. Typically,
the database stores primitive application data (data at the working level, such as a
client’s current account status).
The data warehouse level mostly contains “integrated, historical primitive data that
cannot be updated”. Because this data is time-based, it can reveal trends and patterns
(for example, a client’s payment and purchase history over a specified period, which
might be used to produce a credit rating) [Inm05].
The data mart level contains data that is specific to a department within the
organisation and is tailored to their needs (for example, the Product department might
want to determine which products have the highest sales over a specified period in
order to determine which are successful and which are not).
There exists a fourth layer as well, namely the individual level, which contains
temporary, ad hoc data that end users require to perform an analysis on the data. For
example, the end user might want to know what trends he/she can foresee in the
different types of products that produce the most sales.
2.2.3 Viewing of Data
As mentioned previously, the segregated environment allows transactional data to be
kept separate from analytical data. The fundamental difference between the two is the
workload that they support. Operational databases (that process transactions) make
use of the classical OLTP system and handle the organisation’s core data. The sales
division of a department store will for example keep client account details and a
telecommunications company will need to keep track of client bill records and call data
records. In order to make sense of the data, it has to be compared and analysed
[Wre07].
Thus, business data needs to be organised in such a manner as to make analysis easy
and for this OLAP is used. It stores data in optimised structures commonly following a
multidimensional approach. Data is related in many ways, for example, sales are linked
to a customer and to a product. In OLAP, a structure like this will be compared across
each dimension. These complex structures will be explained in Chapter 4 [Sch10].
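The multidimensional idea can be illustrated with a toy aggregation over two dimensions (the sales facts below are invented; real OLAP engines precompute such roll-ups over cube structures):

```python
from collections import defaultdict

# Toy sales facts: each fact relates a measure (amount) to two dimensions.
sales = [
    {"customer": "Acme", "product": "Widget", "amount": 100},
    {"customer": "Acme", "product": "Gadget", "amount": 50},
    {"customer": "Bolt", "product": "Widget", "amount": 70},
]

def roll_up(facts, dimension):
    """Aggregate the amount measure along one chosen dimension."""
    totals = defaultdict(int)
    for fact in facts:
        totals[fact[dimension]] += fact["amount"]
    return dict(totals)

# The same facts viewed along either dimension:
print(roll_up(sales, "customer"))  # {'Acme': 150, 'Bolt': 70}
print(roll_up(sales, "product"))   # {'Widget': 170, 'Gadget': 50}
```

The point is that the data is not broken up into separate tables per view: one set of facts supports analysis along each dimension.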
2.3 Conclusion
Database systems have evolved considerably since their inception almost half a century
ago. Data remains a core asset to any organisation, business or individual. Without data,
the IT industry would not exist, nor would any other industry be able to operate
effectively.
Many technical and architectural refinements have been made over the past decades;
as a result, large amounts of data can be managed effectively by computing
systems.
The fundamental problem faced by organisations is the sheer volume of data that they
generate and house. They need all this data in order to make strategic decisions, but
there is simply too much to make sense of it. Thus, the analytical components were
moved to OLAP systems that represent data in multidimensional structures, making the
analysis of data easier.
Chapter 3
Database models
3.1 Introduction
Chapter 2 introduced the data storage techniques employed in some of the very first
information systems. These techniques mostly involved flat files that were tightly
coupled with a specific programming code to extract the required data.
In order to fully come to terms with data warehouses, data marts and other modern
storage facilities used within the data mining environment, some core database aspects
must be examined. This chapter provides an overview of database models and reveals
some of the problems that still face data storage today.
3.2 Database models
A relational database has to abide by the Information Principle. The Information
Principle, or, more descriptively, the Principle of Uniform Representation, states that
all information within the relational database is represented as, and only as, relations
[Dat08, Dat12].
To truly understand what a relation is and why it is needed, an examination of the
history of database modelling is required. A database model can be regarded as the
blueprint for a database. It defines the structure and organisation of data within the
database [Pow06].
3.2.1 The file system models
This section provides a summary of four file system models, namely master files, the
hierarchical model, the network model and spreadsheets.
i. Master files
In Chapter 2 it was stated that the earliest form of data storage was the so-called master
or flat file. The term flat file means that all data is kept within a simple file that has no
topological structure. These are often text files, but they need not be. Earlier
implementations used binary files, which took less storage space [Pow06].
The difference between binary and text data is essentially only the way in which it is
encoded, i.e. the format used to create the file. A text file is a file that is usually encoded
with the American Standard Code for Information Interchange (ASCII) format. ASCII
assigns 7-bit (binary digit) numeric values to characters, including the common
alphabet, punctuation marks and numbers. The character ‘a’, for example, is
represented by the numeric value 97, which has the binary value of 1100001. Other text
file encodings include UTF-8 and on older mainframes the Extended Binary Coded
Decimal Interchange Code (EBCDIC) format was used [Mic02]. Thus a text file, which
can be viewed as plain text on a computer, is not encoded as the text itself but as
mapped numeric values.
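This mapping can be verified directly in Python:

```python
# The character 'a' maps to the numeric value 97, whose 7-bit binary
# representation is 1100001:
code = ord("a")
print(code)                          # 97
print(format(code, "07b"))           # 1100001

# A text file is stored as such numeric values, not as the glyphs:
print(list("abc".encode("ascii")))   # [97, 98, 99]
```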
A binary file uses full bytes for each ‘unit’ of information, as opposed to the 7-bit
‘unit’ of ASCII. Each program typically created its own binary file format. These files
were often unique to that
particular program. When text editors view files of this nature, they attempt to map the
numeric values to characters (by using a default character encoding), often ending up
displaying a multitude of illegible system characters. Binary files also often contain
header information, which defines the data found in the file [Mic02]. Powell states that
this kind of database is often called a file system database, as it consists of many files
that are stored on disk drives. Any topology between data in the files (i.e. relations) has
to be searched for programmatically by the applications that use them [Pow06].
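The distinction between a structured binary record and its text rendering can be sketched with Python’s `struct` module (the record layout and header below are hypothetical, not the format of any real program):

```python
import struct

# Hypothetical binary record format: a 4-byte header holding the record
# count, then per record a 2-byte id and an 8-byte fixed-width name field.
RECORD = "<H8s"
header = struct.pack("<I", 1)                       # header: one record
record = struct.pack(RECORD, 1, b"JoeSmith")
blob = header + record                              # 14 bytes in total

# A text editor would render these bytes as illegible characters; a
# program that knows the layout can decode them exactly:
(count,) = struct.unpack("<I", blob[:4])
rec_id, name = struct.unpack(RECORD, blob[4:])
print(count, rec_id, name.decode("ascii"))          # 1 1 JoeSmith
```

Only an application that knows the layout (the role played by the header information mentioned above) can interpret such a file correctly.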
Table 3.1: Example flat file showing login times of employees
Record  Name           Gender  Department  Last Login
0001    Joe Smith      Male    Finances    March 12, 2010 08:30:45
0002    Susan Milian   Female  HR          March 12, 2010 08:31:33
0001    Joe Smith      Male    Finances    March 13, 2010 07:33:41
0001    Joe Smith      Male    Finances    March 13, 2010 12:45:35
0002    Susan Millian  Female  HR          March 14, 2010 09:30:32
In Table 3.1 a sample flat file is presented and it is clear that the file contains data that
has been duplicated.
Authors refer to this phenomenon as Data Redundancy and it will be discussed in
Section 3.2.2 [Rob07].
ii. Hierarchical model
The hierarchical model is a tree-like structure which contains logically inclusive data.
The tree structure is inverted with the root node at the top, and child nodes branching
from it. It is important to note that each child node can have only one parent node, but
each parent node may have multiple child nodes. The one-to-many relationship between
parents and children is introduced here [Pow06, Ste09, Dat12].
Figure 3.1: Example hierarchy of organisation [Pow06].
A typical example of a hierarchy is the organisational organogram shown in Figure 3.1.
In computing, the file structure on disk drives is another example of a hierarchy. It
typically contains directories and folders that logically group related files
together. Another good example is the Microsoft Windows Registry, which is composed
of so-called hives of data that store system and application related data in a hierarchical
structure [Ste09].
Halpin states that data stored in child nodes are groupings known as records. A record
is a collection of fields (or facts) that group an entity together. Figure 3.2 depicts a
typical record structure containing the data types for each field of the employee
information shown in Table 3.1 [Hal08].
iii. Network Model
The network model is essentially a hierarchical data model with one exception to the
parent-child rule: the parent–child relationship also allows for many-to-many
relationships. This implies that a parent may have multiple child nodes and a child
multiple parent nodes [Pow06].
Figure 3.2: Example record structure of an employee [Hal08]
According to Stephens, the network model can also be seen as a directed graph as the
existence of the many-to-many relationship allows for complex structures to be created.
Network database models are rarely used today and a more natural use for this data
model is internal memory representations for applications which examine problems
relating to graphs. One such example is the Travelling Salesperson Problem
(TSP), which aims to find the shortest path in a graph connecting towns as seen in
Figure 3.3 [Ste09].
Figure 3.3: Example graph of cities showing distances [Ste09]
[Figure 3.2 content: record fields integer pers_nr, string emp_name, char gender,
string dept, timestamp login_time; example row: 0001, Joe Smith, M, Finances,
12/03/2010 08:30:34]
[Figure 3.3 content: five city nodes A–E connected by edges with distances between
9 and 16]
iv. Spreadsheets
Although spreadsheets are not databases per se, they do form a fundamental part of the
background of computer data storage and manipulation. A spreadsheet contains tabular
data in the form of rows and columns, a similar format to a database table, but stored as
a binary or ASCII flat file. This file is often application specific. Spreadsheet software
allows the end user to manipulate the data by creating formulas and charts, and to
export the data to other formats. It also supports analysis by means of statistical
functions and solution-finding tools [Ste09].
Spreadsheets are a common tool of choice for financially oriented users. Before the
late 1970s, people had to rely on mainframe computers and hand-held calculators to
perform financial calculations. The need for a single tool to perform these
calculations drove the spreadsheet’s creation. In 1978, Dan Bricklin and Bob
Frankston created the first electronic spreadsheet, named VisiCalc. Many of their
original ideas, such as the row-and-column layout and the syntax used to build
formulas, formed the basis of modern spreadsheets [Wal07].
A user can still download the original VisiCalc and, after more than 30 years, run it
in an MS-DOS window on a modern Microsoft® operating system such as Windows XP. Other
major spreadsheets of note are Lotus 1-2-3, Quattro Pro and Microsoft® Excel. The latter
product, Excel, was designed for the Macintosh and was released in 1985 as a successor
to MultiPlan – Microsoft’s first spreadsheet that was launched in 1982 on the MS-DOS
platform [Wal07].
3.2.2 Disadvantages of Traditional File System Models
The data models described to this point rely on the traditional file system paradigm,
where groups of separate files are managed independently, regardless of the data
model employed [Gup09].
The most notable disadvantages of traditional file systems are [Gup09, Bur10]:
i. Data Redundancy – This arises when data has to be replicated across separate
files to allow the applications using them to function properly. It increases
storage requirements and demands more complex programming to keep all relevant files
up to date. Customer details, for example, might need to be saved in the customer
file as well as the sales file.
ii. Data Inconsistency – Data redundancy inevitably leads to data inconsistency,
which occurs when data relating to a specific entity is not updated in a consistent
manner. One file then holds data that conflicts with another. Over time, this can
lead to serious degradation of the quality of data.
iii. Data Integration – When queries span data residing in multiple files, complex
programming is needed to extract and compile the required information.
iv. Data and Program Dependence – The physical location of the files that contain
data needs to be programmatically assigned, thus the application is dependent
on its data source. The files are also program dependent in the sense that the
application relies on the structure and organisation of data within a file.
v. Data Control and Sharing – Since multiple files contain data about the same
elements, and these files reside in different places for each application’s use, the
system is decentralised. Due to this decentralised nature, the sharing of data
between different applications once again requires the creation of specialised
programs which compile the data required.
This list is by no means complete, but only serves to show the problems that needed to
be solved by future models.
3.2.3 The Relational Model
i. The father of relational databases
It was the pioneering work of E.F. Codd, an IBM researcher, that in 1970 successfully
addressed the issues posed by traditional file system databases. The original idea
proposed by Codd in his revolutionary paper “A Relational Model of Data for Large
Shared Data Banks” was to build smaller repositories of data from a larger set, and link
them via relations. He proposed that subset data be stored in a relation, with fields and
entities [Pow06].
Codd posed 12 rules by which a relational database must abide. As stated in the
opening paragraphs of this chapter, the Information Principle is in fact Codd’s first
rule (also restated below).
Davidson et al. present a practical overview of Codd’s 12 rules as implemented by
modern RDBMS (Relational Database Management System) vendors, the major players being
Microsoft® SQL Server and Oracle 11g. The Information Principle and the Guaranteed
Access Rule are regarded as the fundamental rules governing modern RDBMSs. These two
concepts are discussed next [Dav08, Dat12].
a. Information Principle – All data stored within a relational database must be
represented in the same way, namely as values in the columns of tables. Data
may only be accessed by referring to the column name of the table in which it
resides.
b. Guaranteed Access Rule – Any atomic value of data will be accessible by
referring to the table name in which it resides, the primary key of its record entry
and then the column name. This implies that there is no order on either the
columns or rows of a table. Columns can be shifted around within a table and not
affect how it is retrieved. Rows are uniquely identified through their primary
keys and their order within the table is therefore arbitrary.
ii. Database relational theory
Before an overview is given of how modern relational databases work, the fundamentals
of database relational theory need to be covered.
a. Heading
A heading is a collection of attributes, where an attribute consists of a name, A, and a
data type, T. The rules governing a heading are:
Each attribute in the heading has a distinct name A.
Each attribute is associated with exactly one data type T.
The number of attributes in the heading is referred to as the degree of the
heading [Dat08].
An example heading for a table storing a student’s data might be:
{<ID, int>; <Name, string>; <Surname, string>; <CourseID, int>}
b. Tuple
A tuple can be defined as an ordered collection of elements. When the tuple has exactly
n elements it is referred to as an n-tuple, and mathematically it can be represented as
(x₁, …, xₙ). Continuing the earlier example of the student table, an n-tuple of length 4
is, for example: (1, John, Doe, 22)
Let s₁ and s₂ be sets. The Cartesian product of s₁ and s₂ is defined as:
s₁ × s₂ = {(x, y) | x ∈ s₁ and y ∈ s₂}
Hence the Cartesian product is the set of all ordered n-tuples where each element comes
from the corresponding set [Dat08]. Thus, for the collection of sets s₁, …, sₙ:
s₁ × … × sₙ = {(x₁, …, xₙ) | for all i in (1 ≤ i ≤ n), xᵢ ∈ sᵢ}
c. Relation
A relation R exists if it is a subset of the Cartesian product s₁ × … × sₙ; R is hence
an n-ary relation over s₁ × … × sₙ. Where n = 1 it is referred to as a unary relation,
n = 2 as binary and n = 3 as a ternary relation [Dat08].
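These definitions can be made concrete with small finite sets (the values are illustrative):

```python
from itertools import product

s1 = {1, 2}
s2 = {"John", "Jane"}

# The Cartesian product s1 x s2: every ordered pair (x, y)
# with x from s1 and y from s2.
cartesian = set(product(s1, s2))
print(len(cartesian))  # 4

# A relation R is any subset of that product, e.g. "student 1 is John,
# student 2 is Jane":
R = {(1, "John"), (2, "Jane")}
print(R <= cartesian)  # True: R is a binary relation over s1 x s2
```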
iii. Modern relational databases
Now that a fundamental understanding of relations has been established, its application
within modern relational databases can be examined.
Table 3.2 compares important modern database terminology at the design and
implementation level to the original terminology. Reference will be made to modern
terminology within this section [Dav08].
Table 3.2: Naming conventions
Relational Theory   Design Model              Implementation Model
Relation            Entity                    Table
Tuple               Instance                  Row
Heading1            (Collection of) Columns   (Collection of) Columns
Heading entry       Attribute                 Field
In light of the naming conventions introduced in Table 3.2, the remainder of this
dissertation will make use of the Implementation Model convention.
d. Candidate Keys
It has been established that a table may not have duplicate rows. The way to implement
this within a database is to ensure that each table defines a candidate key as one of its
attributes [Dat08, Dav08, Dat12].
A candidate key is an attribute (or combination of attributes) that can uniquely and
unambiguously identify each instance of an entity. To enforce the Guaranteed Access
Rule a candidate key is needed to avoid duplicate instances, and to correctly locate an
instance within the database, as the ordering of elements in a table is arbitrary. There is
no upper limit to the number of keys that a specific table element may need in order to
maintain its uniqueness [Dav08].
e. Primary Keys
The primary key is the candidate key which is chosen to uniquely identify each row. As
mentioned, a candidate key may be a combination of attributes that uniquely identifies
the row. Any candidate key chosen beyond the primary key is referred to as an alternate
key.
1 Heading information for modern databases is found within the System Catalog.
f. Foreign Keys
Foreign keys are used within a relational database to ensure that:
1) duplicate values are not present within tables; and
2) links exist between entities.
A foreign key is a candidate key of an entity which establishes a link between that
entity and one or more other entities. This ensures that when one entity requires data
from another, it only requires the key to access this information, eliminating
duplication. The foreign key most commonly links to the primary key of another entity,
but it may, alternatively, link to an alternate key, in which case it is called an
optional foreign key [Dav08].
It is worth mentioning that according to Codd’s 10th rule, the Integrity Independence
Rule, every non-optional foreign key must be accompanied by an existing primary key in
the entity that it is related to. This means that the foreign key cannot enforce referential
integrity if the primary key it relates to does not yet exist. The row, once linked, may
not be deleted without deleting all the rows referencing the primary key first [Dav08].
Furthermore, modern relational databases may not require the creation of a primary
key for the data within a table, thus not explicitly enforcing Codd’s rules. This is
useful in situations where data needs to be imported from external sources, such as a
text file or a spreadsheet file, that do not contain any topological structure. The
raw data can be loaded into a database table and manipulated there in order to
transfer it to a relational table; the DBMS will treat the entire row as a primary key
for the imported data [Dav08].
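Primary keys, foreign keys and referential integrity can be demonstrated with an in-memory SQLite database (a sketch with illustrative table names; note that SQLite enforces foreign keys only when the pragma is enabled):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only on request
con.executescript("""
CREATE TABLE job_description (job_desc_id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE personnel (pers_nr INTEGER PRIMARY KEY, name TEXT,
                        job_desc_id INTEGER NOT NULL
                            REFERENCES job_description(job_desc_id));
INSERT INTO job_description VALUES (1, 'Accountant');
INSERT INTO personnel VALUES (1, 'Joe Smith', 1);
""")

# A non-optional foreign key must point at an existing primary key:
try:
    con.execute("INSERT INTO personnel VALUES (2, 'Jane Doe', 99)")
except sqlite3.IntegrityError as e:
    print("insert rejected:", e)

# ...and a referenced row cannot be deleted while rows still reference it:
try:
    con.execute("DELETE FROM job_description WHERE job_desc_id = 1")
except sqlite3.IntegrityError as e:
    print("delete rejected:", e)
```

Both statements fail with a foreign-key constraint violation, which is exactly the behaviour required of a non-optional foreign key as described above.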
g. Cardinality
The use of foreign keys within a database environment introduces another concept,
namely cardinality. Cardinality simply indicates how many instances of one
entity relate to another [Dav08].
i. One-to-Many (1:M) – These are the most common relations found within a database
environment: one instance of an entity may link to any number of instances of
another entity. An example is the use of one company vehicle by many employees
[Rob07, Red12].
ii. One-to-Exactly-N (1:N) – Less commonly used, these rely on constraints being
set up, such as one employee having exactly 3 email addresses in the company.
The most common special case is the one-to-one relation, which relates exactly one
instance of an entity to only one other instance of an entity [Rob07, Red12].
iii. Many-to-Many (M:N) – These are real-life relations, which link more than one
instance of an entity to many other instances. By way of example: in an organisation,
employees have many job skills, and these job skills can in turn be learnt by many
employees. Such relations cannot be implemented directly in a relational data model;
they require the use of a bridging table. The basic idea is that instead of the
candidate keys (a foreign key in one table linking to the primary key of another)
linking the tables directly, both are placed in a new table, which contains these
candidate keys and sometimes additional summary information [Rob07, Dav08, Red12].
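The employee/skill example can be sketched as a bridging table in SQLite (the names and data are illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE employee (emp_id   INTEGER PRIMARY KEY, name  TEXT);
CREATE TABLE skill    (skill_id INTEGER PRIMARY KEY, label TEXT);
-- The bridging table holds both candidate keys (plus, optionally,
-- summary columns such as a proficiency level):
CREATE TABLE employee_skill (
    emp_id   INTEGER REFERENCES employee(emp_id),
    skill_id INTEGER REFERENCES skill(skill_id),
    PRIMARY KEY (emp_id, skill_id));
INSERT INTO employee VALUES (1, 'Joe'), (2, 'Susan');
INSERT INTO skill    VALUES (1, 'SQL'), (2, 'COBOL');
INSERT INTO employee_skill VALUES (1, 1), (1, 2), (2, 1);
""")

# Which employees have the SQL skill?
rows = con.execute("""
    SELECT e.name FROM employee e
    JOIN employee_skill es ON es.emp_id = e.emp_id
    JOIN skill s ON s.skill_id = es.skill_id
    WHERE s.label = 'SQL' ORDER BY e.name""").fetchall()
print([r[0] for r in rows])  # ['Joe', 'Susan']
```

The M:N relationship is thus decomposed into two 1:M relationships, each anchored on the bridging table.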
The process of normalisation eliminates duplicate entries in data and establishes the
correct relations between the different entities. This process will specifically separate
the data into the correct number of tables, create the correct bridging tables and
introduce the required candidate keys [Rob07, Red12]. Chen’s Entity-Relationship (ER)
model allows the modelling of entities and the relationship which they have with each
other. An entity is defined as something that is capable of existing on its own within a
system and could be a real-world physical object such as a person. An entity can be
thought of as a noun [Hal08].
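A minimal sketch of what normalisation achieves, applied to the flat file of Table 3.1 (the structure is a deliberate simplification): the repeated employee details move into one table, and each login event becomes a row that references the employee by key.

```python
# Rows of the flat file: (record, name, gender, department, last login).
flat = [
    ("0001", "Joe Smith",    "Male",   "Finances", "2010-03-12 08:30:45"),
    ("0002", "Susan Milian", "Female", "HR",       "2010-03-12 08:31:33"),
    ("0001", "Joe Smith",    "Male",   "Finances", "2010-03-13 07:33:41"),
]

employees, logins = {}, []
for emp_id, name, gender, dept, login in flat:
    employees[emp_id] = (name, gender, dept)   # stored once per employee
    logins.append((emp_id, login))             # foreign key + event only

print(len(employees), len(logins))  # 2 3
```

The redundancy disappears: two employee rows instead of three duplicated ones, with the login events related to them by key.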
Figure 3.4 shows the use of a foreign key. The employee Joe Smith is linked to his job
description (JD) by referring to the JD primary key in the personnel table. (Job_Desc_ID
is the primary key of the job description table, and is used as a foreign key in the
personnel table.)
3.2.4 Object-oriented model
The fundamental idea behind a computer object is to model the real-world environment.
In the object-oriented (OO) data model, an object represents an entity, much as a
relational entity does, and contains the data required to describe it. However, it
also contains information about the object’s relationships to other objects.
Figure 3.4: Example use of foreign key
The data contained in the object is given semantics (a description of its data, or
metadata); hence this model is commonly known as a semantic model of data [Rob07].
A computer object is said to be an encapsulation of both its data and the operations that
can be executed on that data. This has been extended to object-oriented databases,
therefore an object in the database also contains valid operations which can be executed
on it. These operations are called methods. This model supports the concept of
inheritance, whereby a child object may inherit attributes and methods from its parent
object, which ensures greater data integrity [Rob07].
Figure 3.6 depicts the personnel data that was presented earlier, encapsulated as an
employee object.
The first block represents its attributes, the second block its relations (the link to the job
title as shown in Figure 3.4) and the last blocks, the operations that can be performed
on the object. An inheritance structure is also depicted showing the different types of
the base object Employee.
Figure 3.5: Example of a normalised database structure
The use of this type of database is only recommended if it is used in conjunction with a
modern object-oriented programming language. The object-oriented database is also
managed by an Object-Oriented Database Management System (OODBMS) [Ste09].
Figure 3.5 depicts the employee from Figure 3.4 as a lecturer who gives courses that
many students can take.
3.2.5 Object-relational model
Object-oriented databases face a great challenge, as they require a mind shift away
from the traditional view of data, which introduces problems of its own.
Figure 3.6: Example OO data model
The idea behind the object-relational model is to combine the best traits of both data
models: features of the object-oriented data model are incorporated into the
object-relational model, while the familiar relational structure is still maintained.
An Object-Relational Database Management System (O/RDBMS) manages the object-
relational database [Rob07, Red12].
A programming technique known as object-relational mapping (ORM) is
gaining ground in modern systems development. It is used to map between program-
specific object-based data and the relational data within a traditional relational
database. It is essentially a layer between the program code and the data in the
database. The main motivation behind this is that it simplifies and speeds up
development. It leaves complex SQL queries to be constructed by the ORM layer –
freeing up the developer from having to perform complex join operations to extract
data. Frequently used ORM frameworks in the Java programming environment include
Hibernate and iBATIS [Ste09, Hib13, Iba13].
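The idea can be sketched with a minimal, hand-rolled mapping layer (an illustrative Python example using the built-in sqlite3 module; this is not how Hibernate or iBATIS are implemented, and all table, column and function names here are assumptions mirroring the earlier personnel examples):

```python
import sqlite3

# A plain object on the program side of the ORM layer.
class Employee:
    def __init__(self, personnel_number, name, job_desc_id):
        self.personnel_number = personnel_number
        self.name = name
        self.job_desc_id = job_desc_id

def save(conn, emp):
    # The "ORM layer": it constructs the SQL so application code never has to.
    conn.execute(
        "INSERT INTO Employee (Personnel_Number, Name, Job_Desc_ID) VALUES (?, ?, ?)",
        (emp.personnel_number, emp.name, emp.job_desc_id),
    )

def find(conn, personnel_number):
    # Maps a relational row back to a program-specific object.
    row = conn.execute(
        "SELECT Personnel_Number, Name, Job_Desc_ID FROM Employee "
        "WHERE Personnel_Number = ?",
        (personnel_number,),
    ).fetchone()
    return Employee(*row) if row else None

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employee (Personnel_Number TEXT, Name TEXT, Job_Desc_ID TEXT)"
)
save(conn, Employee("0001", "Joe Smith", "001"))
print(find(conn, "0001").name)  # Joe Smith
```

The application code above manipulates only objects; the SQL lives entirely inside the mapping layer, which is the simplification ORM frameworks provide at scale.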
3.2.6 XML data model
XML (Extensible Markup Language) is a language designed specifically for representing
data. It was fundamentally designed to store hierarchical data, but it is also used to
store data in a tabular format [Ste09].
An example of the personnel data presented can be represented in XML in the following
way:
<Employee JobTitle="Accountant">
<PersonnelNumber>0001</PersonnelNumber>
<Name>Joe Smith</Name>
<Job_Desc_ID>001</Job_Desc_ID>
</Employee>
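Such a record can be consumed programmatically; for example (an illustrative Python sketch using the standard xml.etree.ElementTree parser):

```python
import xml.etree.ElementTree as ET

# The employee record from the example above, as a string.
record = """
<Employee JobTitle="Accountant">
  <PersonnelNumber>0001</PersonnelNumber>
  <Name>Joe Smith</Name>
  <Job_Desc_ID>001</Job_Desc_ID>
</Employee>
"""

emp = ET.fromstring(record)
print(emp.get("JobTitle"))    # Accountant  (an attribute)
print(emp.findtext("Name"))   # Joe Smith   (a child element)
```

The hierarchical structure maps naturally to a tree of elements, which is why XML suits both hierarchical and simple tabular data.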
XML is widely used as a protocol for transferring data between networked and
internet-based systems. Modern databases, such as Oracle 10g, incorporate support for
XML in order to store and manage “unstructured” data of this type; unstructured data is
data for which no explicit relations exist [Fel07]. Native XML databases also exist,
which store and manage only XML-based data. In the section on the modern relational
database, it was mentioned that text files and spreadsheets are used to import data into
relational databases. XML provides an alternative, whereby XML data can be imported
and managed separately or translated to relational data [Rob07].
3.2.7 Data manipulation
The language needed to access the data within the database is an important aspect of
database theory. According to Codd’s 5th rule, a language that can perform and enforce
“data definition, view definition, data manipulation (interactive and by program),
integrity constraints, and transaction boundaries (begin, commit, and rollback) is
needed” [Dav08].
The Structured Query Language (SQL) – pronounced “sequel” – consists of a Data
Definition Language (DDL) and a Data Manipulation Language (DML). A DDL is used for
operations such as defining the structure of tables and their relations to each other, for
instance Create Table or Alter Table. A DML is used for all operations on data in a
database. It has four fundamental statements, namely: Insert, Select, Update and Delete.
In database communities, these are informally referred to as CRUD – Create, Read,
Update, and Delete [Dav08, Red12, Cel12].
The following SQL statements will retrieve Employee 0001’s Name from the Employee
table and update his job title, as depicted in the previous examples:
Select Name
from Employee
where Personnel_Number = '0001';

Update Employee
Set JobTitle = 'Manager'
where Personnel_Number = '0001';
3.3 ANSI/SPARC architecture
In the 1970’s, the American National Standards Institute (ANSI) Standards Planning and
Requirements Committee (SPARC) defined in what way data should be viewed as layers
of abstraction. They essentially defined four views (or abstractions) of data, namely, the
External model, the Conceptual model, the Internal model and the Physical model as
depicted in Figure 3.7 [Rob07].
External model – This refers to how the end users of the systems see the data;
these are often business users.
Conceptual model – This is a global graphical depiction of the database
structure, and is usually an Entity Relationship Diagram (ERD).
Internal model – The conceptual model needs to map to the internal model used
by the specified DBMS. For example, an RDBMS uses relations and an OODBMS
uses objects.
Physical model – This is the lowest level, which defines how data is to be stored
on physical mediums such as disk drives. It is “a software and hardware-
dependent” layer that has to conform to the standards used by the DBMS, the
native operating system and the storage hardware provided by the system
architecture [Rob07, Gra12].
3.4 Important database concepts
This section will discuss the physical layer of the ANSI/SPARC architecture. Each sub-
section introduces a concept used within a DBMS to manage and store its data.
Figure 3.7: ANSI/SPARC architecture
3.4.1 Schema
A database schema can be seen as the first level of security. A database schema is a
grouping of related database objects (such as tables, indexes and views) which only
authorised users can access and view. An entire organisation’s database tables can be
contained in one physical database and not every user may have access to view all the
data in all the tables. It should be mentioned that the nomenclature referring to the
concepts described here is vendor-specific: some DBMSs use the term schema (for
example Oracle®) and others make use of the term users (for example MySQL and
PostgreSQL) [Rob07, Ora11].
3.4.2 Virtual Table (View)
A database view is a stored query that references a set of tables; the query is
typically a frequently used select statement. This query is stored in the database as a
view, which defines a virtual table for the data. Its output changes each time the
referenced tables’ data is updated. Native Insert, Update and Delete operations on
columns referenced by a view are not supported; only a Read operation is allowed.
However, a database trigger could be set up to handle Insert, Update and Delete
operations on a view [Dat08, Rob08, Ora11, Dat12].
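The behaviour of a view can be illustrated as follows (a Python sketch using the built-in sqlite3 module; the table and view names are hypothetical). SQLite, consistent with the description above, treats views as read-only unless an INSTEAD OF trigger is defined:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (Personnel_Number TEXT, Name TEXT, JobTitle TEXT)")
conn.execute("INSERT INTO Employee VALUES ('0001', 'Joe Smith', 'Accountant')")

# A view is a stored query defining a virtual table over Employee.
conn.execute(
    "CREATE VIEW Accountants AS "
    "SELECT Personnel_Number, Name FROM Employee WHERE JobTitle = 'Accountant'"
)
print(conn.execute("SELECT Name FROM Accountants").fetchall())  # [('Joe Smith',)]

# The view's output tracks the base table: after this update the row disappears.
conn.execute("UPDATE Employee SET JobTitle = 'Manager' WHERE Personnel_Number = '0001'")
print(conn.execute("SELECT Name FROM Accountants").fetchall())  # []
```

No data is stored in the view itself; each read re-evaluates the stored query against the current state of the referenced table.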
3.4.3 Metadata
Although the database environment is not the only place where metadata might occur, it
is of vital importance in database systems2. The usual definition of metadata – data
about data – never describes its real use or purpose, and the term often creates
confusion in the computing domain. There are three metadata classifications, namely:
Descriptive metadata – This is identification data, which describes how other
data is to be discovered and searched for. An example is a library catalogue
system, used to index and locate books.
Structural metadata – Objects are often super-objects made up of other objects.
This describes how each object is put together using the other objects.
Administrative metadata – Data that almost always needs to be managed, thus
this will provide information describing technical details, access rights, file
creation data and so forth [Gup09].
3.4.4 Data dictionary
The data dictionary stores metadata about the database and is often automatically
managed by the DBMS. The following types of metadata are often stored:
Physical database design – This describes technical details such as file names,
access paths and which storage data structures are used. It also provides a listing
of all tables found within the database.
Database schema – A complete description of the schemas that can be found in
the database is maintained, together with valid users and their associated access
rights and views [Gup09].
2 Metadata such as the schema, data dictionary and system catalogue is often stored within the RDBMS as relational tables.
3.4.5 System catalog
The system catalog can be regarded as a more detailed data dictionary. It describes all
the objects found within the database in detail, thus for a given table, it will describe the
table header, data types, column counts, creation times, and last update information to
name a few [Rob07].
As a result of the interwoven nature of the data in the data dictionary and system
catalog, modern DBMSs often produce only a consolidated system catalog. The terms
system catalog and data dictionary may be used interchangeably in modern database
terminology and both form part of the ANSI/SPARC conceptual and internal layers
[Rob07].
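As an illustration, SQLite exposes its consolidated catalog as the sqlite_master table (a minimal Python sketch; other DBMSs expose similar information through, for example, an information schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (Personnel_Number TEXT, Name TEXT)")
conn.execute("CREATE INDEX idx_name ON Employee (Name)")

# sqlite_master lists every object in the database: tables, indexes, views, etc.
for type_, name in conn.execute("SELECT type, name FROM sqlite_master"):
    print(type_, name)
```

Querying the catalog like any other table is possible precisely because the metadata is itself stored relationally, as noted in the footnote above.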
3.4.6 Indexing
The concept of indexing has been in existence for centuries: books have indexes and
page numbers, and filing systems often have tabs to help users find what they are
searching for. Indexes consist of (key, pointer) pairs and are often stored in tree data
structures [Lig07].
A B+ tree is often the data structure of choice for modern DBMS indexing, and will be
described briefly:
Non-leaf index nodes contain p tree pointers and p − 1 keys. The reason is that,
when a search is done, keys are compared by being either smaller than, equal to,
or greater than one another. Suppose a search is done for key k by comparing
k to a key kᵢ in the node: if k is smaller than kᵢ, the pointer logically
located to the left of kᵢ will be followed. Therefore, there is a pointer to the left
and right of every key that can point to a new non-leaf or leaf node, which can
reduce the search space considerably [Lig07].
Leaf index nodes differ in the sense that they consist of p (key, data pointer)
entries and a pointer to the next leaf node. Leaf nodes are ordered in such a way
that they sequentially reference each row of data, and a data pointer is more
properly called a Record Identifier (RID). Thus, if k matches a key kᵢ, the RID at
position i will be followed to access the row of data required. If none of the keys
match, it means that a top-level index routed the search to the wrong node and a
linear search of the subsequent leaf nodes will be performed to find the key. The
main reason for this ordering is, however, not fault tolerance in finding a row,
but the quicker retrieval of an ordered sequence of rows [Lig07].
As depicted in Figure 3.8, it is the intermediary non-leaf nodes that allow the data
indexes to be found quickly without any disk I/O overhead occurring. Indexing is part of
the ANSI/SPARC physical layer.
Figure 3.8: Example of an indexing tree [Red12]
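The search procedure described above can be sketched as follows (a minimal, illustrative Python structure with hand-built nodes and example keys; a real DBMS index is disk-based and far more elaborate):

```python
from bisect import bisect_right

# A non-leaf node holds p - 1 keys and p child pointers; a leaf holds
# (key, RID) entries. The tree below is built by hand for illustration.
leaf1 = {"leaf": True, "entries": [(5, "RID-5"), (8, "RID-8")]}
leaf2 = {"leaf": True, "entries": [(10, "RID-10"), (17, "RID-17")]}
leaf3 = {"leaf": True, "entries": [(20, "RID-20"), (31, "RID-31")]}
root = {"leaf": False, "keys": [10, 20], "children": [leaf1, leaf2, leaf3]}

def search(node, k):
    while not node["leaf"]:
        # Follow the pointer to the left of the first key greater than k.
        i = bisect_right(node["keys"], k)
        node = node["children"][i]
    for key, rid in node["entries"]:
        if key == k:
            return rid
    return None  # key not present in the index

print(search(root, 17))  # RID-17
```

Each comparison in a non-leaf node discards a whole subtree, which is how the intermediary nodes reduce the search space before any row of data is touched.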
3.4.7 Transactions and concurrency control
A transaction is a real-world unit of work: it is a grouping of statements that can be
executed on the database. Capturing a sale, for example, will require entries in more
than one table and might cause updates to other data in related tables. Most
operational (or, more aptly in this instance, transactional) databases perform
transactional processing.
Transaction management is subject to the ACID (atomicity, consistency, isolation,
durability) principle, whereby a transaction either completes all its tasks or none at
all. The ACID principles will be explained in more detail later in the chapter [Gup09, Wya13].
Multiple applications will try to perform transactional processing simultaneously; and
might even work on the same data at the same time [Rob07]. Possible problems that
may arise, according to Gupta, are:
Lost updates – Consider two transactions working on the same data. At the
same time, transactions T₁ and T₂ read the value of field mark-up. Transaction T₁
uses the data in some calculation and updates its value. Transaction T₂ also
processes the data and commits its changes. The update made by T₁ is lost as a
result.
Uncommitted dependency – Consider transaction T₁, which changes the value of
variable mark-up at time t₁. At time t₂, transaction T₂ reads mark-up to calculate
actual cost. However, at time t₃, transaction T₁ is rolled back as there was a
failure; thus the value read by T₂ is now invalid [Gup09].
Inconsistent analysis – Assume transaction T₁ is busy updating a set of data,
and updates entry e₁ at time t₁. At time t₂, transaction T₂ starts to perform an
aggregate calculation over the dataset updated by T₁ and uses entries e₁ and e₂.
At time t₃, transaction T₁ completes by updating e₂; thus the aggregate
calculated by T₂ is based upon inconsistent data [Rob07].
Concurrency control is the technique used to control the concurrent execution of
transactions. It makes use of locking techniques3, where one transaction is in a locked
state until another has successfully completed and all involved statements have been
committed [Gup09].
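The lost-update problem and its locking-based remedy can be sketched as follows (an illustrative Python example using threads; the field name mark-up follows the scenario above, and the values are arbitrary):

```python
import threading

mark_up = {"value": 100}
lock = threading.Lock()

def add(amount):
    # Without the lock, two threads could both read the same starting value,
    # and the later write would silently overwrite (lose) the earlier update.
    with lock:
        current = mark_up["value"]
        mark_up["value"] = current + amount

threads = [threading.Thread(target=add, args=(10,)) for _ in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(mark_up["value"])  # 1100: all 100 updates of 10 were preserved
```

The lock plays the role of the DBMS's concurrency control: each transaction's read-modify-write becomes mutually exclusive, so no update can be lost.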
For this reason, the management of all transactions is subject to the ACID (atomicity,
consistency, isolation, durability) principle. The main concepts are briefly summarised:
a) Atomicity – A single transaction can be seen as an atomic unit: either the entire
transaction is processed correctly or nothing is processed. If a failure at any level
(hardware, software, etc.) should occur, the state of the data must remain
unchanged.
3 Database deadlock can occur when one transaction is waiting for data which is still locked by another requesting transaction. For example, database session A requires table Y to operate and hence acquires a lock on it. Session B locks table Z, but also requires table Y to perform an update operation on it. Until session A releases the lock on table Y, session B will need to wait. If session A were to require table Z for an update operation, but session B has it locked and is still waiting for session A’s lock, both sessions are in deadlock, waiting for each other. All-or-nothing resource allocation should be implemented so that session B cannot acquire locks until all the resources it needs are available.
b) Consistency – At the very least, the DBMS must be able to determine whether a
transaction will violate a consistency rule4 and either decide not to process the
transaction or take some action set up by the database designer. An example is
submitting an insert statement that omits a mandatory non-null field, which
could either result in the transaction failing or in the DBMS inserting a default
value.
c) Isolation – A fundamental data access rule which ensures mutually exclusive
access to the data being modified: only the process performing the update may
access it. No other transaction may execute a read operation on data in the
process of being altered.
d) Durability – This is a guarantee by the DBMS that any transaction for which a
success message was sent to a client is indeed committed and will not be lost in
the event of any system failure. Transactions are usually written to a transaction
log, which can be used to re-create the state of the database as it was before the
failure occurred [Ste09, Wya13].
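Atomicity and consistency can be demonstrated with a small, illustrative example (assuming Python's built-in sqlite3 module; the Account table, its values and the transfer scenario are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Account (Name TEXT, Balance INTEGER NOT NULL)")
conn.execute("INSERT INTO Account VALUES ('A', 100), ('B', 100)")
conn.commit()

try:
    with conn:  # one atomic transaction: commits on success, rolls back on error
        conn.execute("UPDATE Account SET Balance = Balance - 50 WHERE Name = 'A'")
        # Violates the NOT NULL consistency rule, so the whole unit fails.
        conn.execute("INSERT INTO Account VALUES ('C', NULL)")
except sqlite3.IntegrityError:
    pass

# The debit of account A was rolled back together with the failed insert.
print(conn.execute("SELECT Balance FROM Account WHERE Name = 'A'").fetchone()[0])  # 100
```

Although the update itself was valid, it belonged to the same atomic unit as the failing insert, so the DBMS left the data unchanged, exactly the all-or-nothing behaviour described above.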
3.5 Specialised database models
The following section presents a few other data models, but this is by no means an
exhaustive listing:
3.5.1 Document-oriented model
The document-oriented database5 is used in instances where users interact with
physical documents – they might want to view them or make amendments to them. A
normal file system does not provide the adequate searching, sorting and concurrency
control that is provided by proper databases. This database might store the contents of
the documents themselves, or merely hold links to physical files on a disk [Ste09].
4 Changes to the database should always move it from one consistent state to another consistent state.
5 MongoDB and CouchDB are common document-oriented database implementations [Nos13].
3.5.2 Temporal model
A temporal database is one that explicitly keeps track of the validity of data: each
entry will have a validity period associated with it. A good example is changing stock
prices that depend on global events, for instance the 2011 Thailand floods that caused
computer hard drive prices to soar [Ste09, Itw11].
3.5.3 Spatial model
A spatial database consists of spatial data that describe the real world. In the real world,
there are either entities or continuous fields. An entity will represent a physical
occurrence such as a river, which has geographical coordinates and boundaries. A
continuous field corresponds to a phenomenon that occurs within the world, such as
temperature changes or pollution levels within a defined boundary. A field’s data is
variable and will be updated from sensor readings or from experiential data [Man05,
Dar06].
3.5.4 Fuzzy object model
This model makes use of Zadeh’s fuzzy set theory in order to represent incomplete and
unclear values embedded in object attributes. Most real-world data are incomplete and
are often difficult to take into account; fuzzy set theory provides a means by which
this can be done [Zon05].
In fuzzy set theory, unlike binary logic, which defines truths as either true or false,
there is no exact truth. The state of an object is described relative to other states;
thus network speed might be measured in terms of slow, not slow, not fast, very fast,
and so forth. A set of IF-THEN rules is then followed to deduce actions for each state [Zon05].
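A minimal sketch of the idea (illustrative Python; the membership cut-off values and the example rule are arbitrary assumptions, not part of any cited model):

```python
# Fuzzy membership for network speed: truth is a degree between 0 and 1.
def slow(mbps):
    # Fully "slow" at or below 1 Mbps, not at all at or above 10 Mbps,
    # linear in between.
    if mbps <= 1:
        return 1.0
    if mbps >= 10:
        return 0.0
    return (10 - mbps) / 9

def fast(mbps):
    # "Fast" here is simply the complement of "slow".
    return 1.0 - slow(mbps)

# An IF-THEN rule fires to the degree that its condition is true:
# IF speed is slow THEN reduce video quality.
speed = 4.0
rule_strength = slow(speed)
print(round(rule_strength, 2))  # 0.67
```

A speed of 4 Mbps is thus neither wholly slow nor wholly fast; the rule fires partially, which is exactly the graded notion of truth that the binary model cannot express.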
3.5.5 Dimensional model
The dimensional data model deviates from the traditional relational model by
representing data in multiple dimensions of abstraction and will be discussed in
Chapter 4.
3.5.6 NoSQL model
The notion of NoSQL databases (often referred to as “not only SQL”) is a collective
term for many different database implementation types. MongoDB and CouchDB are
document-oriented database implementations. Many other instances, such as graph
databases or Google’s wide-column BigTable, also form part of the NoSQL
movement [Nos13].
3.6 Conclusion
Chapter 3 presented an overview of the relational database as the database of choice
for many application programmers when creating enterprise software. By making use of
object-relational mapping (ORM), the use of relational databases has been made
substantially easier, allowing application developers to easily create database-driven
applications.
The use of specialised database implementations and the latest technologies to solve
the problems inherent to the relational database does not solve all business problems.
This chapter has covered the modern approach to storing enterprise data, be it
transactional, document-based or geographical in nature. The data needs of
organisations change continuously, and merely storing transactional and client-related
data is not sufficient any more. Section 3.5 therefore presented specialised databases
that allow more complex data to be stored as part of operational systems. Chapter 3
has shown that the modern enterprise environment consists of operational systems that
need to cater for a myriad of real-time transaction processing tasks, ranging from
flight bookings, credit card payments and call data record processing to the
maintenance of client-related information at an insurance firm or bank. The nature of
this data is complex and much of it is generated by sub-systems that allow the
enterprise to function. The operational systems are entirely dependent on this data in
order to service any user requests that arise; without it, the organisation would not
be able to operate.
There is, however, a problem regarding the segregation of this data and the systems
that use it: the data is very closely coupled to the business unit which requires it,
and thus it often benefits only the smaller group of people who use the application
software reliant on it. Each business unit will have its own applications and data
required to function, and there is often only a limited global view of the data
maintained by all the business units.
Chapter 4 will present an overview of how the operational data that is used by various
business units, stored in their operational databases and operational systems, can be
put to use in the organisation as a whole in order to direct business decision making.
Chapter 4
Data warehousing
4.1 Introduction
Organisations today rely heavily on the use of relational databases but, as Chapter 4
will show, these are not well suited to the analysis of data. Relational databases are
structured to allow OLTP systems to operate effectively in serving user requests.
The organisational business units will often require information, in the form of
reports, pertaining to the transactional data. Chapter 4 will provide an
overview of Online Analytical Processing (OLAP) that makes use of data warehouses
and related data marts in order to analyse the transactional data. Data warehouses are
not intended to replace operational databases such as relational databases, document-
based databases or other NoSQL databases described in Chapter 3, but rather to make
the data they store available for managerial decision making purposes.
Chapter 4 starts with an overview of Business Intelligence (BI), introduces the data
warehouse and its related concepts and concludes with an overview of the real-time
analytical capabilities of an OLAP system.
4.2 Business Intelligence
Business Intelligence (BI) is “having the right access to the right data at the right time”
in order to make informed business decisions. Data can be raw facts or information that
resulted from some kind of an analysis. In essence, this will allow management to make
decisions based on factual operational data within the organisation [Sta07, Fer13].
More formally, IBM defines BI as “the conscious, methodical transformation of data
from any and all data sources into new forms to provide information that is business
driven and results oriented” [Bie03]. The fundamental premise of BI is that all decisions
that management makes should be based on facts and not subjective opinions or
educated guesses. BI aims to provide the most accurate, up-to-date and relevant data at
all times.
This data is available to any person in the organisation who wishes to view it, not
just to management [Lob09].
The database solutions discussed thus far are called Transactional or Operational
Databases and are often referred to as Production Systems. They most commonly (but
need not only) contain the transactional data of the organisation. It might for example
contain the latest sales and customer data. It would be impossible to simply keep all the
data ever captured by the organisation in the same database, hence older (or historical)
data is often removed from the production system after some time. Historical data is
however often the data required to make informed decisions [Sta07].
4.3 Data warehouse fundamentals
A data warehouse is a subject-oriented, integrated, non-volatile and time-variant
collection of data that is used to support managerial decisions. These aspects are
fundamental to data warehouses. The differences between these aspects and those of an
operational database are presented in the following sections. Formally, a data
warehouse as defined by William Inmon (known as the father of the data warehouse) is
an Analytical Database [Inm05]. The fundamental characteristics that make up the data
warehouse will be described in detail in the following sections, with references made
to the relational models as defined in Chapter 3.
4.3.1 Subject-oriented
Data in operational databases is categorised according to physical use: it reflects
the way applications use and require the data, and thus the physical manipulation of
operational data. Another way of looking at it might be to say that it follows the
business processes within the organisation. Figure 4.1 gives an example of the
subject-oriented data in an organisation. It is clear from Figure 4.1 that the data
warehouse contains data related to subjects of the business domain, whereas the
operational database follows a process-based data approach [Sil08, Lab11].
The data warehouse is designed to answer questions from the disparate departments
within the organisation and is designed around the data required by these departments.
Thus, in Figure 4.1 the Sales data from the operational database might be stored as sales
per Customer in the data warehouse so that the sales department can quickly access the
data [Rob07].
Figure 4.1: Subject orientation [Sil08]
4.3.2 Integrated
The operational database is largely application-oriented, thus the data that it uses is
stored in formats required by the applications that use them. Multiple operational (or
sometimes other) databases feed data into the data warehouse, each of which contains
data in its own format. To avoid data translations and possible data lookups while
querying the data warehouse, the data needs to be represented in a uniform manner.
Data is therefore translated beforehand and the uniform data is then integrated into the
data warehouse [Inm05, Sil08, Kim13].
Data integration occurs on three levels, namely:
i. Form
This refers to the representation used to express the data; it refers to both the
syntactic and physical representation. These different expressions are translated
consistently into a single format for use within the data warehouse. Figure 4.2
demonstrates the encoding of gender. Ultimately, this will allow data queries across
all stored subjects [Sil08].
The reason for the representational inconsistencies is that application designers and
developers are usually free to decide which representation their application will use.
Figure 4.2: Encoding of gender data [Sil08]
It does not matter which final representation the designers or developers choose for
integration, as long as it is applied consistently across all data integrated into the
data warehouse [Inm05].
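The translation of form can be sketched as follows (an illustrative Python example; the source-system names and target representation are hypothetical choices, while the gender codes mirror Figure 4.2):

```python
# Each source system encodes gender differently; the data warehouse needs
# one consistent representation ("M"/"F" is chosen here arbitrarily).
SOURCE_ENCODINGS = {
    "hr_system":     {"m": "M", "f": "F"},
    "sales_system":  {"1": "M", "0": "F"},
    "legacy_system": {"b": "M", "g": "F"},
}

def integrate(source, code):
    # Translate a source-specific code into the warehouse's uniform format.
    return SOURCE_ENCODINGS[source][code]

print(integrate("sales_system", "1"))   # M
print(integrate("legacy_system", "g"))  # F
```

Because every incoming record passes through the same translation before loading, queries against the warehouse never need to know which system a record originally came from.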
ii. Function
Function refers to the meaning of data. This is best described by considering cryptic
codes used within disparate units of an organisation. The IT division might, for
example, use the code er05 to indicate that there is a serious problem with
connectivity to a server, whereas the procurement division might use that same code as
an internal tracking number. Figure 4.3 shows how different data descriptions are
translated to a common meaning describing the function. Data of this nature will need
to be integrated into a single data unit that is able to describe the different
meanings [Sil08].
Figure 4.3: Semantic translation of data [Sil08]
iii. Grain
Grain refers to the following two concepts pertaining to data:
a. Unit of measurement
This is simply the unit of measurement of the expressed data that needs to be
converted and consistently integrated into the data warehouse. Figure 4.4
depicts the conversion of data measured in different units [Sil08].
Figure 4.4: Conversion of data format [Sil08]
b. Hierarchical representation
Disparate departments in the organisation often maintain different
hierarchical representations of data, depending on the level of detail of
data they have available or require. When this data is integrated into the
data warehouse, these hierarchical groupings will be combined and thus
form a complete view of the data [Sil08].
Figure 4.5 depicts two different hierarchical representations for a time-
key instance based on month and year. The first example shows how one
department will group time by month and then by year. The second
example depicts how the months and years are grouped on the same level
in the hierarchy.
Figure 4.5: Time-key hierarchical differences [Sil08]
4.3.3 Non-volatility
Operational data is manipulated, record-by-record, at any time and by any authorised
application or end user. This ensures that it contains (in theory at least) the most up-to-
date, accurate data to reflect the current state of the organisation. A data warehouse
dramatically differs from this in the sense that [Inm05, Lab11, Kim13]:
Data is, usually, loaded on a large-scale batch process. This is referred to as
loading a snapshot of the current operational data into the data warehouse.
Generally, once data is loaded, it will not be updated or deleted.
Instead, if updated data is found in the operational database, a new snapshot of
the data is loaded.
The methods listed above ensure that the data warehouse contains a historical record of
the organisation over time. Figure 4.6 compares the use of an operational database to
that of a data warehouse.
Figure 4.6: The non-volatility of data warehouse data [Inm05]
4.3.4 Time variant
Due to the non-volatile nature of the data in the data warehouse, a resulting
characteristic is the time variance of data. This means that the data in the data
warehouse is accurate at some point in time. Time is an important concept in a data
warehouse and can be captured either by embedding a timestamp into each record or by
associating a date with every snapshot. Time horizon refers to the length of time that
data is represented within the chosen data environment [Inm05, Lab11].
Projected data that is generated through some statistical modelling might also form
part of the data warehouse. The model uses the data it has available at a point in
time to generate projections of possible future data; this projected data is also
time-variant in the sense that it was generated from data that was relevant at some
point in time [Rob07].
4.4 Creating analytical data
This section gives an overview of how operational data is transformed into analytical
data for use within a data warehouse environment.
4.4.1 Selecting data sources
The first order of business is to decide what data to use. There are essentially four
sources of data that can be integrated into the data warehouse:
i. Operational data
The operational data of the organisation resides in many disparate database systems,
most of which are tailored to meet the needs of the applications and their users. These
database systems might include relational databases and other operational databases
such as those mentioned in Section 3.5, but they can also include legacy systems.
Ponniah states that the queries to data in an operational database are usually narrow in
the sense that only specific information is retrieved and viewed or manipulated at any
given time [Pon01].
ii. Archived data
Once operational data reaches the end of its time horizon, it needs to be removed in
order to ensure that only current and consistent data is kept in the operational
environment. The time horizon of data is not a set measure and will vary per business
requirement. The removal of data is referred to as archiving, and occurs periodically,
as dictated by organisational policies.
Many different levels of archiving exist, namely [Van12]:
Moving expired data to an archive database that is still accessible online.
Moving data from the archive database to flat file storage on mass data storage
mediums such as disks and tapes.
Moving the disk-stored data to external storage and possibly keeping it off site.
It needs to be determined beforehand what archived and historical data will be loaded
into the data warehouse in order to give enough information, but not to overload the
warehouse [Pon01].
iii. Internal data
Most departments within the organisation will keep track of the data that they, and only
they, use and produce. The organisation as a whole might not even be aware of the
existence of this data, but it could possibly be useful in the data warehouse
environment.
These are often stored in:
Documents
Spreadsheets
Departmental Databases
A departmental database is often referred to as a data mart, and will be discussed later.
The data in documents and spreadsheets presents the biggest challenge, as it needs to
be integrated by making use of applications to convert the data formats or, worse
still, manually [Pon01, Van12].
iv. External data
From a management perspective, an organisation’s overall position in the market can
only be determined when it is compared to its competitors or other influential forces in
its market environment. It is therefore essential to integrate data from external
sources into the data warehouse environment. These data sources are volatile, in the
sense that they can be updated hourly (stock markets), monthly (car sales) or annually
(financial results). Depending on when and where the data comes from, the loading of
external data needs to be planned carefully [Pon01, Van12].
4.4.2 Integrating the data
As mentioned in the previous section, the largest problem faced in building a data
warehouse is that the data from disparate sources is largely unintegrated. The first
problem is the matter of how to interface with all the data sources: if there are legacy
systems, new interface and translation applications might need to be developed. If the
operational data comes from multiple technology platforms, all these platforms will
need to be bridged in order to access and translate the data.
The second problem is integrating the data itself. As mentioned in the previous section,
data needs to be integrated across form, function and grain. Once a better
understanding of the data sources has been established, matters such as semantic data
translation can be addressed.
The third problem, according to Inmon, is to determine how to efficiently access all the
data sources and import the data to the data warehouse. The following is a suggestion of
how the data should be loaded [Inm05, Van12]:
Archival data – This is not always loaded, as organisations might decide that it is
of no use. If it is loaded, however, this happens only once, when the data
warehouse is created.
Internal data – If internal data from other sources needs to be loaded, special
arrangements must be made; this might be a once-off load or occur on a
continual basis.
Operational data – The loading of operational data occurs in two stages. The
first is to load the operational data environment (only that deemed usable for
data warehouse purposes) as it is when the data warehouse is created. The
second stage is the continual loading of updated operational data, which will be
discussed next.
External data – If the organisation decides that external data is to be loaded, the
same approach as for operational data will be followed, the only exception
being that the uploads might happen on an ad hoc basis, as the data becomes
available and is deemed useful to incorporate.
It needs to be mentioned that data is usually not pulled directly into the data warehouse
from the sources, but is usually loaded into a separate data preparation environment,
where the refinements mentioned above are performed. Once these are successfully
completed, the data is loaded into the data warehouse environment [Van12].
4.4.3 Loading Operational data
Operational data changes on an ongoing basis. A major problem, according to Inmon, is
determining which data has changed in relation to the data that is currently in the data
warehouse. This can become a serious performance bottleneck, as data sources are
often very large, because organisations store vast amounts of data to support their
operations.
It should be clarified that no direct updates are performed on the data in the data
warehouse; it is merely amended with new data from operational sources, and the old
data remains intact. The following techniques can be used [Inm05, Bul12]:
Timestamp – As stated in Section 4.3.4, data stored in operational environments
usually has a timestamp or some element of time associated with it.
Delta file – Some applications that relate to transactional processing might
create a delta file that logs the latest changes made in the operational data since
the last time the file was cleared. Delta files are efficient, since only items logged
in the file will be considered.
Log file – Most transactional applications will create a log file, which summarises
the activities done in a particular timeframe. A log file differs from a delta file in
the sense that it contains more metadata as it is often used for recovery
processes.
The log file might also have a very different and compressed internal file
structure.
Modify applications – Current applications might be modified to enhance the
data selection process, for example by producing a delta file. This is, however, a
high-risk activity, as tampering with old application code might lead to
unanticipated and disastrous ramifications. A good idea is to ensure that all
newly developed applications comply with data extraction requirements6.
Snapshot comparisons – As a last resort, when data is loaded a snapshot of the
state of the operational data can be taken. Before performing a needed import, a
new snapshot is taken of the current operational data, and these two are
compared to each other to determine which transactions differ. This is by no
means an efficient process and should be considered only when all other
attempts have not produced the desired results.
6 The suggestion to ensure that new applications adhere to data extraction requirements will not solve the legacy application problem. The lessons learnt from legacy applications that cannot serve the required data should be applied to designing new applications with reporting and data extraction in mind.
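The snapshot-comparison technique described above can be sketched as follows. The record layout and key names are illustrative only, not taken from any system discussed in this dissertation:

```python
def compare_snapshots(old, new):
    """Compare two snapshots of operational data, keyed by record ID.

    Returns the inserted, updated and deleted record IDs, i.e. the
    changes that must be applied to the data warehouse.
    """
    old_keys, new_keys = set(old), set(new)
    inserted = new_keys - old_keys
    deleted = old_keys - new_keys
    updated = {k for k in old_keys & new_keys if old[k] != new[k]}
    return inserted, updated, deleted

# Snapshot taken at the previous load versus the current operational state:
previous = {1: ("MVW Engineering", 25000.00), 2: ("STL Systems", 16500.00)}
current = {1: ("MVW Engineering", 27500.00), 3: ("STL Systems", 10000.00)}
ins, upd, dele = compare_snapshots(previous, current)
# ins == {3}, upd == {1}, dele == {2}
```

The comparison touches every record in both snapshots, which illustrates why the technique scales poorly and is recommended only as a last resort.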
4.5 Data warehouse and data models
There are two architectures that guide modern data warehouse design and
implementation: Inmon’s top-down approach and Kimball’s bottom-up approach.
4.5.1 Inmon’s top-down approach
Inmon’s model is one of the predominant approaches and is based on the creation or
pre-existence of an enterprise data model that represents the enterprise as a whole.
This model is nothing other than a classical normalised relational model, as used
within OLTP and described in the previous chapter [Lob09].
The main data warehouse model is enterprise-wide and is thus not well suited to
handling queries from specific departments. When departmentally specific data
has to be compiled frequently, it is best to create data marts to serve those specific
requirements [Lob09].
i. Data mart
A data mart can be defined as a single-subject subset of a data warehouse that is used
within departments for their analytical purposes only [Rob07].
There are two types of data marts:
Independent data mart – An independent data mart is one which is created by
the users within a certain department. There is no need to think globally when
creating one [Inm05].
Dependent data mart – A dependent data mart is one that is created from the
data within a data warehouse. Some planning is required, as the data pool
contains data from various departments. The designers need knowledge of the
larger data warehouse to create one [Inm05].
ii. Inmon’s data warehouse
Figure 4.7 depicts Inmon’s approach: in this instance dependent data marts are created
to aid ad hoc query performance. This approach is referred to as top-down because of
the constraint that an enterprise-wide model must already exist, or be created, before
the data warehouse’s inception [Lob09].
Figure 4.7: Inmon’s data warehouse architecture
The approach discussed in Section 4.5.2 should not be confused with the notion of a
data warehouse constructed from a combination of independent data marts; that
process is also referred to as “bottom-up”, but is still related to Inmon’s approach.
4.5.2 Kimball’s bottom-up approach
The fundamental problem with Inmon’s approach is the requirement that an enterprise-
wide model must be created before the data warehouse itself. This is an enormous
undertaking and often not feasible. A more favoured approach is to build and expand
the data warehouse incrementally. Kimball proposed an alternative to Inmon’s
approach, consisting of four key points, which will be discussed next [Lob09].
i. Dimensional modelling
Dimensional modelling is an approach based on the way in which the end user sees
data. The end user will always see data as it relates to various dimensions. A sale might
for example be related to a customer and a product or service. Kimball states that, to
achieve this view, data is divided into what he calls “measurements” and “context”
[Kim08].
A measurement is a tangible value, usually a numeric value or date. These
measurements are the core data captured in transactional systems; they are referred to
as facts and are stored in a fact table in the dimensional model [Kim08, Kim13].
A fact on its own is meaningless; it needs to be put into context. Kimball states that
context is captured as dimensions in the dimensional model, describing the "who,
what, when, where, why, and how" of each fact. The dimensions are stored in
dimension tables. Multiple dimension tables link to a fact table in order to give the
stored facts meaning [Kim08, Kim13].
ii. Star schema
As mentioned in Chapter 3, the process of normalisation is applied to tables in the
relational model in order to remove the replication of data, for transactional processing
performance and storage efficiency purposes. In data warehousing, the main goal is to
ensure that the data is stored in such a way as to model the way users perceive the data.
Relational tables are often denormalised into fact and dimension tables that contain
replicated data [Kim08, Ada10, Kim13]. Figure 4.8 shows a typical dimensional
structure, aptly referred to as the star schema, because the dimension tables
surround the fact table in such a way that the resulting structure resembles a
star. The fact and dimension tables are stored in a normal RDBMS and are indeed
relational tables. When the dimensional model is used to store data in an OLAP
multidimensional data environment, the resulting structure is called a cube [Kim08,
Ada10].
Figure 4.8: Dimensional modelling – star schema [Kim08]
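The star schema above can be illustrated with a small in-memory SQLite database. All table and column names below are illustrative sketches, not drawn from the dissertation’s prototype:

```python
import sqlite3

# A minimal star schema: one fact table surrounded by dimension tables.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, day TEXT);
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id    INTEGER REFERENCES dim_date(date_id),
    quantity   INTEGER,
    total      REAL
);
INSERT INTO dim_product VALUES (1, 'RFID Tag'), (2, 'SSD 300GB');
INSERT INTO dim_date VALUES (1, '27/04/2008'), (2, '28/05/2008');
INSERT INTO fact_sales VALUES (1, 1, 500, 10000.0), (1, 1, 300, 6000.0),
                              (2, 1, 2, 10500.0), (2, 2, 2, 10000.0);
""")

# Joining the fact table to its dimension tables gives the bare facts
# their business meaning (sales per product per day):
rows = db.execute("""
    SELECT p.name, d.day, SUM(f.total)
    FROM fact_sales f
    JOIN dim_product p ON f.product_id = p.product_id
    JOIN dim_date d ON f.date_id = d.date_id
    GROUP BY p.name, d.day
""").fetchall()
```

The facts (quantity, total) are meaningless numbers until the joins to `dim_product` and `dim_date` supply the "what" and "when".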
iii. Snowflake schema
In the dimensional model, a dimension is related to one table. Sometimes a dimension
table can become very large; an attempt to address this is to split out attributes of
the dimension table that are seldom accessed [Lan07]. If the creation of tables in a star
schema can be called denormalisation, then splitting up the dimension tables of the
star schema is indeed an attempt at normalising it, as seen in Figure 4.9. This should
only be done if business needs require such structures, as the multitude of joins that
must be performed to access the data in the split-up tables will hamper performance
[Lan07, Ada10].
Figure 4.9: Star schema showing derived dimension tables [Kim08]
iv. Kimball’s data warehouse
Figure 4.10 depicts Kimball’s data warehouse: the collection of all the star schemas
stored within the relational database. No separate data marts are created, because the
structure can process queries originating from various departments [Lob09].
Figure 4.10: Kimball’s data warehouse architecture
This is by no means a complete summary of the processes involved in creating and
maintaining a data warehouse. Many more aspects need to be considered to implement
a data warehouse environment, but these aspects are beyond the scope of this
dissertation. A more detailed discussion can be found in [Lob09, Kim08, Rob07].
4.6 OLAP and data warehouse
OLTP systems used for operational data are designed for optimal speed and throughput
of data. They are designed to capture operational data quickly and effectively, but are
not at all suited to analysing that data.
4.6.1 Data analysis example
To illustrate the statement made above, consider the example below [Rob07]:
Consider sales made to customers: A sale is always related to a customer and a
particular product or service and usually forms part of an invoice. An invoice will also
consist of invoice line items depicting the individual items purchased. Table 4.1 contains
the summarised information relating to a sale, listing a customer and the total spent,
which is used for an invoice summary. Table 4.2 contains the detailed transaction
breakdown, containing the products purchased, quantities and totals per product, which
will be used as line items on an invoice. In fact, the structure depicted below is exactly
the way in which most OLTP systems store this type of transactional data: as invoice
and invoice line pairs.
As mentioned in the section on data warehousing, data needs to be associated with
other data for it to make sense. In a real world environment, business analysts will often
relate the sales data presented above to customers and to products (i.e. to other
dimensions).
In Table 4.3 a multidimensional view of the sales data is presented. For the sake of
simplicity, only two dimensions, namely product and time, have been compared. The
third dimension, customer, can be thought of as linking product and time.
The data from the linear OLTP tables has been combined into summaries related to
each dimension, as represented in the body of Table 4.3. The total sales, total
quantity and grand total figures are aggregations that summarise the given data.
Table 4.1: OLTP Invoice Table
Invoice_Number (integer)  Date (date)  Customer (varchar2)  Total (double)
0001  27/04/2008  MVW Engineering  25,000.00
0002  27/04/2008  STL Systems  16,500.00
0003  28/05/2008  STL Systems  10,000.00
Table 4.2: OLTP Invoice Line Table
Invoice_Number  Line  Product  Price  Qty  Total
0001 1 RFID Tag 20.00 500 10,000.00
0002 1 RFID Tag 20.00 300 6,000.00
0002 2 SSD 300GB 5,250.00 2 10,500.00
0003 1 SSD 300GB 5,000.00 2 10,000.00
By storing data in a dimensional data structure such as the one presented, it is possible
for business analysts to ask questions relating to the various dimensions, such as:
“What were the total sales of RFID tags for May 2008?”.
If the customer dimension were visible in Table 4.3, one would be able to see which
customer was the top buyer of RFID tags. By answering such questions, the
customer market that will most likely generate the largest revenue can be identified,
and appropriate deals or specials can be structured accordingly.
Table 4.3: OLAP Dimensional Table (product and time dimensions; the customer dimension forms the third axis)
Product  27/04/2008  28/05/2008  Total Sales  Total Quantities
RFID Tag  16,000.00  1,600.00  17,600.00  880
SSD 300GB  10,500.00  10,000.00  20,500.00  4
Wifi Controller  -  6,000.00  6,000.00  12
Grand Totals  26,500.00  17,600.00  44,100.00  896
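The pivot from the linear OLTP rows into a product-by-time summary of the kind shown in Table 4.3 can be sketched as follows. The sketch uses only the invoice lines listed in Table 4.2, so its totals cover only those transactions:

```python
from collections import defaultdict

# Invoice line items from Table 4.2, flattened to (date, product, total):
lines = [
    ("27/04/2008", "RFID Tag", 10000.00),
    ("27/04/2008", "RFID Tag", 6000.00),
    ("27/04/2008", "SSD 300GB", 10500.00),
    ("28/05/2008", "SSD 300GB", 10000.00),
]

# Pivot the linear OLTP rows into a product x time summary,
# the same shape as the dimensional view of Table 4.3.
pivot = defaultdict(lambda: defaultdict(float))
for date, product, total in lines:
    pivot[product][date] += total

# Aggregations corresponding to the "Total Sales" column:
total_sales = {product: sum(per_day.values())
               for product, per_day in pivot.items()}
```

Each cell of the dimensional view is simply an aggregation over the matching OLTP rows, which is exactly the pre-aggregation an OLAP system stores ahead of time.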
4.6.2 OLAP overview
Online Analytical Processing (OLAP) systems are designed to handle the analysis of data
as presented in Section 4.6.1. OLAP may be used in conjunction with normal OLTP
systems (such as relational databases) or as an integrated part of the data warehouse
implementation. The differences between the OLAP system and the OLTP system will
now be discussed in more detail.
There are five core benefits that an OLAP system has over an OLTP system in terms of
data analysis [Sch10, Ada10, Cor12]:
Business-focussed multidimensional data – The first thing to note is that data
in an OLAP system is stored in a multidimensional data structure known as a
data cube. A cube will commonly be constructed around dimensions of data,
where each dimension represents a specific business-related view of that data.
The data is organised in such a manner that business-oriented analysis is easier to
achieve.
Business-focussed calculations – The aggregation of data involves making
informed and useful summaries of data. A core feature of OLAP is that the data is
pre-aggregated and all useful summaries are stored in the metadata inside the
system itself. In business terms these summaries often take the form of
calculations and their results. Thus, the results of many calculations are computed and
stored within the system.
Trustworthy data and calculations – There are often many data sources within
the organisation; it might be individuals working on their own spreadsheets or
even various reporting systems with their own databases. Due to the disparity of
the data and the calculations based upon it, varying results are obtained and
incorrect reports might be produced. An OLAP system thus serves as a central
repository of data upon which all calculations are based.
Speed of thought analysis – This is more commonly known as ad hoc querying,
meaning that data analysts are able to pose questions to and search through the
data stored in the OLAP data source. Due to the multidimensional nature of the
data and its pre-aggregation, searches can be performed across any
business dimension very quickly. Classical relational database systems are
unable to respond to queries of this nature without complex select statements
being formulated to bring the data together, meaning it is up to the database
developer to retrieve this information.
Self-service reporting – The business users are most familiar with their specific
domain, and have the domain experience needed to analyse the data. OLAP
systems allow the end user to easily create reports and views of the data they
require. They often include comprehensive dashboard tools as well. This
eliminates the need for the development team to continually create
reports/views of data at the implementation level.
4.7 Data Warehouse structures and NoSQL databases
Section 3.5 introduced the notion of NoSQL databases, which comprise alternative
operational databases such as document-oriented databases and key/value stores,
with MongoDB being one of the popular alternatives to the relational database. Data
warehouses predominantly make use of a star schema, which is essentially a
denormalised relational data structure. This requires the denormalisation of relational
structures into the star or snowflake schemas mentioned in Section 4.5.2.
Hadoop is an open-source platform developed to process very large datasets. A new
approach to data warehousing that is gaining traction is to make use of Hadoop
MapReduce to process the data within a MongoDB installation [Had13].
4.8 Conclusion
Chapter 3 gave an overview of operational data and the way in which it is stored in the
modern data environment. Chapter 4 introduced BI and the data warehouse
environment, showing why operational databases (such as the relational database) are
often not well suited to performing complex analysis of data. The reason is rooted in
the fact that relational databases are specifically designed for high-speed, failsafe and
succinct data queries for transactional processing in an operational environment.
Kimball’s and Inmon’s two different approaches to constructing a data warehouse,
namely bottom-up and top-down, were compared. The fundamental problem with
Inmon’s approach is that an enterprise-wide model must be in place (which implies a
holistic overview of all business units) before the warehouse structure can be
envisioned, a task often not feasible in large corporations. The approach envisioned by
Kimball suggests that a warehouse is built piece by piece by means of star schemas,
with the entire warehouse being the set of all sub-star-schemas. This approach is more
feasible, as each departmental data mart can also be integrated into a larger whole.
The document oriented database MongoDB can be used as an alternative solution to the
traditional data warehouse by making use of the Hadoop platform to process the data.
Regardless of which approach is taken to construct the warehouse, it will always
employ highly optimised multidimensional data structures. It is often supplemented
with an OLAP system that allows easier access to the views of dimensionality.
Multidimensional data is depicted in a star or snowflake schema and stored in
multidimensional data cubes within the data warehouse. Each side of a cube represents
a dimension of the data; the more dimensions, the more complex the cube, and with the
increase in complexity comes an increase in the processing power required to analyse
the data.
Chapter 4 has provided an overview of the tools and methods used by organisations to
analyse the relational data that they so heavily rely upon to service member requests.
By making use of these tools, the business units can draw conclusions by way of reports
in order to make business decisions. The problem is that the ETL processes used to
maintain the data warehouses, which drive the OLAP systems, are often run daily and
nightly, which means the data is always at best one day old.
The reports generated are all still information; they are nothing more than highly
processed transactional data. This is where data mining is introduced. Chapter 5 will
investigate data mining, its relation to data warehousing and will explain why it is
incorrect to refer to an OLAP system coupled to a data warehouse as a data mining
system.
Chapter 5
Data Mining
5.1 Introduction
The previous chapters suggested that there are discrete levels of data. Firstly, as
discussed in Chapter 3, data (as raw facts) can be stored in a database, and by making
use of a query language, these facts can be retrieved as processed information. The data
in OLTP systems is often overwhelming and contains system-derived data, meaning it
has little to offer the user beyond pure transactional data.
This is where BI steps in: as defined in Chapter 4, it is a set of tools used to gain
more information from data. By making use of multidimensional data analysis, a
greater perspective on the data can be gained, the user can ask more in-depth questions,
and some predictions can be made from the data in the data warehouse.
OLAP systems have the purpose of providing the user with more tools to analyse the
data presented in OLTP and Relational systems. The warehouse data is essentially
aggregated in multidimensional transactional fact tables maintained for the purpose of
analysis.
However, this is all still information as it is nothing more than highly processed data.
This is where data mining is introduced. Data mining often forms part of the larger
process of Knowledge Discovery in Databases (KDD) and will be discussed in Section
5.2.
Data mining is also a BI tool, and can be defined as the process of discovering hidden
information7 from existing data repositories. It is a collection of processes specifically
designed to derive highly valuable information, which when viewed in the context of the
domain it was mined in, provides knowledge and insight to the user.
7 Hidden information can be referred to as information inside known repositories which requires processing in order to be extracted in a meaningful manner. Without the processing this information could have little value to the owners, however after it has been manipulated it could provide greater insight and hence be more valuable.
5.2 Knowledge Discovery in Databases (KDD)
The KDD process is used to derive knowledge from the existing data in databases. Data
mining should not be seen in isolation, as it is actually a sub-process of the larger
Knowledge Discovery Process (KDP) discussed in this section.
5.2.1 The enterprise KDP environment
In order to gain a clear perspective of the KDP, Figure 5.1 shows the typical layout of an
industry-based data mining environment.
Figure 5.1: KDP in the production environment [Bhus10]
Figure 5.1 provides an overview of the KDP within an organisation. As
mentioned in Chapter 4, there can be multiple data sources from which to compile
target data. The organisation cannot rely on its internal data alone; it will be the
responsibility of the BI team8 to decide how and which data can be included to produce
the most relevant results [Zan99, Vog06, Tan08, Mai08, Bhus10, Bra13].
5.2.2 The source data
The source data usually does not comply with a uniform standard, as the production
environment is often relational or object-oriented. The data warehouse environment is
multidimensional, and the other sources might be Rich Site Summary (RSS) feeds,
spreadsheets or raw text file data.
8 A BI team is a dedicated team within an organisation (usually part of the organisational MIS structure) whose purpose is to facilitate operations related to BI (see page 60) such as data mining.
Unprocessed data, as mentioned in the source above, often suffers from a number of
defects, namely incompleteness and noise. Data captured from several inconsistent
systems or sources could contain incorrect or inconsistent values, or even missing data.
The source data import process cannot always be automated entirely as it requires the
BI team to ensure that the operational and data warehouse data is accurate, complete,
consistent and interpretable. Methods such as sampling can aid in the detection of
invalid data, and quantitative analysis can be used to detect noisy data. A more
in-depth discussion is given in Section 5.3.2 [Kant10, Bra13].
5.2.3 The knowledge base
The knowledge base contains domain-specific data that guides the data mining engine
and provides insight into whether a result is worthwhile. This is, however, often only
available in mature environments where the BI team is well experienced [Bhus10].
5.3 Overview of the data mining process
Figure 5.1 provided a high-level overview of knowledge discovery in an enterprise
environment (Section 5.2 overviewed the process). This section introduces the actual
data mining process9 as depicted in Figure 5.2.
9 The data mining process is often controlled by the data mining engine.
Figure 5.2: The data mining process [Ode05]
5.3.1 Determining goals
The most important part of the data mining process is to first and foremost determine
the goals. This step is often overlooked as most BI teams just decide which data to
incorporate and run the mining process based on that. However, if proper planning is
done to determine what knowledge is to be gained, the data can be more appropriately
selected. For example, if the goal is to determine why different demographic regions
have different sales figures, a number of questions need to be asked [Ode05]:
Which products are being compared?
Which regions are being looked at?
Which factors are being considered (socio-economic strata, geographic locations,
etc.)?
Once this has been determined, more detailed decisions can be made on which data
needs to be included: sales data, customer data or financial indicators for the time
period. Sales and customer data might be readily available in the enterprise
environment, but financial indicators could be external information that needs to be
incorporated [Ode05, Bra13].
5.3.2 Data selection
As discussed, data can be gathered from various sources, and it will be up to the BI team
to decide which data should be included in the process. As seen in Figure 5.2, a good
approach is first to incorporate the operational data, which, in this context, is any
enterprise data, including the operational and data warehouse environments. The
phrase “garbage in, garbage out” has never had a clearer meaning than in the data
mining process. If the target data is of low quality, the entire data mining process is of
little or no use, and the knowledge gained might not add any value.
Figure 5.3 depicts the sources of data for the data mining process. The BI team might
decide to include data from the data warehouse sub-systems, the operational databases
and external sources. Figure 5.2 shows that external data is often incorporated only
after the enterprise data has been cleaned and corrected.
Figure 5.3: Data selection
Chapter 4 introduced the data warehouse and stated that one of its primary
characteristics is that it is an integrated environment. Depending on the organisational
needs or the requirements agreed upon in the goal selection process, most of the data
might be derived from the data warehouse environment. This has the advantage that
most of the data should (to a large extent) be clean and consistent. For this reason, it is
often the case that enterprise data mining systems are built on top of existing data
warehouse environments. Figure 5.1 (see page 83) suggests that the data mining engine
interacts with the pre-processed data and is often referred to as the data warehouse
server.
a. Cleaning data
Cleaning data from dispersed locations often comprises filling in missing values,
resolving inconsistencies, dealing with duplicate data, and smoothing noisy data and
outliers.
As stated in Section 5.2.2, there is no fully automated approach to data cleaning. When
the source data is dirty, human intervention will be required to clean the data
before it is incorporated into the target data. The BI team should use the following
approaches when the data is prepared:
i. Missing data
Missing data poses a unique problem, in the sense that it indicates a definite problem in
the data-capturing phase. The best approaches to take for data mining purposes are
[Meh03]:
Use an attribute median.
Use a class median (the problem here is that the attribute would need to have
been associated with a class already).
Use regression to determine the most likely value.
Derive the attribute from other data.
If none of these can be applied to the field, it is best to ignore the record as a whole.
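The attribute-median approach listed above can be sketched as follows, assuming records are simple dictionaries in which None marks a missing value (an illustrative representation, not the dissertation’s own):

```python
import statistics

def impute_attribute_median(records, attr):
    """Fill missing values of one attribute with the attribute median.

    `records` is a list of dicts; a missing value is stored as None.
    """
    observed = [r[attr] for r in records if r[attr] is not None]
    median = statistics.median(observed)
    for r in records:
        if r[attr] is None:
            r[attr] = median
    return records

rows = [{"age": 34}, {"age": None}, {"age": 40}, {"age": 28}]
impute_attribute_median(rows, "age")
# The missing age becomes the median of the observed values, 34.
```

The class-median and regression approaches follow the same pattern, except that the candidate values are restricted to records of the same class, or predicted from the other attributes.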
ii. Resolving inconsistencies
Chapter 3 stated that inconsistent data can quickly become a problem in database
systems, with a major factor being erroneous data entries. As seen in Chapter 4, data
integration also poses a problem, as inconsistent data can arise from disparate sources.
Thus, if proper database practices are maintained in the enterprise, with best-effort
attempts at reducing inconsistencies, the only inconsistent data at this stage should
stem from the inclusion of external data. This external data is best converted and
integrated by using the same principles as for the data warehouse, as discussed in
Section 4.3.2 [Ode05, Bra13].
iii. Duplicate data
The process of integrating data will often reveal duplicate data records. This can be the
result of records having many similar fields, with only one field differing. The
approach is to determine to what extent the records are similar (by comparing fields)
and then to merge the duplicate entries into one record. Inconsistent data will often
arise, as this is a form of integration, and this needs to be handled appropriately.
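A minimal sketch of this field-comparison approach follows. The similarity threshold is an illustrative tuning choice, and a real system would refer conflicting merges to a data expert rather than resolve them automatically:

```python
def field_similarity(a, b):
    """Fraction of fields on which two records agree."""
    keys = set(a) | set(b)
    matches = sum(1 for k in keys if a.get(k) == b.get(k))
    return matches / len(keys)

def merge(a, b):
    """Merge record b into record a, keeping a's value where both exist."""
    merged = dict(b)
    merged.update({k: v for k, v in a.items() if v is not None})
    return merged

r1 = {"name": "STL Systems", "city": "Johannesburg", "phone": None}
r2 = {"name": "STL Systems", "city": "Johannesburg", "phone": "011-555-0100"}
if field_similarity(r1, r2) >= 2 / 3:   # threshold is a tuning choice
    r1 = merge(r1, r2)
# r1 now carries the phone number from the duplicate record.
```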
Figure 5.4: Visual representation of outliers
iv. Outliers and noisy data
An outlier can be defined as a record that differs considerably from the other records
in its group or class, and is classified as noisy data. The approach followed with outliers
is first to detect them and then to have a data expert determine whether the value is
wrong. A wrong value needs to be removed or adjusted, unless it actually contains
valuable information that must be kept [Agg12, Bra13].
As seen in Figure 5.4, the outliers are detected by determining that some data does not
form part of groups. This implies that outlier detection most commonly involves data
clustering. This concept will be explained in Section 5.4.
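As a simple stand-in for the cluster-based detection described above, a distance-based sketch flags records with too few nearby neighbours as outliers; the radius and neighbour thresholds are illustrative assumptions:

```python
import math

def distance_outliers(points, radius, min_neighbours):
    """Flag points with too few neighbours within `radius` as outliers.

    Points that fall inside a dense group have many close neighbours;
    points far from every group, as in Figure 5.4, have almost none.
    """
    outliers = []
    for i, p in enumerate(points):
        neighbours = sum(
            1 for j, q in enumerate(points)
            if i != j and math.dist(p, q) <= radius
        )
        if neighbours < min_neighbours:
            outliers.append(p)
    return outliers

group = [(1, 1), (1, 2), (2, 1), (2, 2)]   # a dense cluster of records
points = group + [(10, 10)]                # one far-away record
outliers = distance_outliers(points, radius=2.0, min_neighbours=2)
# outliers == [(10, 10)]
```

A clustering algorithm refines this idea by first finding the groups themselves and then flagging the records that belong to none of them.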
5.3.3 Data pre-processing
Once data has been cleaned and integrated into the target data for the data
mining process, it still needs to be transformed in order for the data mining algorithm
to process it effectively. Typical pre-processing tasks include classification, clustering
and dimensionality reduction. These tasks will be explained in Section 5.4.
Figure 5.5: Linear task-oriented overview of the data mining process
5.3.4 Data mining
Once the data has been transformed into the appropriate format, the data mining
algorithm can be used to extract patterns and derive rules. Section 5.6 provides some
detail on how these rules are derived.
5.3.5 Using the knowledge
Once patterns have been established, post-processing is applied to produce meaningful
rules and reports. The rules and reports provide new knowledge for the end users to
help them with their business decision making.
5.4 Data pre-processing tasks
As mentioned in Section 5.3, data is firstly cleaned and translated into the proposed
target data for input into the data mining process [Dat05]. The author wants to stress
that target data is often large, necessitating the data interaction layer (as depicted in
Figure 5.1 on page 83). If the warehouse data is already well formatted, this middle
layer will be used as on-line access to that data. This
implies that only data that really needs to be cleaned and corrected will be transferred
to the temporary data preparation environment. If a data warehouse environment
exists, the data standards used in the transfer process will be the norm for all other data
required for the data mining process [Han10, Agg12, Bra13].
An important standpoint taken in this dissertation is that the task of cleaning data,
although seen as a part of pre-processing by other authors [Dat05, Fre05, Gup10], is
part of acquiring the target data and not of the pre-processing of the data to conform to
data mining requirements. The pre-processing task is expensive and thus the BI team
should firstly investigate whether or not the target data is suitable and clean before
proceeding. The following section summarises the techniques used to refine the data
for this purpose:
5.4.1 Data denormalisation
In Chapter 3, the relational database model with normalisation as a key concept was
discussed. A table was transformed into 1st, 2nd and 3rd normal form during the
normalisation process. Data warehousing, covered in Chapter 4, referred to fact tables,
a table of data that denormalised relational data from multiple tables into a single table.
Denormalisation is thus the intentional process of introducing redundancies into the
data in order to speed up performance for data mining tasks. Denormalised structures
are faster because they avoid the complex table joins that must be performed in a
relational database. Data in the data warehouse environment should already be
denormalised and ready for use. Relational and external data that were transferred to
tables in the mining data staging environment should be subjected to denormalisation.
A clear understanding of the data is required, and this is most commonly still a
user-driven process [Agg12, Bra13].
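The flattening of normalised tables into a single join-free table can be sketched as follows; the tables, field names and values are invented for illustration.

```python
# Hypothetical normalised (3NF) tables: customers and their orders.
customers = {
    1: {"name": "Acme Ltd", "region": "Gauteng"},
    2: {"name": "Beta CC",  "region": "KwaZulu-Natal"},
}
orders = [
    {"order_id": 10, "customer_id": 1, "amount": 500.0},
    {"order_id": 11, "customer_id": 1, "amount": 120.0},
    {"order_id": 12, "customer_id": 2, "amount": 900.0},
]

# Denormalise: repeat the customer attributes on every order row, so the
# mining algorithm reads one flat table and no join is needed at mining time.
flat = [
    {**{"order_id": o["order_id"], "amount": o["amount"]},
     **customers[o["customer_id"]]}
    for o in orders
]
```

The customer attributes are deliberately duplicated across rows: that redundancy is the point of denormalisation.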
5.4.2 Clustering
Clustering is the process of detecting subsets of similar data. The exact nature (or
structure) of the data is not known, therefore we do not know how to classify the data
as part of a specific class, and hence the aim is to group it into clusters which have
similar characteristics [Wan05, Rui09, Eve11, Kog13]. There are three general methods
of producing clusters, namely distortion-based clustering, density-based clustering and
divisive and agglomerative clustering:
a. Distortion-based clustering
In distortion-based clustering, K-Means is the predominant algorithm. In this approach
both the number of potential clusters, c, and the sample data set, D, are given. The goal is
to group the elements of D into the cluster vectors γ_1, …, γ_c. The aim is then to minimise
the squared Euclidean distance10 between a cluster mean and the potential cluster
elements, producing cluster vectors which group elements [Hul04, Gan07, And06,
Eve11, Agg13].
i. General K-Means algorithm
The basic K-Means clustering algorithm can be performed in the following
manner:
1. initialise γ_1, …, γ_c
2. do for each ϑ ∈ D
3. calculate argmin_i ‖ϑ − γ_i‖, i = 1 … c
4. add ϑ to γ_i if the minimum distance < ε
5. recalculate ∀i: γ_i ← avg(all samples ∈ cluster i)
6. until no change in γ_i, ∀i
10 Euclidean distance refers to the distance between two known points.
The c cluster vectors γ_1, …, γ_c are initialised containing zero elements and randomised
cluster centres. Thereafter the Euclidean distance between each element ϑ of the data
set D and the current cluster vector γ_i is calculated. If the error is smaller than an
acceptable value ε, the element is added to the current cluster vector. In line 5 the
cluster mean is calculated over all samples already in the current cluster, against which
the next iteration is compared [Hul04, Eve11, Kog13].
ii. Mean squared error
Line 5 refers to the minimisation of the mean squared error (MSE) between the cluster
mean and the potential cluster elements, as given below:

MSE = Σ_{i=1..c} Σ_{ϑ∈D} M_i(ϑ) ‖ϑ − γ_i‖²

where M_i(ϑ) = 1 if i = argmin_k ‖ϑ − γ_k‖, and 0 otherwise.

This is referred to as the closest cluster prototype for the Euclidean distance and results
in quantisation regions – the data set D is partitioned into non-overlapping regions
[Eve11, Agg13].
Figure 5.6: Clustering example [Hul04]
Figure 5.6 visually depicts the output of this clustering process. Data is grouped
according to how well it fits certain clusters’ characteristics (or profile). Cluster
dimensions are usually spherical around the centre or mathematically grouped around
the mean for that cluster [Hul04, Kog13].
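The K-Means procedure above can be sketched in Python. This is a minimal illustration of the standard variant, in which every sample is assigned to its nearest centre (rather than gated by the acceptance threshold ε of the listing); the sample points and random seed are invented.

```python
import random
from math import dist

def k_means(data, c, eps=1e-9, seed=0):
    """Minimal K-Means sketch: assign each sample to its nearest cluster
    mean, recompute the means, and repeat until the means stop moving."""
    rng = random.Random(seed)
    gammas = rng.sample(data, c)                 # initial cluster centres
    while True:
        clusters = [[] for _ in range(c)]
        for x in data:
            i = min(range(c), key=lambda i: dist(x, gammas[i]))
            clusters[i].append(x)                # nearest-centre assignment
        # recompute each mean; keep the old centre if a cluster is empty
        new = [tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else g
               for cl, g in zip(clusters, gammas)]
        if all(dist(a, b) < eps for a, b in zip(gammas, new)):
            return new, clusters
        gammas = new

points = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9)]
centres, clusters = k_means(points, c=2)
```

On these four points the algorithm converges to two clusters of two points each, with centres near (0.1, 0.05) and (5.1, 5.0).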
b. Density-based clustering
In density-based clustering no assumptions are made about the number or shape of the
clusters; instead, density estimates are calculated at various data points in the data
set D. The hill-climbing (gradient ascent) approach is a popular approach [Hul04].
i. Hill-climbing approach
1. do for each element ϑ ∈ D
2. search for a point of higher density within a specific
distance K from ϑ
3. move to that point
4. until no further progress is possible
The resulting point for each element ϑ is called the locus of a local density peak, and the
set of all these loci identifies the clusters. A data point ϑ which climbs to the same
density peak as one of these clusters is considered an element of that cluster [Hul04,
Kog13].
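A one-dimensional sketch of this hill-climbing idea follows. The Gaussian kernel density estimate, the bandwidth h, the neighbourhood distance k and the sample values are illustrative assumptions, and the climb moves between data points rather than along a continuous gradient.

```python
from math import exp

def density(x, data, h=0.5):
    """Simple Gaussian kernel density estimate at point x."""
    return sum(exp(-((x - d) / h) ** 2) for d in data)

def climb(x, data, k=1.0):
    """Hill-climb from x: repeatedly move to the neighbouring data point
    (within distance k) with the highest density estimate."""
    while True:
        best = max((d for d in data if abs(d - x) <= k),
                   key=lambda n: density(n, data))
        if density(best, data) <= density(x, data):
            return x                     # locus of the local density peak
        x = best

samples = [1.0, 1.2, 0.9, 5.0, 5.1, 4.9]
loci = {climb(s, samples) for s in samples}   # one locus per density peak
```

All three points of each group climb to the same locus, so the set of loci identifies the two clusters.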
Figure 5.7: Visual representation of hill-climbing clustering [Hul04]
The image on the left of Figure 5.7 shows how cluster loci were identified using the hill-
climbing approach; the image on the right is a top-down view of the clusters and the
cluster regions which govern the elements for this data set.
c. Divisive and agglomerative clustering
In divisive clustering the data set is progressively sub-divided until all identified
clusters have been established. Figure 5.8 depicts how, even inside the identified
cluster, the data is divided into smaller clusters. In agglomerative clustering, clusters
are progressively merged as shared characteristics reveal that they have aspects in
common with other clusters [Hul04, Gan07, Eve11, Agg13].
Figure 5.8: Divisive clustering [Hul04]
Section 9.2 provides an overview of how both numeric and non-numeric data is
transformed for use in the Euclidean calculation of distances for clustering purposes.
5.4.3 Classification
Classification refers to the arranging of data into specific classes. The data within a class
is similar or shares some relation. The goal is to find a classifier that will map data to a
class, labelling it as part of that class. The label of data in a specific class distinguishes it
from data in other classes. The difference between classification and clustering is that in
classification there are pre-determined classes of data (known as labels), and one
searches for elements that fit into a class [Wan05, Ana06, Han11, Giu12, Led13].
a. K-Nearest Neighbour (KNN) classification
Provided with a dataset D and a set of labels L = {(γ_i, label_i)} defined for this dataset,
the algorithm to label data elements from that set is given as [Hul04, Giu12, Led13]:
1. for each unlabelled element ϑ ∈ D
2. calculate the K closest labelled elements of D
3. label ϑ with label_i according to the majority of
neighbours with corresponding labels
Figure 5.9: Visual depiction of K-Nearest Neighbour classification
Figure 5.9 depicts how a data element (or sample) of the dataset which does not yet
have a label is given one. In this example, seven of the ten nearest elements searched for
share the same class. The new element X will therefore be given the same label as those
seven.
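The majority-vote labelling can be sketched as follows; the labelled points, the choice of k = 3 (rather than the figure's 10) and the use of Euclidean distance are illustrative assumptions.

```python
from math import dist
from collections import Counter

def knn_label(sample, labelled, k=3):
    """Label a sample by majority vote among its k nearest labelled points."""
    nearest = sorted(labelled, key=lambda p: dist(sample, p[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

labelled = [((0.0, 0.0), "A"), ((0.1, 0.2), "A"), ((0.2, 0.1), "A"),
            ((3.0, 3.0), "B"), ((3.1, 2.9), "B")]
new_label = knn_label((0.05, 0.1), labelled)   # majority of the 3 nearest
```

Sorting the whole labelled set is O(n log n) per query; real implementations would use a spatial index, but the voting logic is the same.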
5.4.4 Dimensionality Reduction
Dimensionality of data refers to the number of components that describe the data and in
a relational database it is the number of columns in a table [Cun08, Car12b]. From a
mathematical perspective we define dimensionality reduction as:
For the n-dimensional variable ψ = (ψ_1, …, ψ_n) we seek a lower-dimensional
representation ζ = (ζ_1, …, ζ_z) where z < n. The content of the original data is
represented by ψ. The n components of multivariate vectors such as ψ are often called
features, attributes or variables [Imo02, Wan12].
By way of an example, we may look at nature and envision the birds in an estuary. We
know that we are examining birds and hence we may decide to exclude certain features
that distinguish them from other animals. We may decide to ignore common features
like “has 2 legs”, “has one beak” and “almost all species can fly” in favour of features that
we may use to distinguish birds from each other. We might end up with a set such as:
“length of beak”, “colour of feathers”, “wingspan” in order to classify certain birds. We
could use the “length of beak” in combination with “wingspan” to infer if a specific bird
eats nectar from flowers or if it is a bird of prey.
a. High-dimensional multivariate data
High-dimensional multivariate data is subject to the curse of dimensionality, a term
coined by Bellman in 1957. It states that as the number of input variables increases, it
becomes exponentially more difficult to fit data models or to optimise an objective
function in the search space. In terms of data mining, the more features there are, the
more difficult it becomes to extract meaningful data [Tia09, Cun08, Kog13].
High-dimensional data often tends to be noisy, meaning that there are unnecessary
variables describing the data. These complicate the data model and translate to
increased resource requirements to process. The dimensionality of data refers back to
the classification and clustering problems posed at the beginning of this chapter. The
classifier needs high classification accuracy and in order to achieve that it needs to work
with a simple data model. Therefore, the execution of proper dimensionality reduction
is important before any of the other data mining techniques may be applied [Ana06,
Tia09, Kay10, Car12b, Kog13].
b. Dimensionality reduction with the goal of classification: the curse of dimensionality
In order to properly explain what happens when a classifier attempts to work with a
sub-optimal highly dimensional dataset, consider the following toy problem:
There are 3 classes of objects that need to be identified in a single dimensional search
space (i.e. there is one feature, x_1, describing them), as seen in Figure 5.10.
Figure 5.10: Class identification in one dimension [Pas07]
The simplest approach would be to make use of binning, splitting the search space into
three equal bins, and then to compute the ratio of example objects in each bin in order
to determine the dominant class in each bin. In the example above, the predominant
classes for each bin are: red, green and blue with two counts in each bin [Pas07, Zam08,
Wan12].
However, the user detects that the overlap of classes in the bins is high, thus the
introduction of a second feature by which to identify the objects might help.
Figure 5.11: Class identification in two dimensions [Pas07]
Figure 5.11 depicts the introduction of a second feature, x_2. The number of bins is now
nine. The image on the left shows a scenario with the same object density per bin as in
Figure 5.10, while the image on the right indicates a scenario that has the same number
of objects in the search space [Pas07, Zam08, Wan12].
The trade-off is to decide what ratio between density and example count to use. With
constant density the number of examples becomes 3³ = 27; a low example count, on the
other hand, produces a sparse matrix.
Figure 5.12: Class identification in three dimensions [Pas07]
Figure 5.12 shows the same class identification problem but with a third feature,
x_3, added to the domain. Visually this results in 27 bins and a constant-density example
count of 3⁴ = 81, or a very sparse 3D matrix if the same number of examples is used.
c. Problems pertaining to high-dimensional datasets:
The following are some problems that arise when working with high-dimensional data [Pas07, Wan12]:
The number of examples per variable increases exponentially as the number of
variables increases.
There is an immense increase in volume by adding a higher dimensional feature – consider the jump in the examples between the 2D (27 samples) and the 3D (81 samples) spaces.
The higher the dimension, the more difficult it becomes to distinguish samples.
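The exponential growth described above can be verified with a small calculation, assuming b = 3 bins per feature and a constant density of three examples per bin, consistent with the 27- and 81-sample counts of the toy problem:

```python
def bins(d, b=3):
    """Number of bins when each of d features is split into b intervals."""
    return b ** d

def examples_needed(d, b=3, per_bin=3):
    """Examples required to keep a constant density of per_bin per bin."""
    return bins(d, b) * per_bin

growth = [(d, bins(d), examples_needed(d)) for d in (1, 2, 3)]
# (dimension, bins, examples): the example count multiplies by b per feature
```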
5.5 Dimensionality reduction techniques
In dimensionality reduction, there are two approaches that can be followed, namely
linear and non-linear methods. Linear methods are often referred to as feature
extraction, whereas non-linear methods are referred to as feature selection.
5.5.1 Feature extraction (linear)
The most common linear dimensionality reduction technique is Principal
Component Analysis (PCA) [Chu05]. Referring to the datasets defined previously, one
has the following linear transformation:

ζ = ψA

where ψ is the original x-by-y data matrix and A the y-by-z transformation matrix,
which will project ψ onto the z-dimensional subspace ζ, with the implied constraint that
z < y [Chu05].
The construction of matrix A is thus the goal. A Principal Component (PC) is an
orthogonal linear combination of the original variables which satisfies some variance
constraint, and in the case of matrix A, a PC represents one column [Imo02, Boz03,
Ban05, Wan12].
The first PC, c_1, is the one with the largest variance and can be defined as:

c_1 = ψ_1^T ψ

The y-dimensional coefficient vector ψ_1 for c_1 is defined as:

ψ_1 = (ψ_{1,1}, …, ψ_{1,y})

where ψ_1 must also solve the following constraint:

ψ_1 = arg max_{‖ψ_1‖=1} var{ψ_1^T ψ}
The second PC will be the linear combination with the second largest variance that is
also orthogonal to c_1. Continuing this pattern, matrix A can be constructed from y PCs;
hence there are as many PCs as original variables, each successive PC having less
variance [Imo02, Boz03].
For very large datasets the first few PCs often define most of the variance in the set and
the others may be disregarded. Using this approach, a much leaner set of data can be
built, which explains most of the data from the original dataset.
The calculated variance, however, often depends on the magnitude of the variables (i.e.
how similar in measurement the data is to begin with). It is the norm to first standardise
the data to have a standard deviation of one and a mean of zero.
Thus the covariance matrix Σ is:

Σ_{y×y} = (1/n) ψ^T ψ

and can be used in conjunction with the spectral decomposition theorem to derive Σ as:

Σ = U Λ U^T

where Λ = diag(λ_1, …, λ_y) is a diagonal matrix of ordered eigenvalues, λ_1 ≥ … ≥ λ_y,
and U is the orthogonal matrix made up of the corresponding eigenvectors. The first z
columns of U, written U_z, represent the derived projection matrix A [Imo02, Boz03].
The principal components can be derived in the following way:

ζ = U_z^T ψ

where ζ represents the z-dimensional subspace onto which ψ is projected. ζ is now
regarded as a reduced dataset upon which data mining tasks can be carried out [Imo02,
Boz03, Ban05, Wan12].
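The PCA recipe above (centre the data, build the covariance matrix, eigendecompose, project) can be sketched for the two-variable case, where the leading eigenvector of a symmetric 2x2 matrix has a closed form; the data points are invented and only the first PC is kept (z = 1).

```python
from math import sqrt

def pca_first_component(data):
    """PCA sketch for two-variable data: centre, build the 2x2 covariance
    matrix, take its leading eigenvector and project onto it."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    centred = [(x - mx, y - my) for x, y in data]
    # covariance matrix [[a, b], [b, c]]
    a = sum(x * x for x, _ in centred) / n
    b = sum(x * y for x, y in centred) / n
    c = sum(y * y for _, y in centred) / n
    # largest eigenvalue of the symmetric 2x2 matrix
    lam = (a + c) / 2 + sqrt(((a - c) / 2) ** 2 + b ** 2)
    # its (unit) eigenvector is the first principal component
    if b != 0:
        vx, vy = b, lam - a
    else:
        vx, vy = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = sqrt(vx * vx + vy * vy)
    u = (vx / norm, vy / norm)
    # projecting each centred sample onto u yields the reduced 1-D dataset
    return [px * u[0] + py * u[1] for px, py in centred], u

data = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.9)]
reduced, first_pc = pca_first_component(data)
```

For higher-dimensional data the same steps apply, but a numerical eigendecomposition routine would replace the closed-form 2x2 solution.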
5.5.2 Feature selection (non-linear)
Feature extraction essentially applies a mapping to the original dataset to derive a new
smaller set. Feature selection aims to find a subset of the original dataset by evaluating
the relevance of each variable from the original set [Sik07, Khu08, Tan09, Wan12].
Hence heuristic methods form the essence of this technique [Hus02]. These methods
will be investigated in Chapter 8 of this research.
The following section describes the data mining algorithm’s rule extraction phase,
which uses the features as derived by the techniques introduced in Section 5.5 as an
input.
5.6 Rule Extraction
After the data pre-processing tasks have been completed, the data that must be mined is
in an appropriate format for the data mining algorithm to run effectively. A data mining
algorithm can also be classified as a search algorithm, as it executes a process known as
rule extraction. In essence it builds a Boolean expression tree comprised of IF-THEN-
ELSE statements, which represents the rules applicable to the data [Wan05, Wit05,
Ant09].
Rule extraction is a process that can be broken up into two distinct processes [Agg12,
Bra13]:
Searching for rules based on output classes in a dataset containing well-defined
labels.
Searching for rules in datasets that do not contain labels; this process is defined
as association rule mining.
The two processes listed above form part of the rule mining and extraction part of the
KDP and they will be discussed in the following section.
5.6.1 Association Rule mining
Association rules are the mathematical expressions of IF-THEN-ELSE statements and
are most commonly used to process transactional data. The expression 퐴 ⇒ 퐵 implies
that if a transaction 푇 ∈ 퐷 (where D is the transactional dataset) contains A then it will
also contain B [Gho04, Wan05, Vog06, Kuo09, Raj11, Bra13].
By way of an example, an association rule derived from online shopping transactions
could show that people who bought a laptop were also likely to buy a laptop bag or
peripherals such as a mouse. The derived rule could thus be used to improve the
structure of the online store, for instance to advertise these items in the laptop section
or to bundle them with laptops as part of deals.
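The rule A ⇒ B is usually quantified by its support and confidence; a small sketch over invented shopping baskets follows (the support and confidence definitions are standard; the transactions are hypothetical).

```python
transactions = [
    {"laptop", "laptop bag", "mouse"},
    {"laptop", "laptop bag"},
    {"laptop", "mouse"},
    {"phone", "charger"},
]

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(a, b):
    """Confidence of the rule A => B, i.e. P(B | A)."""
    return support(a | b) / support(a)

conf = confidence({"laptop"}, {"laptop bag"})   # 2 of the 3 laptop baskets
```

Algorithms such as Apriori enumerate candidate itemsets efficiently, but each candidate rule is still scored with exactly these two measures.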
5.6.2 Rule extraction algorithms
Many rule extraction algorithms have been proposed, including ones based on neural
networks, fuzzy logic systems and Genetic Algorithms. Regardless of which algorithm is
chosen for the extraction process on a specific dataset, the following pre-conditions
apply [Gho04, Wan05, Wit05, Ols08, Bra13]:
The dataset must be pre-processed as described earlier in this chapter.
The values of the dataset should be normalised, meaning that they should fall
within a defined range. For most algorithms, the range [0,1] is used.
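Rescaling values into the range [0,1] is commonly done with min-max normalisation; a small sketch follows (the input values are invented).

```python
def min_max_normalise(values, lo=0.0, hi=1.0):
    """Rescale values linearly so that they fall within [lo, hi]."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        return [lo for _ in values]     # constant column: nothing to spread
    return [lo + (v - v_min) * (hi - lo) / (v_max - v_min) for v in values]

ages = [18, 25, 40, 62]
scaled = min_max_normalise(ages)        # smallest maps to 0.0, largest to 1.0
```

Other schemes (e.g. z-score standardisation) exist, but min-max is the simplest way to satisfy a fixed-range precondition.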
5.7 Conclusion
Chapter 5 introduced the concept of data mining, why it is used and how it relates to the
other data warehouse activities such as OLAP analysis. Data mining is the process of
transforming raw information into knowledge. The output of data mining is knowledge
in the form of tangible rules, and these rules can be used to discover what trends,
relations and hidden facts are present in the data.
It showed that data from the data warehouse and other sources needs to be
denormalised and restructured for a data mining algorithm to be applied to it. The task
of data pre-processing is a three-part process:
i) classification which determines if the data is part of a specific class,
ii) clustering which aims to classify data that could not be mapped to a class, and lastly,
iii) dimensionality reduction.
Dimensionality reduction is the most important part of the process, as it reduces the
original dataset to a smaller subset and allows for a normalised dataset.
The data mining process is often an afterthought for many organisations, as they
attempt to collate data from various sources only when the need arises. Data from
relational databases and highly structured data from data warehouses are selected as
target data; this data needs to be transformed, and then the pre-processing tasks can be
run on it before any data mining algorithms can derive rules. Chapter 8 will introduce
the EPS4DM model, which addresses this notion of data mining as an afterthought: it
incorporates all three pre-processing tasks, as well as data extraction from a relational
database, in order to maintain pre-processed data. This allows the data mining
algorithms to have a repository of pre-processed data at their disposal.
Chapters 6 and 7 will investigate in detail the way in which Intelligent Agents and
Computational Intelligence (CI) based methods will be applied in the EPS4DM model’s
pre-processing tasks.
[Dissertation roadmap: 1. Introduction; 2. Data storage evolution; 3. Database models;
4. Data warehousing; 5. Data mining; 6. Intelligent Agents; 7. Computational
intelligence; 8. Data pre-processing system (EPS4DM Model); 9. Data extraction and
denormalisation; 10. Clustering; 11. Classification; 12. Implementation and Results;
13. Conclusion and future work]
Chapter 6
Intelligent Agents
6.1 Introduction
Intelligence is not an easily defined concept. The Oxford English Dictionary defines it as
“the ability to acquire and apply knowledge and skills”. Merriam-Webster defines it as
“the ability to learn or understand or to deal with new or trying situations” and as “the
ability to apply knowledge to manipulate one's environment” [Oxf13].
The author defines intelligence as “the ability to understand and to learn from
experience”. More specifically, it is the ability to learn or “catch on” without solely being
taught or told how to act by another intelligent agent or entity. Building upon the
aforementioned, intelligence allows for logical reasoning, the
ability to plan, and the ability to solve problems. In essence it is the premise for rational
thinking – the ability to make the correct decisions based on specific conditions. One
gains knowledge through experience and intelligence allows people to correctly apply it
to solve a problem.
6.2 Artificial Intelligence
Artificial Intelligence is the science of understanding the nature of thought and
intelligent behaviour, and to then derive a way to mimic this behaviour in a computer-
based system. Its main objective is to create intelligence in computer-based systems,
either with regard to human behaviour or without reference to it. In the former case, it
attempts to recreate the capabilities of the human mind, in both the ability to make
decisions and to learn. In the latter case, its goal is to create seemingly intelligent
systems which solve tasks for people. An example would be an intelligent search bot,
which sifts through information. Artificial intelligence not only deals with recreating
reasoning and problem solving, but also with other real-life aspects such as Computer
Vision and Natural Language Processing [Rus10].
6.3 Intelligent Agent
An intelligent agent is the artificial representation of an intelligent entity that can
interact with its environment, learn from it and use its ability to reason in a way to
achieve certain goals that are vital to its survival in that environment. An agent must be
capable of autonomous action within its environment; it should not rely on human
intervention to operate but rather make use of its own precepts to react to change
[Lon09, Woo09].
An intelligent agent perceives its environment and takes the appropriate actions to
achieve optimal success within that environment. The agents are required to exist for an
arbitrary time, to adapt to changes within their environment and to achieve certain
goals (such as navigating a maze). Agent learning is a core attribute, whereby an agent
will use past experiences and gained knowledge to achieve its goals in an easier and
quicker manner than before. An intelligent agent’s main goals are the following: display
deductive reasoning and make inferences with regard to an operating environment; and
make decisions to accomplish a goal, based on either past experience, current
information or, conversely, the lack thereof [Rus10].
Agents can learn in any number of ways, including: single-agent learning that allows an
agent to learn from its operating environment (such as learning to successfully navigate
a maze) or from a combination of the aforementioned and another entity within the
environment (such as learning an opponent’s strategy whilst playing a game). Agents
are expected to accomplish their task in the best possible way at all times, even if there
exists some uncertainty, in which case they should opt for the best expected outcome.
6.3.1 Components of an Agent
An agent is essentially a combination of reference architecture and a program
implementing it. A very simple agent may have:
Sensors – These are more accurately described as the agent’s percepts and allow
interaction with the environment.
Condition-action rules – They determine what actions need to be taken.
Actuators – They interact with the environment to produce the desired results.
Figure 6.1: Architecture for a simple reflex agent [Rus10]
Figure 6.1 depicts the architecture for a very basic agent. The following is the agent
program for the abovementioned agent:
function SIMPLE-REFLEX-AGENT(percept) returns action
static: rules – a set of condition-action rules
state ← INTERPRET-INPUT(percept)
rule ← RULE-MATCH(state, rules)
action ← RULE-ACTION[rule]
return action
As can be seen, depending on the type of input the agent receives from its sensors (or
percepts), a matching rule is selected from the set of known condition-action rules,
which determines what action the agent should take [Rus10].
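The SIMPLE-REFLEX-AGENT program can be sketched in Python; the vacuum-style percept, the state names and the two condition-action rules are invented for illustration.

```python
def interpret_input(percept):
    """INTERPRET-INPUT: map a raw percept to an internal state description."""
    return "dirty" if percept["dirt"] else "clean"

# condition-action rules: state -> action
rules = {"dirty": "vacuum", "clean": "move_on"}

def simple_reflex_agent(percept):
    state = interpret_input(percept)
    action = rules[state]      # RULE-MATCH and RULE-ACTION in one lookup
    return action

chosen = simple_reflex_agent({"dirt": True})
```

Because the agent consults only the current percept, it has no memory: the same percept always yields the same action.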
6.3.2 Properties of Agents
Agents can exhibit a number of properties; among these, as defined by Wooldridge [Woo09], are:
Autonomy – an agent must be able to operate without any external intervention
Sociability – agents must be able to communicate with each other specifically in
multi-agent systems
Adaptability – agents should be able to react and adapt to changes within their
environment
Pro-active behaviour – the converse of adaptability, whereby agents should be
able to take action to achieve their goal and not just react to changes
According to Franklin, there are four key attributes that differentiate agents from
simple programs, they are: reaction to the environment, autonomy, goal orientation and
persistence [Fra96]. Agents operate within environments and the next section will
discuss a few environments in which an agent can be applied.
6.3.3 Agent environments
All agents operate within an environment that they perceive. According to Russell
[Rus10] the following different environments can be identified:
Fully observable vs. partially observable – In a fully observable environment the
agent’s sensors will be able to perceive the entire space. In a partially observable
environment only a subset of the environment will be perceived.
Deterministic vs. stochastic – A deterministic environment is one in which the
next state will be determined by the agent’s actions (such as in a game of chess)
whereas a stochastic environment produces more random changes between
states.
Static vs. dynamic – In a static environment there are no external changes
between agent actions. In a dynamic environment there could be changes that
occur while agents take actions.
Discrete vs. continuous – A discrete environment will have a finite set of possible
states, while a continuous environment could have an infinite number of states.
Single agent vs. multi-agent – A single agent environment is one in which there is
one agent trying to achieve its goals. In a multi-agent environment there are
many agents, each working together to achieve a common goal.
The following section provides an overview of the different agent types.
6.3.4 Agent types
A list of the different types of agents that can be employed in an intelligent system will
now be provided:
Reflex agent: An agent whose actions rely solely on its current percept, not
taking into account any previous percepts. Figure 6.1 (see page 107) depicts a
reflex agent.
Model-based reflex agent: This agent keeps internal state information that
relates to the current environment in which it operates (essentially it has a
“model” of its environment which it continuously updates). A model-based reflex
agent is defined as an agent that makes use of such a model.
Goal-based agent: This agent defines knowledge about which goals it has to
achieve in its environment. It chooses its actions to achieve these goals.
Utility-based agent: This agent relies on a utility function that will map its
current state to a real number in order to decide whether the agent will be
“happy” if its current goal is achieved. This allows for the creation of trade-offs
when conflicting goals arise.
Learning agent: This is an agent that uses its past experiences and learns from
them in order to improve its performance [Rus10].
6.4 Multi-Agent Systems
In a Multi-Agent System (MAS), the focus is on distributed computing. A specific agent
will be created for each type of problem that has to be solved. This makes agent
implementation easier as one agent only needs to focus on solving one problem in the
system. A MAS consists of a network of intelligent agents, where each individual agent
has to focus on its own task, but the agent will need to interact with other agents to
solve the greater problems at hand [Wei00, Sho08, Lon09].
The introduced problem is that agents now need to communicate with each other to
complete all the goals required for the system to produce its desired results. To achieve
this, an interaction protocol needs to be designed that closely relates to the
environment in which the intelligent system finds itself and to its goals. The concept of a
MAS can be related to the way in which a team of individuals works together to achieve
its set objectives.
The common agent characteristics are mentioned in the next section [Kep10].
6.4.1 The characteristics of the agents in a MAS
The characteristics of the agents that form part of a MAS are [Kep10]:
Agent autonomy – each individual agent is required to operate on its own.
Localisation of percept information – individual agents do not have a view of the
entire global system.
Decentralisation – the system cannot be controlled from a single point, each
agent works together with others to achieve their set goals.
The following section describes some advantages of making use of a MAS.
6.4.2 The advantages of using a MAS
The following are advantages of making use of a MAS as opposed to single agents
[Uhr09, Car12a]:
A system making use of single agents will be subject to performance bottlenecks,
single point failures, as well as resource limitations on the hardware
infrastructure. A distributed MAS is decentralised and it can be tailored to avoid
these problems.
The data and information required by the agents is often spatially distributed
and thus a MAS allows for the effective retrieval and global coordination of this
data for use by each agent in the MAS.
A MAS more accurately mimics real-world task allocation. In a team situation,
each member is assigned a task to work on and contributes to the global success
of the team’s goals. Similarly a MAS allows for the use of multiple interacting
agents working together to solve a problem.
In situations where legacy systems cannot be rewritten, an agent wrapper can be
created for the system and this agent can then be incorporated into a MAS.
A MAS allows for increased performance by increasing the computational
efficiency of the tasks at hand, and adds characteristics to the overall system such
as reliability, flexibility, load-balancing and scalability.
The next section describes how agents interact and cooperate within a MAS.
6.4.3 Interaction and cooperation in a MAS
Agents within a MAS are required to interact with each other and the most common
form of communication is via signals (or messages) that have defined interpretations.
Multi-agent planning is a way of ensuring that agents interact according to some set of
rules. Social laws can also be used whereby each agent is aware of what they can or
cannot do [Tab09, Rus10].
An example of this would be a vehicle agent that is aware that it needs to drive on the
right-hand side of the road in its environment. Agents do not always collaborate and
may sometimes be in competition with each other. Competing agents are aware that
there are other agents in the environment and that their actions influence one another,
such that one agent may succeed where another does not [Tab09, Rus10].
In a cognitive MAS agents could interact by making use of a principle from speech act
theory. This concept relies on two rules pertaining to the messages passed, namely the
content of the actual message and most importantly the intention of the message. The
Knowledge Query and Manipulation Language (KQML) is a common message passing
protocol used in a MAS. The Knowledge Interchange Format (KIF) is used within the
protocol and describes first-order logic rules pertaining to the message [Tab09].
In order for agents to cooperate within the MAS, they need to have an aligned set of
goals. The two main approaches are that of cooperative and self-motivated agents.
Cooperative agents each have their own set of capabilities to solve the task they have
been assigned. When multiple cooperative agents are combined, each agent solves its
assigned problem and thereby contributes to the global solution; in effect they
are working as a team towards a common goal. Self-motivated agents have a set of goals
that they need to achieve and often compete against each other to achieve them. In
order to achieve proper agent coordination each agent must be aware of the existence
of other agents and thus coordinate their actions with each other accordingly [Tab09].
6.5 Conclusion
Intelligent agents allow intelligent computing systems to be created. These are systems
that can solve complex problems. At the centre of any intelligent system is the agent, a
self-contained unit that can interact with the problem environment or other agents and
produce the required actions to meet the system’s goals. Multi-agent systems allow for
the distribution of intelligent agents within an environment to solve complex problems.
Not only does this bring increased performance, but it can also create a system with
better scalability and increased robustness.
As alluded to in Chapter 5 the EPS4DM model aims to perform the data pre-processing
tasks by following three steps, namely data extraction, clustering and classification. To
this end the EPS4DM model makes use of cooperating agents to control each of these
actions. There are three main pre-processing actions, namely denormalisation,
classification and clustering, each of which is a goal that needs to be achieved within the
EPS4DM model.
Chapter 6 has provided an overview of intelligent agents, their properties and the
environments they can be deployed in.
Chapter 7 introduces Computational Intelligence, a branch of Artificial Intelligence that
studies nature’s approaches to solving problems. The techniques discussed in Chapter 7
will form the basis of an agent’s action program.
[Dissertation roadmap diagram: Chapters 1–13 arranged around the EPS4DM model]
Chapter 7
Computational Intelligence
7.1 Introduction
Chapter 6 provided an overview of intelligent agents, as they form the basis of the
operation of the EPS4DM system. Their functionality is however dependent on methods
found within the branch of heuristic algorithms known as Computational Intelligence
(CI). CI deals with adaptive environments and biologically inspired methods.
Conventional techniques often fall short and cannot solve complex problems as
accurately as CI-based techniques can. The most popular CI techniques are Artificial
Neural Networks, Computational Swarm Intelligence, Fuzzy Systems and Evolutionary
Computation.
Chapter 7 will provide an overview of Computational Swarm Intelligence and the
Genetic Algorithm (GA). The EPS4DM model makes use of a Particle Swarm
Optimisation (PSO) algorithm for its clustering operation and a GA for its classification
and dimensionality reduction operation.
7.2 Computational Swarm Intelligence
Computational Swarm Intelligence is based on the natural emergent behaviour that is
evident in self-organised systems, such as shoals of fish, flocks of birds or a colony of
ants [Kar09]. There are two main approaches in Swarm Intelligence (SI), namely
Particle Swarm Optimisation and Ant Algorithms [Eng07, Kir13].
7.2.1 Particle Swarm Optimisation
Particle Swarm Optimisation (PSO) is an optimisation technique that is inspired by the
collective intelligence found within swarms, such as a flock of birds or a shoal of fish.
The idea is that particles move around within an n-dimensional search space of the
problem domain, and each particle searches for an increasingly improved solution. Each
particle remembers its own best solution and compares it with the best solution found
by the swarm (or a subset thereof), i.e. it uses both its own information and its
neighbourhood’s information. These particles each have a velocity 푣, with which they
move from their best solution to the next best solution as determined by a fitness
function. As a result, the swarm will converge to the best global solution of the problem
[Bla02, Chi05, Mur06].
The components of a PSO are discussed in the remainder of this section:
i. Particle Velocity
Each particle has a velocity at which it travels within the search domain. Depending on
the type of PSO algorithm employed, the velocity vector is updated as follows [Eng07,
Ols12, Kir13]:

v_ij(t + 1) = w·v_ij(t) + c_ij(t) + s_ij(t)

Equation 7.1 – Particle velocity

Equation 7.1 presents the base global-best velocity update equation as used in the gbest
PSO algorithm. Its constituents are [Eng07, Ols12, Kir13]:

Local component (the particle’s current velocity) – v_ij(t).
Inertia weight w – controlling the trade-off between exploration and exploitation.
Cognitive component – c_ij(t) = c_1·r_1j(t)·[y_ij(t) − x_ij(t)], taking into account
the particle’s personal best solution found so far, y_i(t).
Social component – s_ij(t) = c_2·r_2j(t)·[z_j(t) − x_ij(t)], taking into account the
swarm’s global best solution found so far, z(t).

The velocity update occurs for particle i in dimension j of the domain at time interval t.
The other components are [Eng07, Ols12, Kir13]:

The particle’s current position in the domain – x_i(t).
r_1j, r_2j are stochastic elements, commonly drawn from a uniform distribution on
[0, 1].
c_1, c_2 are acceleration constants governing the stochastic elements.
7.2.2 Particle Position
As mentioned in Section 7.2.1, the particle’s position, x_i(t), forms part of the velocity
update calculation, as its distance from more optimal search regions is taken into
account. The particle position represents one possible solution in the problem domain
and is used in fitness evaluations. In its most basic form the particle position update is
a simple summation [Eng07, Ols12, Kir13]:

x_i(t + 1) = x_i(t) + v_i(t + 1)

Equation 7.2 – Particle position

where v_i(t + 1) is the velocity as calculated in Equation 7.1.
7.2.3 Exploration vs. exploitation
The velocity equation plays a key part in the PSO algorithm; after all, this is what drives
particles to move. However, particle velocities can grow rapidly, resulting in large
step sizes being taken in each iteration. Exploration is required in
the initial stages of optimisation to allow the swarm to search many areas of the
domain. Exploitation is needed in the later stages, allowing the refinement of solutions
within a certain region of the domain. Section 4.2 investigates this trade-off [Eng07,
Sim13].
7.2.4 Fitness evaluation
In terms of mathematical minimisation, the fitness function is the problem that needs to
be optimised. In the case of data mining, Chapter 8 will investigate the details of how
PSO can be applied to the data mining pre-processing tasks.
7.2.5 Basic Algorithm
In its simplest form, the basic PSO algorithm will consist of two procedures [Eng07,
Ols12, Kir13]:
Step 1: Calculate the particles’ best positions
Making use of the fitness function defined for the problem domain, the best positions
are calculated for:
each particle at iteration t – y_i(t) in Equation 7.1;
the entire swarm at iteration t – z(t) in Equation 7.1.
Step 2: Move particles
Calculate the new velocity of each particle at iteration t using Equation 7.1.
Calculate the new position of each particle at iteration t using Equation 7.2.
This process introduces two concepts, namely synchronous and asynchronous updates.
The base algorithm uses synchronous updates, whereby the best positions are
calculated per iteration and then particles are moved. Asynchronous updates allow for
the best positions to be calculated immediately after a position update has occurred,
thus providing immediate feedback to the algorithm [Eng07].
Steps 1 and 2 will be repeated until the swarm has converged on a best solution, or
more accurately, when the swarm stopping conditions have been met. Figure 7.1 depicts
the way in which a few particles will move at a velocity 푣 for each iteration.
Figure 7.1: Particles moving towards the new global best [Eng07]
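For illustration, the synchronous gbest loop of Steps 1 and 2 can be sketched in Python. This is a minimal sketch only: the spherical fitness function, the parameter values and all names below are illustrative assumptions, not part of the model.

```python
import random

# Minimal synchronous gbest PSO sketch, minimising a spherical fitness
# function. Parameter values (w, c1, c2) are common illustrative choices.
def fitness(x):
    return sum(v * v for v in x)

def pso(dim=2, swarm_size=10, iterations=50, w=0.7, c1=1.4, c2=1.4):
    # Initialisation: stochastic positions, zero velocities (Section 7.2.6).
    pos = [[random.uniform(-5, 5) for _ in range(dim)] for _ in range(swarm_size)]
    vel = [[0.0] * dim for _ in range(swarm_size)]
    pbest = [p[:] for p in pos]           # personal best positions, y_i(t)
    gbest = min(pbest, key=fitness)[:]    # swarm's global best, z(t)

    for _ in range(iterations):
        # Step 1: update personal and global best positions.
        for i in range(swarm_size):
            if fitness(pos[i]) < fitness(pbest[i]):
                pbest[i] = pos[i][:]
        gbest = min(pbest + [gbest], key=fitness)[:]

        # Step 2: move particles using Equations 7.1 and 7.2.
        for i in range(swarm_size):
            for j in range(dim):
                r1, r2 = random.random(), random.random()
                cognitive = c1 * r1 * (pbest[i][j] - pos[i][j])
                social = c2 * r2 * (gbest[j] - pos[i][j])
                vel[i][j] = w * vel[i][j] + cognitive + social
                pos[i][j] += vel[i][j]
    return gbest
```

Run on the spherical function, the swarm converges towards the origin, the global minimum.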
7.2.6 Initialisation
To get the algorithm going, the initial position of each particle is chosen using a
stochastic method and its velocity is set to 0. This initial position is also set as the
particle’s current best position. This gives the velocity update equation (Equation 7.1)
an initial position to start from at the first iteration of the algorithm.
Initialisation is more important than it seems, since a bad initialisation of particles can
reduce the swarm’s chances of finding the global minimum.
7.2.7 Stopping conditions
Specific stopping conditions for PSO include capping it at a maximum number of
iterations, stopping if there is no significant improvement in the solution for a number
of iterations or when the swarm radius is small enough to assume that the swarm has
converged [Sim13].
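These three conditions can be combined into a single check, sketched below; the threshold values and function names are illustrative assumptions rather than prescribed values.

```python
# Illustrative PSO stopping-condition check (thresholds are my own choices).
def should_stop(iteration, best_history, positions,
                max_iter=200, patience=20, tol=1e-6, radius_eps=1e-4):
    # 1. Maximum number of iterations reached.
    if iteration >= max_iter:
        return True
    # 2. No significant improvement over the last `patience` iterations.
    if len(best_history) > patience:
        if best_history[-patience - 1] - best_history[-1] < tol:
            return True
    # 3. Swarm radius small enough to assume the swarm has converged.
    centre = [sum(col) / len(positions) for col in zip(*positions)]
    radius = max(
        max(abs(p[j] - centre[j]) for j in range(len(centre)))
        for p in positions
    )
    return radius < radius_eps
```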
7.2.8 Charged Particle Swarm Optimisation (CPSO)
CPSO is a popular PSO-based algorithm that offers improvements over the basic gbest
implementation. In this method, each particle within the swarm is given a charge in
addition to its velocity: the charge is either 0 (neutral), a value greater than 0
(positive) or a value smaller than 0 (negative). The aim of CPSO is that particles with
similar charges will repel each other, in the hope that they will find other (and possibly
improved) solutions within the domain.
CPSO aims to solve the problem of lost diversity. If the diversity factor drops below a
specified minimum, particles move away from each other rather than towards each
other, ensuring that more possible solutions are explored [Mur06].
The normal PSO procedure is followed. However, once the diversity factor (which
measures how much of the search domain the particles have explored) drops below a
specified minimum, the following happens, as shown in Figure 7.2:
Particles are assigned a value that represents a charge, as described above; this
charge relates to the individual fitness evaluation of each particle.
Instead of particles moving toward the determined global best solution, they are
repelled from each other.
This will ensure that there is enough diversity in the search domain in order to avoid
premature convergence at a locally determined optimum that is perceived as the global
optimum. Therefore, other possible solutions are explored [Mur06].
Figure 7.2: Charged particles being repelled [Eng07]
Ensuring that there is adequate diversity within the search domain gives the
PSO algorithm its best chance of finding the global optimum. The multi-swarm
technique, discussed next, is another way to achieve this.
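A minimal sketch of the diversity check and the repulsion step might look as follows; the diversity measure, the constants and all names are my own illustrative choices, and [Mur06] should be consulted for the full method.

```python
import math

# Illustrative CPSO-style diversity check and repulsion step.
def diversity(positions):
    # Mean distance of the particles from the swarm centre.
    dim = len(positions[0])
    centre = [sum(p[j] for p in positions) / len(positions) for j in range(dim)]
    return sum(math.dist(p, centre) for p in positions) / len(positions)

def repulsion_step(positions, charges, step=0.1, d_min=1e-3):
    # When diversity drops too low, like-charged particles repel each other.
    if diversity(positions) >= d_min:
        return positions  # normal PSO behaviour continues
    moved = []
    for i, p in enumerate(positions):
        push = [0.0] * len(p)
        for k, q in enumerate(positions):
            if k == i or charges[i] * charges[k] <= 0:
                continue  # only like (non-neutral) charges repel
            d = math.dist(p, q) or 1e-9
            for j in range(len(p)):
                # inverse-square push away from the other particle
                push[j] += (p[j] - q[j]) / (d * d)
        moved.append([p[j] + step * push[j] for j in range(len(p))])
    return moved
```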
7.2.9 Multi-Swarms
The problem of lost diversity can also be solved by employing multiple swarms within
the environment. This large colony of swarms can then interact with each other in the
same way individual particles do in order to find the optimum. However, much care
must be taken to ensure that swarms do not readily converge on the same solution
within a highly dynamic environment [Bla02, Fon09].
In order to solve the problem of outdated memory, authors suggested that particles
should do a global re-evaluation of the fitness function in order to ensure that their best
solution is still the same solution at each iteration of the search [Bla02].
As mentioned, a multi-swarm consists of a colony of swarms. In this adaptation, an
atomic analogy can be made of electrons moving around a nucleus: a swarm consists of
a normal PSO swarm and particles that swarm around it to find other good optima
within the same region. The idea is to place a swarm on each good local optimum within
the entire domain. The particles swarming around a peak can keep track of how and
when it changes, hence allowing an up-to-date recollection of the best optimum (a
solution to the outdated memory problem). These swarms can then work together to
converge at the global optimum [Bla02, Fon09].
7.3 Evolutionary Computation
Section 7.2 introduced the concept of swarms and particles by way of an optimisation
technique that performs a heuristic search of the problem domain. Evolutionary
Computation (EC) is a branch of Computational Intelligence that is based on the
Darwinian concept of evolution. More specifically, it relies on the theory of survival of
the fittest, which states that creatures keep the features that allow them to survive in
their environment and, conversely, shed the features holding them back [Kar08].
Systems developed on the concepts of EC will employ evolutionary processes including
natural selection, reproduction, mutation and the survival of the fittest. They often make
use of populations of individuals that are evolved in a search space [Eng07].
7.3.1 Genetic Algorithms
A Genetic Algorithm (GA) is a stochastic search technique originally described by
Holland in 1975 and is based on the Darwinian theory of natural selection. The
biological theory states that organisms adapt in order to survive and this evolution is
driven by the notion of allowing the fittest organisms to survive and the weaker ones to
die off [Sim13].
The following sections will describe the important aspects of the GA:
i. Search Space
The search space X must be represented in an appropriate way for the GA to be able to
traverse it. Hence each vector in X is represented by a string s of length l, composed of
elements from an alphabet A, which defines the mapping [Hau04, Sim13]:

c : A^l → X

This yields an appropriate search space:

S ⊆ A^l

The length l of the string depends on the dimensions of both X and A. In the context of a
GA this mapping is called the genotype-phenotype mapping; the elements of the
mapping are called genes and their actual values are called alleles.
Thus the GA employs a population of strings representing possible solutions in the
search domain; these strings are often referred to as chromosomes, and it is common
practice to use binary strings as representations [Sim13]. These chromosomes are
subject to certain operations that will be defined next.
ii. Crossover
The crossover of a pair of chromosomes produces child chromosomes that
inherit characteristics from both parents. This is a simple process of replacing a set of
genes in one parent with those of the other parent. Figure 7.3 depicts two binary string
chromosomes with a randomly selected crossover point. The sub-sets of genes after this
point are swapped to form the children [Hau04, Aff09, Sim13].
1010110010101 011100100010 1011001010011 010100101010
1010110010101 010100101010 1011001010011 011100100010
Figure 7.3: Crossover operation of two Chromosomes [Eng07]
Randomly chosen crossover point
1st parent’s genetic code 2nd parent’s genetic code
1st offspring’s genetic code 2nd offspring’s genetic code
Chromosomes after crossover
iii. Mutation
Mutation operates on a single chromosome and produces a child that is nearly
identical to its parent, as only a few genes are changed. In this operation a subset of
genes is chosen at random and the allele values of those selected are changed. For a
binary-mapped string this is as simple as taking the complement of the bits. Referring
to Figure 7.3, the 1st parent’s genetic code with genes 3 and 5 selected will become:
1000010010101011100100010 [Hau04, Eng07, Aff09, Sim13].
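The two operators can be illustrated on the binary strings of Figure 7.3; the function names below are my own, and the sketch assumes 1-based gene positions as used in the mutation example above.

```python
# One-point crossover and bit-flip mutation on binary-string chromosomes,
# as described in this section (a minimal illustrative sketch).
def crossover(parent1, parent2, point):
    # Swap the gene subsets after the crossover point.
    child1 = parent1[:point] + parent2[point:]
    child2 = parent2[:point] + parent1[point:]
    return child1, child2

def mutate(chromosome, points):
    # Flip (complement) the alleles at the selected gene positions (1-based).
    bits = list(chromosome)
    for p in points:
        bits[p - 1] = '0' if bits[p - 1] == '1' else '1'
    return ''.join(bits)
```

Applied to the parents of Figure 7.3 with the crossover point after gene 13, `crossover` reproduces the two offspring shown in the figure, and `mutate` with genes 3 and 5 reproduces the mutation example in the text.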
iv. Fitness function
As is the case with the PSO described in the previous section, the GA also requires a
fitness function to evaluate each chromosome’s fitness [Tra09]. The fitness function
and its results are used in the stopping conditions for the GA, which will be defined
next.
v. GA algorithm
The GA makes use of the operators defined above to manipulate a set of chromosomes
and can be expressed as the following algorithm [Aff09, Sim13]:

select initial population of chromosomes;
while termination conditions not met do
    repeat
        if crossover condition met then
            select parent chromosomes;
            choose crossover parameters;
            perform crossover;
        if mutation conditions met then
            choose mutation points;
            perform mutation;
        evaluate fitness of children;
    until sufficient children created;
    select new population;
endwhile
This pseudo-algorithm serves as a basis for any GA implementation and refers to user-
specified crossover and mutation conditions.
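A runnable Python counterpart of this pseudo-algorithm is sketched below. The truncation selection scheme, the rate values and the one-max fitness used to exercise it are my own illustrative choices, not prescribed by the pseudo-algorithm.

```python
import random

def flip(bit):
    return '0' if bit == '1' else '1'

def mutate_bits(chromosome, rate):
    # Flip each allele independently with a small probability.
    return ''.join(flip(b) if random.random() < rate else b for b in chromosome)

def run_ga(fitness, length=16, pop_size=20, generations=60,
           crossover_rate=0.9, mutation_rate=0.02):
    # Select initial population of chromosomes (random binary strings).
    population = [''.join(random.choice('01') for _ in range(length))
                  for _ in range(pop_size)]
    for _ in range(generations):  # termination: a fixed generation budget
        children = []
        while len(children) < 2 * pop_size:  # until sufficient children created
            p1, p2 = random.sample(population, 2)
            c1, c2 = p1, p2
            if random.random() < crossover_rate:  # crossover condition met
                point = random.randrange(1, length)
                c1, c2 = p1[:point] + p2[point:], p2[:point] + p1[point:]
            children.extend([mutate_bits(c1, mutation_rate),
                             mutate_bits(c2, mutation_rate)])
        # Select new population: truncation selection on the children.
        population = sorted(children, key=fitness, reverse=True)[:pop_size]
    return max(population, key=fitness)
```

For example, `run_ga(lambda s: s.count('1'))` evolves towards the all-ones string under the one-max fitness function.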
7.4 Conclusion
Chapter 7 has introduced two CI methods, namely the PSO and the GA. The PSO is
based on swarm intelligence and draws on the biologically inspired notion of flocks,
whereby particles move through a search space to find optimal solutions.
The GA is based on the notion of evolution, whereby candidate solutions to the problem
are defined and manipulated with biologically inspired operators in order to find
optimal solutions within the search space.
As discussed in both Chapters 5 and 6, the CI-based techniques that were introduced in
this chapter form the basis of the classification and clustering processes that will be
used in the EPS4DM model.
Chapter 8 serves as an overview of the proposed EPS4DM model and presents the data
pre-processing operations based on the algorithms defined within this chapter.
[Dissertation roadmap diagram: Chapters 1–13 arranged around the EPS4DM model]
Chapter 8
The EPS4DM model
8.1 Introduction
The data mining pre-processing steps discussed in Chapter 5 that will be examined in
this chapter can be grouped into clustering, and classification as a dimensionality
reduction technique. Chapters 6 and 7 introduced the tools required to construct the
pre-processing system that is defined in this chapter.
The purpose of this chapter is to provide an architectural overview of a data pre-
processing system to transform data in an intelligent manner to enhance its suitability
for data mining operations. The Extract, Pre-process, Save for Data Mining (EPS4DM)
model, which will perform the pre-processing tasks required on a chosen dataset, is
introduced. The model also transforms the dataset into formats that can be accessed by
data mining algorithms from a data mart. Both the data clustering and classification
techniques require the relational data to be denormalised into feature vectors.
This chapter also elaborates on how multi-agent technology is used to orchestrate the
processes of extracting the relational data into the formats required by the classification
and clustering processes and how the results of these sub-processes are used to
maintain the data mining mart.
The next section will provide a comparison between the traditional ETL processes that
are used and the one proposed by the EPS4DM model.
8.2 Traditional ETL compared to the EPS4DM model
As discussed in Chapter 5, ETL processes are required to extract data from relational
tables, transform them into the formats required (a star schema) and then do a bulk
load of the current live data snapshot into the warehouse. These tasks are usually run
once a day. When data mining tasks must run, a selection of the source data and where
it needs to be aggregated from must be determined before the pre-processing tasks can
be run. This is depicted on the left-hand flowchart in Figure 8.1.
Figure 8.1: Traditional ETL compared to the EPS4DM model
Figure 8.1 depicts the traditional ETL process on the left-hand flowchart where it can be
seen that the data selection and pre-processing tasks occur late in the process as it is
driven by the need to perform data mining on the existing data. The right-hand
flowchart depicts the proposed EPS4DM model’s approach whereby the selection of
data and the pre-processing thereof are done beforehand. The EPS4DM model proposes
that the data mining pre-processing tasks be set up to run directly from the relational
database. By storing the processed data, the rule extraction algorithms have a
repository that they can run from. This process is depicted in the right-hand flowchart
of Figure 8.1.
Sections 8.3 and 8.4 provide an overview of how agent technology is used to orchestrate
this process.
8.3 Clustering
As defined in Chapter 5, clustering is the process of detecting subsets of “similar” data.
The exact nature (or structure) of the data is not known; in other words the
programmer does not know how to classify the data as part of a specific class, and hence
the aim is to group it into clusters that have similar characteristics. This section will
provide an overview of a semi-supervised PSO-based data clustering technique to
extract data from multiple source tables in a relational database and to effectively
aggregate this data into the data store.
As discussed in Chapters 4 and 5, the relational database is based on relations. The
target format for a data mining algorithm is fundamentally different as it requires data
to be stored in an attribute-value format. Effectively this constitutes a denormalisation
of the original relational structure and this is the first step in the clustering process as
defined in Figure 8.2.
The clustering process can be broken down into three constituent parts, namely:
Dataset selection
Data extraction and denormalisation
Cluster centre calculations and dataset clustering
8.3.1 Data extraction and denormalisation
In the first step, the user selects the set of target tables from the relational database
which contain the data that will be clustered. A key concept that will be discussed in
Chapter 9 is that of the associativity between target records. This is shown in Figure 8.3.
The target tables will inevitably contain links to data in non-target tables and users wish
to break down this high degree of associativity for pre-processing purposes.
The relational structure as depicted in Figure 8.3 needs to be transformed into a set of
feature vectors that will become the search space for the clustering process.

Figure 8.2: The clustering process as used in the EPS4DM system (relational database,
data extraction, data transformation, feature vectors, cluster centre calculations,
outlier detection (cluster purity), clustered data)
Figure 8.4 shows the structure of the data extraction and denormalisation process.
There is an active RDBMS interface which is set up to monitor the selected tables in
the database. When its trigger conditions are met (for example, each time a certain
number of records has been added, or when a specific time of day is reached) it will
signal to the data extraction agent that the data in this dataset structure needs to be
extracted.
Figure 8.3: Associativity between target records [Alf07, Ray10]
Figure 8.4: Deriving feature vectors from relational data
The data extraction agent will build an in-memory model of the required relational
structure for this specified dataset. This model is then passed to the data
transformation agent, which will produce the feature vectors for this dataset and
persist them for use by the clustering agent. The data change signal is a core element,
as it will trigger the
data extraction agent which will then orchestrate a new set of feature vectors to be
derived for the dataset.
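The extraction and transformation step can be sketched as follows; the tables, column names and aggregate features below are invented purely for illustration of how linked relational rows collapse into one attribute-value vector per target record.

```python
# A minimal sketch of denormalising related records into feature vectors.
# The tables and column names are invented for illustration.
customers = {1: {"age": 34, "region": "ZA-GP"},
             2: {"age": 51, "region": "ZA-WC"}}
orders = [
    {"customer_id": 1, "amount": 120.0},
    {"customer_id": 1, "amount": 80.0},
    {"customer_id": 2, "amount": 300.0},
]

def build_feature_vectors():
    # Follow the foreign key, aggregate the linked rows and emit one
    # attribute-value vector per target record.
    vectors = []
    for cid, cust in customers.items():
        linked = [o["amount"] for o in orders if o["customer_id"] == cid]
        vectors.append({
            "customer_id": cid,
            "age": cust["age"],
            "order_count": len(linked),
            "total_amount": sum(linked),
        })
    return vectors
```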
8.3.2 Cluster centre calculations and dataset clustering
As mentioned in Chapter 5, the first part of producing a clustered dataset is to find
cluster centres. These centres form the base reference points around which the data
points are grouped into clusters. A PSO-based cluster centre computation is proposed
and the details of this will be discussed in Chapter 10.
The PSO algorithm will effectively produce a set of cluster centres for the set of feature
vectors which was derived in the extraction process overviewed in Section 8.3.1. The
Euclidean distance between each data point and its cluster centre is used to determine
which cluster the data point belongs to.
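The nearest-centre assignment can be sketched as follows; in practice the centres would come from the PSO of Chapter 10, and the function name here is my own.

```python
import math

# Assign each feature vector to the cluster whose centre is nearest by
# Euclidean distance (an illustrative sketch).
def assign_clusters(vectors, centres):
    clusters = {i: [] for i in range(len(centres))}
    for v in vectors:
        nearest = min(range(len(centres)),
                      key=lambda i: math.dist(v, centres[i]))
        clusters[nearest].append(v)
    return clusters
```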
Figure 8.4 gives an overview of the way in which the relational data is first
denormalised by building the feature vectors, which are persisted for the given dataset
so that the clustering agent can use them as its source.
Figure 8.5: The data clustering process
Figure 8.5 depicts the data clustering process that will orchestrate the clustering
process:
A cluster count K is read from the configuration data.
The clustering controller fetches the required vectors. A call is then made to
the PSO-based clustering agent, which will derive the K cluster centres and
place the feature vectors in the K clusters.
The clustering controller runs the cluster quality checks to determine how
well the data has been clustered.
The controller then decides either to adapt the value of K, recalculating the cluster
centres and re-clustering the dataset, or to persist the results to the clustered data
store.
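The controller's decision loop can be sketched as follows; the K-adaptation policy (simply incrementing K), the quality threshold and all function names are my own illustrative assumptions, since the model leaves these choices open.

```python
# Sketch of the clustering controller's decision loop.
def control_clustering(vectors, cluster_fn, quality_fn,
                       k=2, k_max=8, threshold=0.8):
    while k <= k_max:
        centres, clusters = cluster_fn(vectors, k)  # PSO-based agent call
        quality = quality_fn(clusters)              # cluster quality check
        if quality >= threshold:
            return centres, clusters                # persist the results
        k += 1                                      # adapt K and re-cluster
    return centres, clusters
```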
8.4 Classification
As introduced in Chapter 5, classification is the process of mapping a specific data
element to a pre-determined class. In order to achieve this, the data needs to be
compared to a known set of classes.
When the number of variables in each dataset is very high this becomes a challenge as
the high-dimensionality of data, as discussed in Section 5.4.4, exponentially increases
the processing time required. A linear technique to address this issue was introduced
in Section 5.5.1, namely Principal Component Analysis (PCA), whereby the most
prominent components of the dataset are selected.
This section will introduce the proposed model for a Genetic Algorithm-based data
dimensionality reduction solution for the classification process that will be defined in
Chapter 11. This process is defined in two parts, namely: Feature vector creation and
calculating the reduced set.
8.4.1 Building the feature vector
As discussed in Chapter 7 the first part of setting up a GA is to determine the
chromosome mapping and in the case of relational data this is done by determining a
feature vector.
Figure 8.6 indicates the process of feature normalisation which will normalise the
values of a feature vector and save the class associated with the feature (in the case of a
training data set).
Figure 8.6: Feature Vector normalisation
The feature normalisation process will make use of the feature vectors derived from
the process described in Section 8.3, but for the classification process a class label
needs to be assigned to each feature vector. This labelled set will be used as training
data.
Figure 8.7: Building the normalised feature vector
Figure 8.7 shows the creation of the normalised feature vector. The feature
normalisation agent will select the features from the feature vector store and normalise
the data. It will then persist the normalised vectors, in combination with their class
labels (when available), for the given dataset in the feature vector store, which is
exposed via an interface.
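A feature normalisation step as in Figure 8.6 can be sketched with min-max scaling; the choice of the [0, 1] range and the function name are my own illustrative assumptions.

```python
# Min-max normalisation of feature vectors to [0, 1], keeping the class
# label (when available) alongside each vector (an illustrative sketch).
def normalise(vectors, labels=None):
    dims = len(vectors[0])
    lo = [min(v[j] for v in vectors) for j in range(dims)]
    hi = [max(v[j] for v in vectors) for j in range(dims)]
    normed = [
        [(v[j] - lo[j]) / (hi[j] - lo[j]) if hi[j] > lo[j] else 0.0
         for j in range(dims)]
        for v in vectors
    ]
    if labels is None:
        return normed
    return list(zip(normed, labels))  # pair each vector with its class label
```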
8.4.2 The classification process
As discussed in Section 5.5, dimensionality reduction is the process of either selecting
or extracting an optimal set of features that still accurately describes the dataset.
Formally, it can be defined as: given a set of A-dimensional input patterns, produce a
transformed set of B-dimensional patterns, where B < A.
This is achieved by a transformation matrix that maps the A dimensions to the B
dimensions; thus the goal of the dimensionality reduction algorithm is to discover this
mapping matrix.
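The mapping can be sketched as a plain matrix-vector product. The matrix H used in the example below, which simply drops one feature, is invented for illustration; in the model, the GA of Chapter 11 would evolve the entries of H.

```python
# Mapping an A-dimensional pattern to a B-dimensional pattern via an
# A-by-B transformation matrix H (a plain-Python illustrative sketch).
def transform(pattern, H):
    # pattern: length-A list; H: A x B matrix; result: length-B list.
    A, B = len(H), len(H[0])
    assert len(pattern) == A
    return [sum(pattern[i] * H[i][j] for i in range(A)) for j in range(B)]
```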
The goal of the GA is to produce a transformation matrix that will ensure that the
original set of features gets mapped to the reduced set. The GA must maintain a
population of competing feature transformation vectors which will map the original
features to a reduced set.
Figure 8.8 depicts the process through which the feature vector serves as an input
pattern A to the GA. The GA will build and evolve the transformation vectors H_1 .. H_z,
where z is the population count of the GA. The new set of features, B, is calculated from
the chosen transformation vector H_i and fed to the classifier.
Figure 8.8: Building the transformation vectors [Far07]
The classification process that is driven by the classification controller is shown in
Figure 8.9:
The value of K, the number of neighbours the classifier examines, is read from the
configuration data.
The classification agent retrieves the training data set, with class labels, as
normalised feature vectors from the feature vector store. It also retrieves the set
of vectors that has to be classified.
The classification agent maintains a set of weighted transformation vectors that
it uses to transform the dataset, and sends the resultant vectors to the classifier,
which will classify the set against a known test set for the dataset.
The controller then performs the classification purity checks and determines
whether the GA needs to run again to build a more optimal reduced vector set.
Figure 8.9: The classification process
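The classification step can be sketched with a simple K-nearest-neighbour majority vote over the (transformed) labelled training vectors; the data and function names below are invented for illustration, and the actual classifier used by the model is defined in Chapter 11.

```python
import math
from collections import Counter

# Classify a vector by majority vote among its K nearest labelled
# training vectors (an illustrative K-nearest-neighbour sketch).
def knn_classify(vector, training, k=3):
    # training: list of (feature_vector, class_label) pairs.
    nearest = sorted(training, key=lambda t: math.dist(vector, t[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```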
8.5 Overview of the data mining mart
As discussed in Sections 8.3 and 8.4 there are two types of pre-processed data that are
maintained in the data mining mart, namely clustered and classified data. Once these
processes have been set up for the required tables, the triggers will allow the new data
to be calculated and persisted in the store.
Figure 8.10 depicts the data mining mart with the pre-processed data derived from each
process. The cluster centres and feature vectors are transient data required by the sub-
processes and act as triggers for new data changes in the relational database. Once the
trigger conditions are met the transient data is calculated and stored, allowing the
classification and clustering processes to be run to produce new up-to-date pre-
processed data.
Figure 8.10: The data mining mart
8.6 Conclusion
Chapter 8 has introduced the EPS4DM model which incorporates the data pre-
processing tasks of clustering and classification as means of dimensionality reduction.
The EPS4DM model aims to extract the data required from a relational data subset
chosen by the user and produces the required feature vectors that describe the search
space. The clustering and classification tasks are then run on the chosen dataset and the
resultant clustered data as well as the possibly reduced classified set are stored in a
data mining mart. When significant data changes occur on the target data subset, the
agents will trigger the update of the feature vectors and the pre-processing tasks that
are run on the new set to update the data mining mart. This effectively allows the data
mining algorithms to use the mart as their repository, meaning they can access
pre-processed data as and when it becomes available. Compared with the traditional
approach, in which the required data is aggregated only once the organisational decision
is made to pursue data mining, this allows for a more active approach to data mining.
Chapter 8 served as an introduction to the components defined in the pre-processing
system, namely data extraction, clustering and classification. The data mining mart was
also described in this chapter, showing how the transient data is recalculated when
changes made to the relational database trigger the pre-processed data to be
re-processed.
The following three chapters describe each of the three sub-tasks in more detail.
Chapter 9 describes the data extraction process in detail and depicts the construction of
the feature vectors. Chapter 10 provides a computational and mathematical model for
the PSO-based clustering process. Chapter 11 describes the feature vector creation
process and explains the computational and mathematical models for the GA-based
classification process.
[Chapter map: Chapters 1–13 of the dissertation, locating Chapter 9 (data extraction and denormalisation) within the EPS4DM model]
Chapter 9 The EPS4DM’s data extraction process
9.1 Introduction
As defined in the preceding chapters, the traditional method of populating a data
warehouse with production data is to follow the ETL process. In this process data is
categorically extracted, transformed into the appropriate format and loaded into the
target warehouse. This process specifically requires data from multiple tables to be
loaded into a single table via the transformation process.
The inherent problem with this method is that the source tables are subject to a high
degree of one-to-many associations. The traditional transformation processes result in a
loss of accuracy directly proportional to the number of tables and the degree of
associativity between them.
Chapter 8 provided an overview of the EPS4DM model that handles the data extraction
from a relational table. The concept of a feature vector was also introduced. Chapter 9
will provide more detail on this process as well as an overview of how data will be
extracted from multiple source tables in a relational database.
9.2 Mapping relational structures
As discussed in Chapter 2, dimensionality reduction is a phase of the data mining
process in which the size of the target data is reduced in order to allow a rule engine to
effectively run on a smaller dataset. As discussed in Chapters 4 and 5 the relational
database is built on relations. The target format for a data mining algorithm is
fundamentally different as it requires data to be stored in an attribute-value format.
Effectively this constitutes a denormalisation of the original relational structure.
The target table for any specific data mining process will inevitably contain records that
are associated with one or more records in other non-target tables.
A high degree of associativity results in a very large target dataset often subject to the
multiple-instance problem (where an instance refers to a row in a table) [Alf07, Ray10].
In the target table, a single record R_i can have a high degree of associativity with other
records stored in the non-target tables, as shown in Figure 9.1.
Figure 9.1: Associativity between target records combined as feature vectors [Alf07].
Let R denote a dataset from the target table and T denote the associated n records in the
non-target table with T = (t_1, t_2, t_3, …, t_n). Let T_i be a subset of T, T_i ⊆ T, which is
associated via a foreign key to a single record R_i in the target table. The non-target
table can thus be described as a table consisting of rows of objects, where a subset of the
rows can be linked to a single object in the target table. The vector space model allows
the records associated with a target record to be represented as a feature vector,
informally known as bags of patterns as illustrated on the right-hand side in Figure 9.1
[Alf07, Kuo09, Ray10].
There are two types of data that can be represented in a feature vector, namely numeric
and mixed data. Sections 9.2.1 and 9.2.2 provide an overview of each.
9.2.1 Numeric data
When the target data in question is numeric in nature the feature vector is constructed
directly from the data. The features of the target table in question are aggregated and
form the initial feature vectors X:

X = {x_1, x_2, …, x_m}

Each feature vector x_i is of size N, where N is the number of features in the dataset.
When numeric data is aggregated it is normalised into the range [1.0, 10.0], where w_i is
a weight associated with the feature x_i. The following normalisation formula is used to
perform the weighting of each feature:

x' = ((x − x_min) / (x_max − x_min)) × 9 + 1
When dealing with numeric data, only the weighted feature set must be used. The
reasoning is that weighting exposes the outliers in the data: when features are closely
bundled together, differentiating between classes is more difficult.
Figure 9.2 depicts data points of three different classes: the image to the left is mapped
as normalised values whereas the image on the right contains a weighted normalised
data set [Ray00].
Figure 9.2: The effect of feature weights to expose outliers [Ray00]
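The normalisation formula above can be sketched in Python. This is an illustrative sketch, not the dissertation's implementation; the treatment of the weight w_i as a simple multiplier is an assumption based on the surrounding text.

```python
def normalise_feature(values, weight=1.0):
    """Map raw numeric feature values into [1.0, 10.0] via
    x' = ((x - x_min) / (x_max - x_min)) * 9 + 1, then apply a feature weight."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # guard against a constant column
    return [((v - lo) / span * 9.0 + 1.0) * weight for v in values]

print(normalise_feature([20, 65, 110]))  # → [1.0, 5.5, 10.0]
```

The minimum value of a feature maps to 1.0 and the maximum to 10.0; a weight greater than 1.0 stretches the feature's range, which is what exposes outliers as in Figure 9.2.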
9.2.2 Mixed-type data
When the target table contains non-numeric values such as Boolean or type descriptors,
the normalisation process described in Section 9.2.1 does not apply. The feature vector
is constructed by representing each value as an integer that represents the index of each
feature from the target table. Figure 9.3 represents a feature vector that contains a mix
of numeric, string and Boolean data represented by the feature indexes [Far07, Pei98].
The problem of multiple instances arises when tables are denormalised: denormalisation
may produce more than one instance of the same unique record, which essentially results
in a larger dataset to search.
The following section provides a method to deal with multiple instances for a feature.
1 35 21 12 8
Figure 9.3: Example feature vector with indexes to complex types
9.2.3 Handling multiple instances
In a simple database, a target record R_i can be associated with a non-target record
that has a single attribute. In this case the single-associativity approach can be used, in
which the generated feature contains only one attribute mapping [Alf07, Atr, Ray10]. As
seen in Figure 9.4, the non-target Income_Stats table contains age, annual income, a
foreign key for the country and an income class.
Income_Stats (source table):
  Age | Annual Income | Country | Income Class
  25  | 75k           | 1       | 1
  27  | 85k           | 2       | 2
  25  | 75k           | 1       | 1

Mapped feature patterns:
  Age | Income | Country Mapping | Income Class
  25  | 75k    | 20              | 1
  27  | 85k    | 11              | 2
Figure 9.4: Example of a feature pattern being built
The cardinality of the non-target domain needs to be determined first; this is the
number of unique values that a specific attribute can take. In this case the cardinality
is two, as there are two options available in the Country column. The non-target
sequences need to be encoded as binary numbers in order to ascertain the appropriate
number of bits required.
Country lookup table: ID 1 = USA; ID 2 = UK
The following can be derived for n:

2^(n−1) < FeatureDomain ≤ 2^n

Calculating from the data presented in Figure 9.4: 2^0 < 2 ≤ 2^1, so n = 1. This results in
the binary mapping (0, 1) representing the country options USA and UK. The pattern for
each instance of the feature is maintained for the non-target table: a counter is kept for
each foreign-key entry, either incrementing the count of an existing pattern or adding
the pattern to the vector if it does not already exist. The number prefixed to each
pattern represents the frequency of its occurrence. Referring to Figure 9.4, it can be
seen that the two entries for the duplicate income class 1 record have been grouped into
one record [Alf07, Atr, Ray10].
Calculating the number of bits required to represent the feature instances is crucial, as
it determines which part of the representation holds the counter value. In the simple
example given, the duplicate entry for country USA has been allocated a country code of
20, where the 0 represents the country feature and the 2 the instance count. Since n = 1,
the last n digits of the mapping represent the feature and the remaining digits are used
as the counter. If the value were 1000, the first three digits would serve as the counter,
implying 100 instances of feature 0.
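The counting scheme above can be sketched in Python. This is an illustrative sketch: the digit-concatenation encoding (count prefix followed by an n-bit feature code) is inferred from the worked example in Figure 9.4, and the function and argument names are invented.

```python
from collections import Counter
from math import ceil, log2

def encode_patterns(fk_rows, domain_size):
    """Build count-prefixed feature patterns for one non-target attribute.

    fk_rows maps a target-record key to the list of attribute values associated
    with it via the foreign key. Each distinct value gets an n-bit binary code,
    and each pattern is prefixed with its occurrence count.
    """
    n = max(1, ceil(log2(domain_size)))  # bits needed to encode the feature domain
    codes = {}                           # attribute value -> integer code (0, 1, ...)
    patterns = {}
    for key, values in fk_rows.items():
        counts = Counter(values)         # collapse duplicate instances into counts
        encoded = []
        for value, count in counts.items():
            code = codes.setdefault(value, len(codes))
            encoded.append(int(f"{count}{code:0{n}b}"))  # count prefix + feature bits
        patterns[key] = encoded
    return patterns

# Figure 9.4: record 1 references USA twice, record 2 references UK once.
print(encode_patterns({1: ["USA", "USA"], 2: ["UK"]}, domain_size=2))
# → {1: [20], 2: [11]}
```

The duplicate USA rows collapse into the single pattern 20 (count 2, feature 0), matching the Country Mapping column in Figure 9.4.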
9.2.4 Multiple instances with more than one feature
A real-world dataset will almost certainly have a non-target table that contains multiple
attributes. A similar approach to the handling of multiple instances described above is
followed, but a feature mapping is built for each referenced feature.
9.3 Process overview
Figure 9.5 provides an overview of how the agents in the EPS4DM model, as described
in Chapter 8, handle the data extraction process, which is the first step of
pre-processing.
Figure 9.5: Data extraction process
9.4 Conclusion
Chapter 9 has provided an overview of the proposed data extraction and
denormalisation processes used by the EPS4DM model. This chapter has shown that the
data required by both the clustering and classification tasks can be derived from a
relational table by means of building a feature vector.
The nature of the data in the target table will determine the type of feature vector that
can be built. There are essentially two types of feature vectors, namely numeric and
non-numeric data. When numeric data is used as input to build the feature vectors, the
data will be normalised in order to expose outliers. In the case of non-numeric data two
options are presented: using the index of each feature as the value, or, when the dataset
could contain multiple instances, using the pattern approach to define the feature. The
pattern approach can also reduce the dataset at the outset, as multiple instances of the
same record are aggregated.
As discussed in Chapter 8 the feature vectors are saved in the data mining mart. The
pre-processing tasks of clustering and classification that both rely on the feature vector
will be described in Chapters 10 and 11.
Chapter 10
EPS4DM’s clustering process
10.1 Introduction
Chapter 9 provided an overview of the way in which the feature vectors are derived
from a relational database. These vectors are then persisted in the data mining mart.
Chapter 10 will provide an overview of K-Means clustering, show how the PSO-based
clustering algorithm based on [Sou04, Kom11] is derived, and give an overview of
cluster quality measures. K-Means is a local minimisation method, which means it relies
on its view of the K nearest neighbours in order to determine which cluster a feature
vector belongs to. Local minima are often found which results in feature vectors from
more than one class being grouped together. The PSO-based clustering algorithm
attempts to avoid these local minima by making use of the information provided by the
swarm as a whole as opposed to the K nearest neighbours.
This chapter provides an overview of the way in which cluster centres are determined,
the way in which distances between clusters and feature vectors are calculated and an
in-depth explanation of the PSO algorithm used to cluster the data.
10.2 Determining cluster centres
The basic concepts relating to clustering were described in Chapter 5. As discussed in
that chapter, the first step in producing a clustered dataset is to find cluster centres.
These form the base reference points from which other datum points are grouped into
clusters. Finding cluster centres is considered an NP-hard problem [Gho06]. This
section will give an overview of the K-Means algorithm that forms the basis of the PSO-
based technique.
10.2.1 Traditional K-Means algorithm to find cluster centres
K-Means makes use of the squared 2-norm distance (more commonly known as
Euclidean distance) between points in order to find the minimised cluster centres. An
important aspect of the algorithm is that the number of clusters and the initial cluster
centres need to be chosen before the algorithm commences [Deb06].
Let X = {x_1, x_2, …, x_m} be a subset of points in the dataset, with K being the number of
clusters we wish to produce as W = {ω*_1, ω*_2, …, ω*_K}; the goal is to minimise:

Q(W) = Σ_{j=1}^{K} Σ_{x_i ∈ ω_j} ‖x_i − c_j‖

where Q(W) is the summation of the norm of the distance between each datum point
and its cluster centre. Each cluster j has centre c_j, defined as the centroid of cluster j:

c_j = (1/n_j) Σ_{x_i ∈ ω_j} x_i

with n_j being the number of points in cluster j [Deb06].
The goal is to find K clusters, with each cluster defined as ω*_j, which minimises the
distance between each datum point and its cluster centre based on Euclidean distances
[Deb06].
The K-Means algorithm is summarised in the next section by providing its initial
conditions and computational steps.
i. Initial conditions
Step 1 – An arbitrary partitioning of the data W(0) into K clusters is made and the initial
cluster centres are defined. For each point the cluster centre with the closest Euclidean
distance is found. If a tie occurs between two clusters, it is resolved via a random
assignment to one of the cluster centres.
ii. Computational steps
Step 2 – A new set of cluster centres C(t) is calculated from the previous set by
computing the centroid of each cluster. The centroid is the point that minimises the
total distance of all points in that cluster to itself, and for cluster j it is computed as
[Deb06]:

C_j(t) = (1/n_j) Σ_{x_i ∈ ω_j(t)} x_i
Step 3 – If a stopping condition is met, then the final partitioning is represented by
W(t) and the final cluster centres by C(t). If not, t is incremented by 1 and steps 1 – 3
are repeated.

A common stopping condition is to calculate the difference between successive
solutions for Q and test whether it falls within an acceptable error tolerance range ε:

|Q(t) − Q(t − 1)| ≤ ε
K-Means is a local optimisation algorithm designed to optimise clustering of data based
on the squared Euclidean distance between a point and its closest cluster centre [Ye05,
Deb06].
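The three steps above can be sketched as a compact K-Means implementation. This is an illustrative Python sketch, not the dissertation's code; random initial centres are drawn from the data points themselves, and the stopping condition used here is simply that the centres no longer move.

```python
import random

def k_means(points, k, iters=100, seed=0):
    """Plain K-Means: random initial centres, then alternate assignment and centroid update."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)  # Step 1: arbitrary initial centres
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:  # assign each point to its closest centre (squared Euclidean)
            j = min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centres[c])))
            clusters[j].append(p)
        # Step 2: recompute each centre as the centroid of its cluster
        new = [tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centres[j]
               for j, cl in enumerate(clusters)]
        if new == centres:  # Step 3: stop when the centres no longer change
            break
        centres = new
    return centres

pts = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
print(sorted(k_means(pts, 2)))  # → [(1.25, 1.5), (8.5, 8.75)]
```

On this toy dataset the algorithm converges to one centre per visually obvious group; with less well-separated data it can instead settle in a local minimum, which is the weakness the PSO approach of Section 10.3 aims to address.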
Section 10.3 provides an overview of the PSO algorithm used in the EPS4DM model to
cluster data.
10.3 PSO clustering algorithm
As discussed in Chapter 7 Particle Swarm Optimisation (PSO) is a stochastic search
algorithm that is inspired by the collective intelligence found within swarms, such as a
flock of birds or a shoal of fish. Particles move around within an n-dimensional search
space of the problem domain, each searching for increasingly better solutions. The
application of the technique and adaptations required to search for an optimal feature
subset are described in this section.
10.3.1 Search space
Let S be the dataset with X = {x_1, x_2, …, x_m} a subset of points in S, with K being the
number of clusters to be produced as W = {ω*_1, ω*_2, …, ω*_K}. For X the ith element x_i is
defined as a vector in the n-dimensional search space, with n ≥ 0 representing the
dimensionality of the original dataset S [Che04, Sou04].
149
10.3.2 Euclidean distances and centroid calculations
As discussed in Section 8.3.1 the relative Euclidean distance and centroid formulae are
required to determine cluster centres and the distances between data points and their
nearest cluster centre.
Cluster centres are determined by using the following centroid formula [Sou04]:

z_j = (1/n_j) Σ_{∀x_r ∈ c_j} x_r     (Equation 10.1 – Calculating cluster centres)

z_j defines the centre vector of cluster j for the space containing n_j particles.
x_r represents the rth data vector of S.
c_j represents the subset of data points making up cluster j.

The relative Euclidean distance between particles and their nearest cluster centre is
defined in the following way [Che04, Sou04]:

D(x_r, z_j) = ‖x_r − z_j‖ = √( Σ_{k=1}^{d} (x_rk − z_jk)² )     (Equation 10.2 – Relative Euclidean distance)

x_r represents the rth data vector of S.
z_j is the centre vector for cluster j, calculated using Equation 10.1.
d represents the dimensionality of each centre vector z_j and is effectively the
number of features in the vector.
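Equations 10.1 and 10.2 can be checked numerically with a small sketch (illustrative Python, not part of the dissertation):

```python
import numpy as np

def centroid(cluster: np.ndarray) -> np.ndarray:
    """Equation 10.1: z_j is the mean of the vectors in cluster j."""
    return cluster.mean(axis=0)

def euclidean(x: np.ndarray, z: np.ndarray) -> float:
    """Equation 10.2: Euclidean distance between data vector x and centre z."""
    return float(np.sqrt(((x - z) ** 2).sum()))

cluster = np.array([[0.0, 0.0], [4.0, 0.0], [2.0, 6.0]])
z = centroid(cluster)
print(z, euclidean(np.array([2.0, 6.0]), z))  # → [2. 2.] 4.0
```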
10.3.3 Initial population and particle encoding
A single particle in the search space represents a vector as a possible solution for K
cluster centres with dimensionality n in the space S. Thus particle p_i = (z_1, z_2, …, z_K),
where each z_j is a possible cluster centre [Sou04, Kom11]. It is recommended that
3 · K · n particles are chosen for the swarm population [Sou04].
Figure 10.1 represents a sample particle configuration with 푛 = 2 as the dataset’s
dimensionality and 퐾 = 4 as the anticipated number of cluster centres. Each particle
maintains 4 cluster centres as well as the 2-dimensional vectors v(2) of the search space
S which are part of that cluster [Qu10]. The initial particle positions and velocities
representing the cluster centres are chosen at random. The details of this process are
described in the next section.
Figure 10.1: PSO particle representing a solution in 2 dimensions with 4 clusters
10.3.4 Particle velocity and position update
As stated in Section 7.2.1 (see page 115), each particle's velocity is governed by a
cognitive and a social component, and particles are moved in the n-dimensional
hyperspace by adding the current velocity vector to the previous position vector.
In chaotic PSO, the r1 and r2 components are replaced by a chaotic random component
to ensure that particles can recover from local minima [Che04, Yan11].
Cr(t + 1) = 4.0 × Cr(t) × (1 − Cr(t))     (Equation 10.3 – Chaotic sequence variable)

The chaotic sequence variable is generated for each iteration t of the algorithm,
bounded by the following conditions [Yan11]:

Cr(0) = 0.001
Cr(t) ∉ {0, 0.25, 0.5, 0.75, 1.0}
Cr(t) ∈ (0, 1)
As discussed in Section 7.2.1 (see page 115), the velocity of each particle is updated per
social and cognitive component:

v_ij(t + 1) = w · v_ij(t) + c_ij(t) + s_ij(t)     (Equation 10.4 – Particle velocity update)

The velocity update occurs for particle i in dimension j of the domain at time interval t,
with the stochastic coefficients inside the cognitive component c_ij and the social
component s_ij drawn from the range [0, 1].
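The chaotic sequence and velocity update can be sketched as below. This is an illustrative sketch: the dissertation only states that r1 and r2 are replaced by the chaotic component, so the particular way the chaotic value cr and its complement (1 − cr) are used as the two draws is an assumption, as are the inertia and acceleration values w = 0.72 and c1 = c2 = 1.49.

```python
def chaotic_sequence(cr):
    """Equation 10.3: logistic map Cr(t+1) = 4.0 * Cr(t) * (1 - Cr(t))."""
    return 4.0 * cr * (1.0 - cr)

def update_velocity(v, x, pbest, gbest, cr, w=0.72, c1=1.49, c2=1.49):
    """Equation 10.4: inertia plus cognitive pull (towards the particle's own best)
    and social pull (towards the swarm's global best), scaled by the chaotic draw."""
    cognitive = [c1 * cr * (p - xi) for p, xi in zip(pbest, x)]
    social = [c2 * (1.0 - cr) * (g - xi) for g, xi in zip(gbest, x)]
    return [w * vi + ci + si for vi, ci, si in zip(v, cognitive, social)]

cr = 0.001  # Cr(0); the initial value must avoid the fixed points {0, 0.25, 0.5, 0.75, 1.0}
for _ in range(3):
    cr = chaotic_sequence(cr)
v = update_velocity([0.1, -0.2], [1.0, 1.0], [2.0, 0.5], [3.0, 2.0], cr)
print(round(cr, 6), len(v))
```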
10.3.5 Fitness evaluation for particles
An optimal fitness function is critical to ensure both the convergence and solution
quality of the swarm. The relative Euclidean distance can be used to find the aggregate
distances of all data points in the cluster to their closest cluster centre. An acceptable
error tolerance can be decided upon to allow the swarm to converge. The fitness of each
particle that must be compared to this tolerance value is calculated using:

fitness = Σ_{i=1}^{K} Σ_{x_j ∈ c_i} D(x_j, z_i)     (Equation 10.5 – Fitness function)

The fitness of a particle is thus the summation of the norms of the distances between
each datum point and its cluster centre, with i = 1, …, K and j = 1, …, n, where K is the
number of clusters, n is the number of data vectors, z_i is a cluster centre and x_j is the
data point under consideration [Sou04, Kom11].
10.3.6 PSO algorithm to calculate cluster centres
The PSO algorithm as adapted from [Kom11] needs to initialise the K cluster centres.
This is done by assigning each vector in the search space S as defined in Section 10.3.1
to a random cluster centre. A random local best fitness is also assigned. The following
steps are then repeated for each particle for a set number of iterations [Kom11]:
begin
Randomly initialize particle swarm
Randomly generate Cr(0)
repeat
Compute the fitness value of particle as per Equation 10.5 for each
particle in C
Compare the fitness of the particle to its local best solution and set
the value of local best to the current value if fitness is better
Compare the fitness of the particle to the current global best found
and set the value of global best to the current value if fitness is
better
Compute the new particle velocity and position
Compute the Euclidean distance between each data vector and its
nearest cluster centre as per Equation 10.2
Cluster the data vectors by assigning them to their closest cluster
centres
Recalculate cluster centres as per Equation 10.1
until stopping criterion is met
end
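The listing above can be rendered as a runnable sketch. This is illustrative Python, not the dissertation's implementation: plain uniform r1/r2 draws are used instead of the chaotic variant, the parameter values (w = 0.72, c1 = c2 = 1.49, swarm size, iteration count) are assumed defaults, and initial centres are seeded from randomly sampled data points.

```python
import random

def pso_cluster(data, k, n_particles=12, iters=60, w=0.72, c1=1.49, c2=1.49, seed=1):
    """Each particle is a flat vector of K candidate centres; fitness is the sum of
    distances from every data vector to its nearest centre (Equation 10.5)."""
    rng = random.Random(seed)
    dim = len(data[0])

    def dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5

    def fitness(pos_i):
        return sum(min(dist(x, pos_i[j * dim:(j + 1) * dim]) for j in range(k))
                   for x in data)

    # Random initialisation: each particle's K centres are distinct sampled data points.
    pos = [sum((list(p) for p in rng.sample(data, k)), []) for _ in range(n_particles)]
    vel = [[0.0] * (k * dim) for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pfit = [fitness(p) for p in pos]
    g = min(range(n_particles), key=pfit.__getitem__)
    gbest, gfit = pbest[g][:], pfit[g]

    for _ in range(iters):
        for i in range(n_particles):
            for d in range(k * dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            f = fitness(pos[i])
            if f < pfit[i]:          # update local best
                pbest[i], pfit[i] = pos[i][:], f
            if f < gfit:             # update global best
                gbest, gfit = pos[i][:], f
    return [gbest[j * dim:(j + 1) * dim] for j in range(k)], gfit

data = [(1.0, 1.0), (1.2, 0.8), (8.0, 8.0), (8.3, 7.9)]
centres, fit = pso_cluster(data, k=2)
print(len(centres), round(fit, 3))
```

Because the global best never worsens, the final fitness is at most that of the best initial particle; the swarm-wide information sharing is what lets candidate solutions escape the local minima that trap K-Means.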
10.4 Measuring cluster quality
The measure of cluster quality is dependent on two factors: 1) cluster density (i.e. the
closeness of each data point to its cluster centre) determined by using the final value of
the fitness evolution for the global best particle; and 2) the number of points that were
misplaced in the clustering process.
In order to calculate the number of misplaced points, a relative error rate can be
calculated by counting the occurrence of the post-clustered points being in the same
position as the pre-clustered points. An example of the latter is the counting of the
number of data points that were not moved at all (these refer to outliers which could
not be placed in a cluster – as mentioned in Section 5.3).
error = Σ_i e_i,  where e_i = 1 if vector i has not moved and 0 otherwise

e_i corresponds to the ith data vector of set S and is treated as a possible error vector.
As clustering is an unsupervised process, the resulting dataset is not guaranteed to
contain data from the same class of objects. A measure of supervision needs to be
introduced to the algorithm, and this is referred to as cluster impurity11. On
a high level, a set of clusters will be produced and then a new goal is introduced, namely
to minimise the linear combination of cluster dispersion12 and cluster impurity for the
specific dataset.
Select K ≥ 2 cluster centres from the computed set of clusters as z_k (k = 1, …, K) to
minimise an objective function with weighted impurity and dispersion metrics:

min_{z_1, …, z_K}  β · Cluster_Dispersion + ∂ · Cluster_Impurity

where β > 0 and ∂ > 0 are positive real parameters.
The algorithm can be adjusted to be a pure unsupervised learning algorithm by setting
∂ = 0, resulting in only dispersion being minimised.

11 Cluster impurity is the sum of data points within the cluster which actually belong to another cluster centre.
12 Cluster dispersion is the sum of the differences between the vectors and the cluster centre.
As discussed in Section 5.4.3 with the K-Means algorithm, points are assumed to belong
to their closest cluster centres based on the Euclidean distance.
Non-empty clusters are assigned labels based on the majority class of points that belong
to that cluster [Blo03, Dou08].
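The weighted objective above can be sketched numerically. This is an illustrative sketch under the footnote definitions: dispersion as the summed point-to-centre distances and impurity as the count of points not belonging to their cluster's majority class; the data and names are invented.

```python
def cluster_dispersion(clusters, centres):
    """Footnote 12: sum of distances between each vector and its cluster centre."""
    return sum(
        sum((x - z) ** 2 for x, z in zip(v, c)) ** 0.5
        for cl, c in zip(clusters, centres) for v in cl
    )

def cluster_impurity(cluster_labels):
    """Footnote 11: points per cluster that do not belong to the majority class."""
    return sum(len(lbls) - max(lbls.count(l) for l in set(lbls))
               for lbls in cluster_labels if lbls)

def objective(clusters, centres, cluster_labels, beta=1.0, delta=1.0):
    """beta * dispersion + delta * impurity; delta = 0 gives the purely
    unsupervised form."""
    return (beta * cluster_dispersion(clusters, centres)
            + delta * cluster_impurity(cluster_labels))

clusters = [[(1.0, 1.0), (1.0, 2.0)], [(8.0, 8.0)]]
centres = [(1.0, 1.5), (8.0, 8.0)]
labels = [["a", "b"], ["b"]]
print(objective(clusters, centres, labels))  # → 2.0 (dispersion 1.0 + impurity 1)
```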
10.5 Process overview
The PSO algorithm will effectively produce a set of cluster centres for the data that was
denormalised in the extraction process overviewed in Section 9.2. The Euclidean
distance between each data point and its cluster centre is used to determine which
cluster the data point belongs to. The PSO clustering algorithm will then firstly use this
transformed data to calculate K cluster centres for the data and then to make use of the
centres to cluster the data. The resultant clustered data sets are then persisted in the
data mining mart.
10.6 Conclusion
Chapter 10 has discussed the PSO-based approach as derived from [Sou04, Kom11] to
cluster feature vectors. As discussed in Chapter 8 the clustering controller allows the
clustering agent to retrieve the set of feature vectors from the data mining mart and to
run the clustering process on the retrieved set.
The PSO-based clustering agent maintains a population of possible cluster centres and
the feature vectors assigned to each cluster. Each particle represents a full solution to
the clustering problem. Initially feature vectors are randomly assigned to clusters and
the cluster centres are calculated based on that assignment. The swarm then makes use
of the fitness function in order to minimise the summation of the norm of the distance
between each cluster and its assigned feature vectors. At each iteration new cluster
centres are determined based on this fitness value, feature vectors are re-assigned to
cluster centres and the fitness value is again calculated to determine if better clusters
have been formed.
Making use of a PSO algorithm as opposed to the K-Means algorithm means that
possible local minima could be avoided as the swarm guides each candidate solution.
The clustering controller can then make use of the error rate to determine whether
sufficiently good clusters have been formed. Based on its decision it can re-run the
process or store the resulting clustered data in the data mining mart.
In terms of the EPS4DM model the clustering process described in this chapter provides
the system with a mechanism to perform the clustering tasks on a set of feature vectors.
Chapter 11 will provide the details of the GA-based classification sub-system.
Chapter 11
EPS4DM’s classification process
11.1 Introduction
Chapter 10 provided an overview of the way in which the PSO-based clustering
sub-system makes use of the data in the data mining mart in order to cluster the
selected data. This data is then persisted in the data mining mart. Both the clustering
and the classification tasks require feature vectors as their input. However, as will be
discussed in Chapter 11, the classification task additionally requires a training set in
which each feature vector has an associated class label [Can06].
This chapter will provide the details pertaining to the GA-based classification system as
derived from [Pei98, Far07]. The classification process makes use of a scalable K-
Nearest Neighbour (KNN) classifier that requires as input a training data set with
associated class labels as well as a testing set without class labels. The KNN classifier
attempts to classify the feature vectors based on the training set [Can06]. The purpose
of using a GA is to produce a transformation vector that can reduce the size of the
search space by assigning weights to each feature and then running the transformed set
through the KNN classifier.
Chapter 11 will also give the details of the input data, the chromosome mappings, the
mutation and crossover operators, the implementation of the GA and its parameters as
well as a KNN classifier that is used to assess the quality.
11.2 The GA Feature extractor
In order to gain an understanding of the construction of the GA and its components the
GA feature extractor will be broken down into its constituents. This process involves a
training data set x (if available) and a KNN classifier that will run a KNN rule to evaluate
the fitness of the transformed data set.
The feature extraction’s end goal is to produce a reduced data set for the classifier.
Introducing classification decision feedback to guide the feature extractor is a good
approach and is the approach that will be followed in the EPS4DM model. Figure 11.1
provides an overview of this GA/KNN approach and each operation will be discussed
[Pei98].
11.2.1 The weight vector
As seen in Figure 11.1, the initial population comprises randomly initialised weight
vectors. The weight vector w* is defined as a vector of length N = |x| generated by the
GA; it can equivalently be written as an N×N diagonal matrix W.
This matrix is used to multiply each feature vector x to produce a new vector y
according to a mapping function M(x). This equates to the linear transformation seen in
Figure 11.1.
11.2.2 Linear transformation matrix
y = M(x) = Wx

W = ⎡ w_1  0    ⋯  0   ⎤
    ⎢ 0    w_2  ⋯  0   ⎥
    ⎢ ⋮    ⋮    ⋱  ⋮   ⎥
    ⎣ 0    0    ⋯  w_N ⎦

or w* = [w_1, w_2, …, w_N], x = [x_1, x_2, …, x_N]
The transformation matrix can be used on two types of data, namely numeric data and
complex data.
When complex data types that contain indexes to features are used, the weight vector
is a Boolean bit mapping. Thus w_i ∈ {0, 1}, 1 ≤ i ≤ N, interpreted such that if the ith
weight is a 1 then that feature is deemed part of the feature vector.
When normalised numerical data is used the algorithm will search for a relative weight
over the defined normalised values. Therefore w_i ∈ [0.0, 10.0], 1 ≤ i ≤ N. This implies
that features are being selected in a linearly transformed space and that those elements
that are converging towards 0 are seen as not important and will eventually be
removed.
The resulting weights indicate that a particular feature is useful and this is referred to
as its discriminatory power. The approach of linear transformations is only applicable
when the data patterns or classes are linearly separable [Pei98].
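Both forms of the transformation y = Wx can be sketched as follows (illustrative Python; the data and weight values are invented for the example):

```python
import numpy as np

def apply_weights(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """y = M(x) = W x, with W the diagonal matrix built from the weight vector w*."""
    return np.diag(w) @ x

# Boolean bit mapping for indexed (complex) data: features with weight 0 are dropped.
mask = np.array([0.0, 1.0, 0.0, 1.0])
print(apply_weights(np.array([3.0, 5.0, 7.0, 2.0]), mask))      # elementwise: 0, 5, 0, 2

# Real-valued weights for normalised numeric data: weights converging towards 0
# mark features that will eventually be removed.
weights = np.array([0.1, 9.5, 4.0, 0.0])
print(apply_weights(np.array([2.0, 1.0, 3.0, 4.0]), weights))   # elementwise: 0.2, 9.5, 12.0, 0.0
```

Because W is diagonal, the matrix product reduces to an element-wise scaling of x, which is why the GA can equivalently evolve the flat weight vector w* rather than the full matrix.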
Figure 11.1: GA/KNN feature extractor [Pei98]
11.3 Chromosome mapping
As discussed in Chapter 7, the GA requires an appropriate representation of the search
space by way of a chromosome mapping. Chapter 9 described how feature vectors are
constructed from the relational tables. These vectors contain the information required
for the chromosome.
The feature vectors form the basis of the chromosome mappings that will be used.
When numeric data is used, the chromosomes contain real-valued weights as depicted
in Figure 11.3. When indexed data is used, the chromosome makes use of bit masking
in order to flag the features, as shown in Figure 11.2.
0 1 1 1
Figure 11.2: Bit mask used for chromosome mapping [Ray00]
2.5 3.5 … 7.8 1.9
Figure 11.3: Feature weight used as chromosome mapping
To maintain maximum flexibility, additional bit masks will also be conceived which
allow complex data types to be flagged as normalised data or other data [Ray00].
11.4 Crossover operator
As discussed in Chapter 7, the GA makes use of a crossover operator to splice the
genes of the parents in order to create children to fill the population. The feature
vector's data determines the type of data in the chromosome, either real-valued weights
or bit values. Regardless of the data type, these values are used to create the features
of the children.
Figure 11.4 shows how a random crossover point is selected in each parent and new
feature weight vectors are created as children [Eng07, Far07].
Figure 11.4: Crossover operation of mapped feature vectors [Eng07]
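The splice described above can be sketched as follows. This is an illustrative sketch, not the EPS4DM implementation; the cut points are passed in explicitly here so the operation is deterministic, whereas the model selects them randomly.

```java
// Sketch of the crossover described above: a crossover point is chosen
// in each parent and the heads and tails are exchanged to form two
// children. Cut points are parameters here for determinism.
public class Crossover {

    public static double[][] onePoint(double[] p1, double[] p2, int cut1, int cut2) {
        double[] c1 = splice(p1, cut1, p2, cut2); // head of p1 + tail of p2
        double[] c2 = splice(p2, cut2, p1, cut1); // head of p2 + tail of p1
        return new double[][]{c1, c2};
    }

    private static double[] splice(double[] head, int cutH, double[] tail, int cutT) {
        double[] child = new double[cutH + (tail.length - cutT)];
        System.arraycopy(head, 0, child, 0, cutH);
        System.arraycopy(tail, cutT, child, cutH, tail.length - cutT);
        return child;
    }
}
```

With the example vectors of Figure 11.4, cut points of 2 and 3 reproduce the children shown there.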
11.5 Mutation operator
As discussed in Chapter 7, the mutation operator is employed to introduce diversity into
the population to allow new feature vectors which were not previously there to be
created as possible solutions. This diversity allows the final algorithm to converge to the
optima.
The mutation operator will need to know what type of allele it is dealing with, and thus
the bit masking discussed earlier is utilised. If indexes are used, then values are mutated
to valid index values; if normalised data is used, then the mutation operator
randomly selects a valid value from this data to replace into the chromosome.
The mutation operator cannot introduce an index or a numerically mapped value that
does not exist as this will not map to a feature. Figure 11.5 depicts a valid mutation
operation on feature indexes [Far07, Eng09].
[Figure 11.4 content: parents (3 22 | 35 7 19) and (2 11 21 | 3 41 12 1 27) are spliced at the marked crossover points to produce children (2 11 21 35 7 19) and (3 22 3 41 12 1 27).]
Figure 11.5: Mutation operation of a single feature vector [Eng07]
11.6 Selection strategy
The GA requires that population members are carried to the next iteration of the
algorithm. These are the members that will be subject to the mutation and crossover
operators in the next generation. The selection strategy that will be used for the GA is
fitness proportionate selection (also known as roulette wheel selection) combined with
the elitism strategy13 [Deh08].
11.6.1 Roulette wheel selection
A fitness value is assigned to each chromosome feature vector after the GA operators
have completed the crossover and mutation.
This fitness value is then used to create the probability p_i of each chromosome being
selected, where f_i is the fitness of chromosome i and N is the population count:

    p_i = f_i / Σ_{j=1}^{N} f_j
This probability effectively normalises the selection of the chromosomes to 1 and, in a
similar fashion to how a roulette wheel spins in a casino, a random selection is made on
this probability to see which candidates survive. Chromosomes with a higher fitness are
more likely to be selected, but there is also a chance that less fit members make it
through.
The less fit chromosomes that survive provide an advantage as their individual genes
could contain possible valuable solutions for the generations ahead [Eng07].
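Roulette wheel selection as described can be sketched as follows (illustrative names; not the prototype's code):

```java
import java.util.Random;

// Sketch of fitness-proportionate (roulette wheel) selection: each
// chromosome i is selected with probability p_i = f_i / sum(f).
public class RouletteWheel {

    public static int select(double[] fitness, Random rnd) {
        double total = 0.0;
        for (double f : fitness) total += f;
        double spin = rnd.nextDouble() * total; // spin the wheel
        double cumulative = 0.0;
        for (int i = 0; i < fitness.length; i++) {
            cumulative += fitness[i];
            if (spin <= cumulative) return i;
        }
        return fitness.length - 1; // guard against rounding error
    }
}
```

Note that less fit members retain a non-zero chance of surviving, which is the behaviour the text relies on.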
13 Elitism allows some of the best solutions from the current generation to be promoted to the next one.
[Figure 11.5 content: parent (3 22 35 7 19) mutates into child (3 22 6 7 12); the mutated alleles 6 and 12 are valid feature indexes.]
11.6.2 Elitism strategy
In order to maintain good solutions, chromosomes with a very high fitness level will be
promoted to the next generation unaltered in order to determine in the next round if
their gene constructs also produce a high fitness. Some chromosomes might survive
unaltered for a few generations only to be mutated later [Eng07].
11.7 KNN rule and training data
As discussed in Chapter 5, the purpose of classification is to label a particular set of data
as part of a class. The KNN rule requires the class labels in order to perform the nearest
neighbour comparison. Training data is required as input to the KNN classifier in order
to measure classification error rates. After appropriate chromosomes have been
generated their fitness needs to be assessed, and this is where the KNN classifier comes
in.
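The KNN rule above can be sketched as a generic majority-vote nearest-neighbour classifier; this is an illustrative sketch, not the prototype's KNN class, and the names are assumed.

```java
// Sketch of the KNN rule used to assess a chromosome's fitness: a test
// pattern is labelled by majority vote among its k nearest training
// patterns, using Euclidean distance.
public class KnnClassifier {

    public static int classify(double[][] train, int[] labels,
                               double[] pattern, int k, int numClasses) {
        boolean[] used = new boolean[train.length];
        int[] votes = new int[numClasses];
        for (int n = 0; n < k; n++) {            // pick the k nearest, one at a time
            int nearest = -1;
            double best = Double.MAX_VALUE;
            for (int i = 0; i < train.length; i++) {
                double d = distance(train[i], pattern);
                if (!used[i] && d < best) { best = d; nearest = i; }
            }
            used[nearest] = true;
            votes[labels[nearest]]++;           // vote with the neighbour's label
        }
        int bestClass = 0;
        for (int c = 1; c < numClasses; c++)
            if (votes[c] > votes[bestClass]) bestClass = c;
        return bestClass;
    }

    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(sum);
    }
}
```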
11.8 Fitness function
The fitness evaluation of a feature vector is done by making use of a classifier. This
classifier maps data from a so-called pattern space (initial features) to a feature space
(final selected features) and then to a classification space (which are the class labels).
Thus

    G: X × W → Y → K
where X is the pattern space, W the weight vector (selected features), Y the feature
space (reduced pattern space) and K the set of class labels.
The classifier needs to be trainable according to a function

    J(w*): W → ℝ

where ℝ is the domain of real numbers. Therefore, the process that will be followed is
to find a weight vector w* ∈ W that minimises J(w*).
11.8.1 Scalable classifier
A scalable classifier is defined as one where, if the i-th feature is changed in the input
matrix X and a suitable vector w* is provided, the results are equal. Thus if

    x = [x_1, x_2, …, x_i, …, x_N],  x' = [x_1, x_2, …, x_i', …, x_N]

and the elements x_i ≠ x_i', then G(x, w) = G(x', w) and G is called a scalable
classifier [Pei98].
11.8.2 GA/KNN fitness function
As depicted in Figure 11.1, a GA is used to train a scalable KNN classifier to obtain the
optimal weight vector w*. The function J defined in the previous section is used as a
measure of the error rate of the classifier (the ability of the classifier to mark a pattern
as part of a class).
Each weight vector w* that is generated is used to create a new feature vector y, which
makes up a new, possibly reduced, feature space. The KNN is run in this space, computing
the weight J(w*) for each new vector y while the number of misclassifications is
stored. The w* vector that produces the fewest misclassifications will be the
optimal transformation matrix for the space.
The misclassification tolerance must be based on the error rate derived
from the training data set, and thus fitness is defined as

    Fitness = J(w*) = (number of misclassified patterns) / (total number of patterns)
This fitness function, however, merely examines the ability of the weight vector to
correctly classify patterns based on the performance on the test data set. It does not take
into account the inter-class separation of the labelled patterns, thus the introduction of
Neighbour Awareness14 (NA) forms part of the fitness function, defined as:

    Fitness = J(w*) + NA = γ (misclassified patterns / total patterns) + δ (nMin / K)

where nMin is defined as the number of nearest neighbours which were not labelled as
part of a class and K is the total number of nearest neighbours. γ and δ are dynamic
tuning parameters and their values will be discussed in the GA parameter section
[Pei98].

14 Neighbour awareness refers to the GA being aware of the vectors outside of a specified class.
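The fitness function, as reconstructed here, can be sketched directly; the method and argument names are assumptions.

```java
// Sketch of the fitness function defined above: a weighted sum of the
// classifier's error rate J(w*) and the neighbour-awareness term NA.
// gamma and delta are the tuning parameters from Table 11.1.
public class FitnessFunction {

    public static double evaluate(int misclassified, int totalPatterns,
                                  int nMin, int k,
                                  double gamma, double delta) {
        double errorRate = (double) misclassified / totalPatterns; // J(w*)
        double neighbourAwareness = (double) nMin / k;             // NA
        return gamma * errorRate + delta * neighbourAwareness;
    }
}
```

A lower value is better, so the GA minimises this quantity when evolving weight vectors.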
11.9 GA parameters
The selection of the GA parameters is explained in Table 11.1 [Pei98, Far07].
Table 11.1: GA parameters
    GA Parameter       Value
    Crossover rate     0.8
    Mutation rate      0.1
    Population size    100
    Generations        50
    γ (Fitness)        2.0
    δ (Fitness)        0.2
The results of the GA/KNN classifier as presented in Chapter 12 were used to
empirically derive the parameters in Table 11.1.
11.10 Conclusion
Chapter 11 has explained the specifics of the GA/KNN classifier process that will be
used in the proposed EPS4DM model as derived from [Pei98, Far07]. The process makes
use of the feature vectors as derived in the process described in Chapter 9 but requires
two distinct sets of data, namely a training set with class labels and a test set without
class labels. The goal of the GA is to produce a transformation vector that can be applied
to the original set. This highlights certain features as less or more important than others
based on the assigned weights. The GA maintains a set of these transformation vectors
which it uses to derive a new set of feature vectors to send to the KNN classifier. The
fitness evaluation of the GA is in essence the KNN classifier.
By combining the GA process of finding the optimal features with an actual classifier, the
fitness function drives the generation of chromosomes that contain the fewest
discriminatory features needed to still correctly label items from the test set. As
discussed in Chapter 8, the classification controller will obtain a single instance of the
search space as a set of training and testing feature vectors. The classification agent
evolves transformation vectors that are used to produce a new feature set to send to the
classifier. The controller can then decide, based on the results of the fitness function, to
evolve better transformation vectors. Otherwise, if the best transformation vector for
the dataset is found, the controller will persist the vector, as well as the final
classified dataset, to the data mining mart.
Chapter 11 has provided the details of the mechanism that will be used in the EPS4DM
model to perform data classification as well as the potential reduction of the size of the
search space by highlighting which features are discriminatory in terms of their
weighting.
Chapter 12 will provide an overview of the implemented EPS4DM model and will
present the results of the classification and clustering processes. The chapter will be
divided into three sections, namely: an overview of the implementation, the results of
the PSO-based clustering algorithm from Chapter 10 and the GA-based data reduction /
classification algorithm presented in this chapter.
[Chapter map: 1. Introduction; 2. Data storage evolution; 3. Database models; 4. Data warehousing; 5. Data mining; 6. Intelligent Agents; 7. Computational intelligence; 8. Data pre-processing system; 9. Data extraction and denormalisation; 10. Clustering; 11. Classification; 12. Implementation and Results; 13. Conclusion and future work. EPS4DM Model]
Chapter 12
Implementation and Results
12.1 Introduction
Chapters 8 to 11 presented an overview of the data pre-processing system envisioned
that will orchestrate the pre-processing tasks. Chapter 9 explained in what way data
will be transformed into the correct format, Chapter 10 introduced the PSO-based
clustering process, and Chapter 11 gave an overview of the GA/KNN-based classification
process.
This chapter will be divided into three main sections, namely: the Java based
implementation, the results of the PSO-based clustering algorithm that was discussed
in Chapter 10, and the GA-based data reduction / classification algorithm presented in
Chapter 11.
12.2 Java based implementation
The EPS4DM model that was discussed in Chapters 8 to 11 was implemented in Java 1.6
running on the JVM 1.6.32 and making use of the Spring framework and Hibernate for
DB mappings. All PSO and GA code was written in pure Java without the aid of any
additional frameworks. This section will provide an overview of the data
denormalisation methods implemented, as well as the PSO clustering and GA/KNN
classification methods used.
12.2.1 Data denormalisation and helpers
The data retrieval and denormalisation processes are provided by means of helper
classes that connect to the required tables in a MySQL database and perform extract and
transformation tasks. The format of the data differs for clustering and classification
tasks. Figure 12.1 illustrates the CalcHelper as well as the DataReader classes that
perform the read tasks and provide methods to calculate the Euclidean distances.
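A minimal sketch of the two calculations such a helper provides follows; the class and method names are assumed and do not necessarily match the prototype's actual signatures.

```java
// Minimal sketch of the calculations a helper such as CalcHelper
// provides: Euclidean distance between feature vectors and the mean
// (centroid) of a set of vectors. Names and signatures are assumed.
public class CalcHelperSketch {

    public static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(sum);
    }

    public static double[] mean(double[][] vectors) {
        double[] centre = new double[vectors[0].length];
        for (double[] v : vectors)
            for (int i = 0; i < centre.length; i++) centre[i] += v[i];
        for (int i = 0; i < centre.length; i++) centre[i] /= vectors.length;
        return centre;
    }
}
```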
The Iris dataset was chosen as a UCI standard dataset to assess the feature vector
classification. The dataset is multivariate, contains real values with four features and
has three class labels (Addendum A contains the details of the dataset). It is an
performance.
An example input feature vector format for the Iris dataset in a clustering process is
provided in Table 12.1, listing three feature vectors with four features.
Table 12.1 Clustering data sample from Iris dataset
    Sepal length   Sepal width   Petal length   Petal width
    5.1            3.5           1.4            0.2
    4.9            3.0           1.4            0.2
    4.7            3.2           1.3            0.2
An example input feature vector format for the Iris dataset in a classification process is
given in Table 12.2, listing three feature vectors with four features and a class
denomination for each of the three classes in the set.
Table 12.2 Classification data sample from Iris dataset
    Sepal length   Sepal width   Petal length   Petal width   Class
    4.7            3.2           1.3            0.2           1 (Setosa)
    6.9            3.1           4.9            1.5           2 (Versicolor)
    7.1            3.0           5.9            2.1           3 (Virginica)
12.2.2 PSO-based Clustering
The PSO-based clustering technique as discussed in Chapter 10 was implemented by
way of utilising the standard components of the global-best PSO algorithm and adapting
it to optimise cluster centres and not just numerical values.
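The adaptation described above amounts to applying the standard gbest velocity and position updates to every coordinate of every cluster centre in a particle. A sketch, with parameter names following Section 12.4.1 (w, c1, c2); the class and method names are assumed.

```java
import java.util.Random;

// Sketch of the global-best PSO update adapted to cluster centres: each
// particle's position is a set of K centres, and the standard velocity
// and position updates are applied per centre coordinate.
public class PsoClusterUpdate {

    public static void step(double[][] position, double[][] velocity,
                            double[][] personalBest, double[][] globalBest,
                            double w, double c1, double c2, Random rnd) {
        for (int k = 0; k < position.length; k++) {          // each cluster centre
            for (int d = 0; d < position[k].length; d++) {   // each dimension
                double r1 = rnd.nextDouble(), r2 = rnd.nextDouble();
                velocity[k][d] = w * velocity[k][d]
                        + c1 * r1 * (personalBest[k][d] - position[k][d])
                        + c2 * r2 * (globalBest[k][d] - position[k][d]);
                position[k][d] += velocity[k][d];
            }
        }
    }
}
```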
Figure 12.2 provides an overview of the classes required to perform clustering. A
Particle is extended to a ClusterParticle which contains as data a set of Cluster
centres. Each Cluster contains all the elements as a DoubleArray of the feature vector
created in the format presented in Section 12.2.1.
Figure 12.1: Helper classes within the EPS4DM prototype
CalcHelper also has methods to calculate the cluster means to determine cluster
centres for the elements which have been assigned to them. The PSOClustering class is the
PSO algorithm implementation that contains a set of ClusterParticles as its swarm.
Each ClusterParticle represents a full solution of the search domain; thus it contains K
clusters that are assigned elements from the input set. The class also contains a method
to save the resultant global best solution to the data mining mart.
An example of the format in which the clusters are saved is provided below. This
illustrates three clusters with two feature vectors of dimension four assigned to them:
[5.1,3.5,1.4,0.2] [5.4,3.9,1.7,0.4]
[6.7,3.0,5.0,1.7] [7.3,2.9,6.3,1.8]
[4.9,2.4,3.3,1.0] [5.2,2.7,3.9,1.4]
Any algorithms or systems that require this data can retrieve it via an interface and
make use of the clustered data.
Figure 12.2: PSO clustering classes within the EPS4DM prototype
The following section provides an overview of the classes required to perform the
classification.
12.2.3 GA/KNN Classification
The GA/KNN classification process as discussed in Chapter 11 was implemented by
making use of a standard GA’s components as well as a KNN classifier. A KNN classifier
was created as per Figure 12.3 which takes as input a training data set in the format
described in Section 12.2.1 for classification. This data set has a class label. It also takes
a test data set that is a similar dataset without a class label. The KNN class also
contained a method to perform the normalisation techniques that were discussed in
Chapter 11 which maps the input values to a range [1.0,10.0]. Added to this it contains a
method for performing the nearest neighbour search function in order to assign class
labels to the test data set.
The GA requires an appropriate candidate on which to perform the mutation and cross-
over operations. The ClassificationCandidiate class as seen in Figure 12.3 provides
this functionality by maintaining the w* weight vector defined in Chapter 11 as its
genotype. It also contains a selection mask that is used when working with mixed-type
data. Furthermore, it includes a KNN classifier that is used by its fitness function. It
makes use of the weight vector or selection mask to create a new dataset to run through
the KNN classifier.
The GA algorithm was implemented in the class GAClassification as can be seen in
Figure 12.4. It contains a set of ClassificationCandidates as its population, each of
which represents a weight vector or selection mask that can be used to reduce the
original dataset. It also contains a method for saving the weight vector or feature mask
in combination with the classified dataset to the data mining mart.
The following section describes the hardware-based infrastructure used to perform the
test cases.
12.3 The test setup
The machine used for these tests is specified in the following way: AMD Athlon X2 555
(Dual Core) running at 3.2GHz (6MB cache) with 8GB of DDR3 667 MHz memory on
Windows 7 Ultimate 64-bit. The system also has a SATA II Solid State Drive (SSD) on
which the JVM as well as the RDBMS were running. However, the EPS4DM system can
scale to a platform which supports distributed computing. It should also be noted that
both the PSO and GA algorithms could be re-worked to operate on a General-Purpose
Graphics Processing Unit (GP-GPU) based system for increased performance.
The following two Sections 12.4 and 12.5 will discuss the results obtained from the PSO-
based clustering and GA/KNN classification processes.
12.4 PSO-based clustering results
As discussed in Chapter 10, the clustering process's end result is to store the clustered
data in the data store. To demonstrate the PSO algorithm's ability to perform the tasks
required by the EPS4DM's clustering process, the results are reported on in Sections
12.4 and 12.5.
Figure 12.3: KNN classifier and GA chromosome classes used within the EPS4DM
prototype
Figure 12.4: GA algorithm and result classes used within the EPS4DM prototype
12.4.1 PSO parameters
Sections 10.2.3.1 and 10.2.3.2 described the setup parameters for the PSO-based
clustering algorithm. For these experiments the following parameters were used:
c1 = c2 = 1.5, K = 5 and w = 0.1.
12.4.2 Cluster centres
In order to gauge the PSO algorithm's ability to find cluster centres, a dataset of five
spherical clusters was randomly generated with 50 points each, adding up to 250
data points, as can be seen in Figure 12.5. The K-Means algorithm was run on the set and
the resultant cluster centres were noted. Then the PSO algorithm was run to determine
whether it also found the centres. As can be seen in Figure 12.5 the PSO algorithm was
successful in calculating the cluster centres of the test dataset.
12.4.3 Cluster purity
As discussed in Section 10.2.3.6 the cluster purity (or error rate) is used for outlier
detection on the data. Another simple test was devised to ensure that the PSO algorithm
incorporates post-classified positions of elements. Figure 12.6 depicts a dataset with
two spherical clusters, the left one having 400 elements and the right one 100 elements.
Figure 12.5: K-Means and PSO cluster centre analysis
[Figure 12.5 panels: 5 generated clusters; K-Means cluster centres; PSO cluster centres]

As can be seen in Figure 12.6, the K-Means algorithm uses nearest neighbour search
only and thus clusters elements from the left cluster into the right cluster due to the
Euclidean distances. The PSO algorithm makes use of the effective clustering error rate
(number of miss-placed points) to implement outlier detection and thus clusters the
elements into two distinct spheres.
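One plausible reading of this misplaced-point count is a purity-style measure: a point counts as misplaced if it does not carry the majority label of its assigned cluster. The sketch below is an assumption about the exact definition, not the prototype's implementation.

```java
// Illustrative sketch of a clustering error rate: the proportion of
// points whose label does not match the majority label of their
// assigned cluster (a simple count of misplaced points).
public class ClusterErrorRate {

    public static double errorRate(int[] assignedCluster, int[] trueLabel,
                                   int numClusters, int numLabels) {
        int[][] counts = new int[numClusters][numLabels];
        for (int i = 0; i < assignedCluster.length; i++)
            counts[assignedCluster[i]][trueLabel[i]]++;
        int misplaced = 0;
        for (int c = 0; c < numClusters; c++) {
            int majority = 0, total = 0;
            for (int l = 0; l < numLabels; l++) {
                total += counts[c][l];
                if (counts[c][l] > majority) majority = counts[c][l];
            }
            misplaced += total - majority; // points outside the majority label
        }
        return (double) misplaced / assignedCluster.length;
    }
}
```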
12.4.4 Feature vector analysis
As discussed in Chapter 9, relational database tables require data transformation in
order to get the data into the required format.
Figure 12.6: K-Means and PSO cluster centre analysis with error rate feedback (panels: two generated clusters; K-Means cluster centres; PSO cluster centres)
The method discussed is called a feature vector approach, which constructs binary
mappings of table attributes and their occurrences. The best way to assess the accuracy
of the feature vector approach would be to cluster a known multivariate dataset using
the PSO algorithm, and then to construct relational tables to represent the dataset and
then run the process on that.
12.4.4.1 Iris dataset control set
Both K-Means and the PSO algorithm were run on the Iris dataset and the results of
various factors are listed in this section. The experiments were run 30 times on the
dataset using each algorithm and the results are the averages of these runs. A standard
deviation is noted as well.
Table 12.3: Intra-cluster distances on the Iris dataset
    Factor   K-Means   PSO
    Avg      106.90    97.05
    (Std)    (14.30)   (0.08)
    Best     96.89     96.98
Table 12.4: Error rates on the Iris dataset
    Factor   K-Means    PSO
    Avg      17.95%     11.50%
    (Std)    (11.02%)   (2.04%)
    Best     11.02%     10.00%
The goal of this test was to provide a control measurement of the accuracy obtained by
clustering the Iris dataset from the raw data.
12.4.4.2 Iris dataset as Relational table
Addendum A provides sample structures of two relational views of the Iris dataset. The
feature vector construction task was firstly run on this set in order to obtain the
required mappings, and then the PSO algorithm was run on it. As can be seen in Tables
12.5 and 12.6, the effective intra-cluster distances and error rates are very close, if not
identical, to those of the raw dataset. The differences can be ascribed to the stochastic nature of
the PSO algorithm.
Table 12.5: Intra-cluster distances on the Iris dataset for simple relational mapping
    Factor   PSO (Raw)   PSO (Simple table)
    Avg      97.05       97.10
    (Std)    (0.08)      (0.082)
    Best     96.98       97.01
Table 12.6: Error rates on the Iris dataset for simple relational mapping
    Factor   PSO (Raw)   PSO (Simple table)
    Avg      11.50%      11.51%
    (Std)    (2.04%)     (2.04%)
    Best     10.00%      10.00%
As can be seen in Tables 12.5 and 12.6, the effective intra-cluster distances and error
rates for the simple mapping are very close, if not identical, to those of the raw dataset.
The differences can be ascribed to the stochastic nature of the PSO algorithm and the more
complex mappings produced for the highly relational structure. Tables 12.7 and 12.8 list
the effective intra-cluster distances and error rates for the complex relational mapping.
Table 12.7: Intra-cluster distances on the Iris dataset for complex relational mapping
    Factor   PSO (Raw)   PSO (Simple table)   PSO (Complex table)
    Avg      97.05       97.10                97.20
    (Std)    (0.08)      (0.082)              (0.083)
    Best     96.98       97.01                96.99
Table 12.8: Error rates on the Iris dataset for complex relational mapping
    Factor   PSO (Raw)   PSO (Simple table)   PSO (Complex table)
    Avg      11.50%      11.51%               11.51%
    (Std)    (2.04%)     (2.04%)              (2.06%)
    Best     10.00%      10.00%               9.97%
12.4.4.3 Randomised multivariate structures
Table 12.9 summarises the results obtained from the randomised multivariate runs.
Table 12.9: Fitness values and cluster centres for 3 sets of random data

Dataset generated: 3 Centres, 2 Attributes (2D)
    PSO (Raw):        fitness 0.00076876; centres (68.5436, 11.0967), (8.5423, 10.4508), (37.1243, -5.997)
    PSO (Relational): fitness 0.00077830; centres (69.0002, 10.995), (8.7765, 10.012), (36.9996, -5.943)

Dataset generated: 3 Centres, 2 Attributes (2D)
    PSO (Raw):        fitness 279.4876; centres (1.0245, 3.7645), (-0.5412, 0.3554), (4.2769, 4.002)
    PSO (Relational): fitness 279.6523; centres (1.115, 3.8871), (-0.4999, 0.3667), (4.254, 4.1321)

Dataset generated: 2 Centres, 3 Attributes (3D)
    PSO (Raw):        fitness 5321.2456; centres (6.8443, 5.2212, 3.2213), (0.7754, 9.1235, 10.3456), (27.0032, 66.9856, 0.3224)
    PSO (Relational): fitness 5315.2409; centres (6.9432, 5.2140, 3.1476), (0.7105, 9.2106, 9.9986), (26.9965, 68.0045, 0.3665)

As can be seen, the fitness values obtained for the relational mappings are very close to
those obtained on the raw dataset. The fitness values are the intra-cluster distances and
are specific to each generated dataset.
The aim of the feature vector analysis was to prove that the process can cluster multi-relational
data very closely to the way that the raw data would be clustered. To this end,
multivariate datasets and the resulting relational structures were generated and tested to
document the actual cluster vectors generated. The generated raw data and structures
were stored to allow multiple tests to be repeated on the same set. The Raw and
Relational processes were run 30 times on each set.
12.4.4.4 Feature vector conclusion
By way of conclusion, it is clear that through a controlled test of known data and user-designed
relational tables the feature vector approach yielded results similar to the raw
data. The set of randomised raw and relational structures confirmed the control set’s
observations.
12.5 Clustering conclusion
As can be seen from the results presented there were two tests to consider. First to
consider was the PSO algorithm’s ability to cluster more accurately (by making use of
outlier detection) than the K-Means algorithm. For the Iris dataset it was noted that the
K-Means algorithm tended to cluster the data into two clusters while the PSO algorithm
managed to cluster the data into three clusters, which more accurately depicts the
search space. Secondly, the feature vector approach, which essentially denormalises the
relational structures into vectors of attributes (terms henceforth used interchangeably),
and the clustering performed on the resulting patterns were tested. It can be concluded that the
PSO algorithm performed well in terms of clustering accuracy and that the feature
vectors approach introduced some deviance between the raw data and relational data.
This could be ascribed to both the stochastic nature of the PSO algorithm itself and the
complex vector structures produced by high-dimensional tables. The resultant table
structures and cluster centres are stored in the data mart.
12.6 GA-based Classification results
As discussed in Chapter 11, the GA/KNN classification process has dimensionality
reduction at its disposal, allowing only the most relevant features of a dataset to be
included for analysis. The results of these experiments will be presented in Sections
12.7 to 12.10.
12.7 GA algorithm parameters
As discussed in Chapter 11, the GA-based algorithm requires a number of parameters.
Their values are listed in Table 12.10.
12.8 Randomised datasets
The first thing to analyse was the ability of the GA-based algorithm to identify
significant features and give them the appropriate weights. With reference to Chapter
11, the identification of the transformation matrix w* was the goal. The matrix w* was
then used to compute the reduced dataset by eliminating the features not required. To
this end an approach similar to that of [Pei98] was followed, where a feature matrix was
generated. This feature matrix incorporates known significant features based on other
features. Features 1-3 and 9-12 contain stochastic values in the range [0.0, 10.0] while
features 4-8 contain generated expressions which incorporate other fields, selected from
the other features, in their calculations. This mapping ensures that the observers know
which features are important and which are not. Table 12.11 represents an example of
such a generated dataset where Fi is a feature selected from 2-8. Effectively this means
that features 2-8 are significant to the dataset and that features 1 and 9-12 are not.
Table 12.10: GA-based algorithm parameters
    GA Parameter       Value
    Crossover rate     0.8
    Mutation rate      0.1
    Population size    100
    Generations        50
    γ (Fitness)        2.0
    δ (Fitness)        0.2
The generated feature matrix was stored so that multiple iterations could be run and tested.
The first goal was to generate the w* transformation matrix. Thus 45 permutations of
each class V-Z were generated for each of the 12 features based on the generated matrix,
resulting in 540 patterns of this randomised set. This set of 540 patterns was used as a
training set to calculate w*. The results of an example run of such a set are seen in Table
12.12.
The fittest w* transformation matrix was selected from the training set and was used to
generate reduced datasets for the 1000 permutations generated from the same feature
matrix used to calculate w*. The fitness function uses 8 bits to represent each weight
component, w_i, in the weight vector w*, based on the function defined in Section 11.8.2.
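The 8-bit encoding of a weight component can be sketched as a linear decoding into the normalised weight range [0.0, 10.0] from Section 11.2. The linear scheme itself is an assumption, as the dissertation does not spell out the decoding.

```java
// Sketch of decoding an 8-bit gene into a weight component. The target
// range [0.0, 10.0] follows the normalised weight range of Section
// 11.2; the linear mapping is an assumption.
public class WeightDecoder {

    public static double decode(int gene) {          // gene in 0..255
        return 10.0 * gene / 255.0;
    }

    public static double[] decodeAll(int[] genes) {
        double[] w = new double[genes.length];
        for (int i = 0; i < genes.length; i++) w[i] = decode(genes[i]);
        return w;
    }
}
```

Under this scheme a 96-feature chromosome occupies 96 * 8 = 768 bits, as noted for the high-dimensional dataset in Section 12.10.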
Table 12.11: Generated Feature matrix
    Class   Features 1-3   Features 4-8              Features 9-12
    V       {0,1}          {0,1} * [Fi]              {0,1}
    W       {0,1}          {0,1} * [Fi] + {0,1}      {0,1}
    X       {0,1}          {0,1} * [Fi] - {0,1}      {0,1}
    Y       {0,1}          {0,1} * [Fi] + {0,1}*10   {0,1}
    Z       {0,1}          {0,1} * [Fi]              {0,1}
Table 12.12: Transformation matrix w* values

    w* value   F1     F2      F3      F4      F5      F6     F7      F8      F9-12
    Selected   0      1       1       1       1       0      1       1       0
    Weight     0.000  3.2677  2.7643  5.6753  6.4343  0.000  2.5454  8.1143  0.000
The KNN classifier was compared to the GA/KNN classifier by running both 10 times on
each set of 540 patterns, generating the best w* and then calculating the transformed
dataset. It was concluded that the KNN classifier produced an average 4.5%
classification error rate, while the GA/KNN classifier had about a 3.1% error rate.
The goal of this test was to assess the GA/KNN algorithm's ability to identify significant
features from a feature matrix with known significant features. The average running
time to compute the w* matrix for this example is about 10 minutes. The calculation of
the reduced feature matrix is pure matrix multiplication.
12.9 UCI-standard datasets
In order to gauge the classification accuracy of real-world datasets the Iris and Glass
datasets were compared. In the previous section a test dataset was generated by
allowing possible values from the set range to be generated. In these examples one has
the given set of data with the Iris set containing 150 exemplars and the Glass set 214
exemplars. Suitable values could not be randomly generated, hence the idea was to use
a subset of the data as training data and the remainder as test data. A 50/50 holdout
split was used, so 75 of the Iris exemplars and 107 of the Glass exemplars were used as
training sets.
An example for the Iris dataset is given to illustrate the testing process followed for
real-world datasets. The class labels are setosa, versicolor and virginica. The
pre-processing task to consider was that the user knew which elements should be in
which class, and the goal was then to perform dimensionality reduction of the features
while still yielding the same results. Hence 25 random exemplars from each class
were used to train the GA/KNN classifier and yield the best w* transformation matrix for
the Iris dataset. This was run 30 times. Then the w* transformation matrix was used to
calculate the reduced set from another 25 random exemplars from each class. This was
done 1000 times and each resultant reduced feature set was tested for classification
accuracy. Table 12.13 summarises the average results of both the Iris and Glass
datasets.
As can be seen in Table 12.13, the GA/KNN approach, by producing the reduced dataset
from the w* transformation matrix, classifies the set with a lower error rate than the
KNN approach. The runtimes stated in the table are for a single iteration of each
algorithm's run.
These datasets have been extensively researched and the purpose of this test was to
determine if the GA/KNN classifier was able to produce a reduced dataset which also
more accurately classifies the datasets. As observed, the transformation vector produced
by the GA highlighted that features 3 and 4 are significantly more important to the Iris
dataset than features 1 and 2. Thus the conclusion can be drawn that features 1
and 2 could possibly be omitted from the dataset to reduce the search space size.
Table 12.13: Iris and Glass datasets
    Dataset   KNN error rate   KNN time (s, single run)   GA/KNN error rate   GA/KNN time (s, single run)
    Iris      7.1245%          0.547643                   5.24365%            0.42481
    Glass     3.6894%          0.600121                   2.48032%            0.45632
12.10 High-dimensional datasets
Most relational databases have structures with a high number of dimensions that will
result in very big feature vector mappings. The real test for the GA/KNN approach
would be to subject it to a known dataset of high dimensionality. To this end the 2, 4-D
dataset as described in Addendum A was used. This dataset has 96 features which
results in a 96-bit chromosome mapping. As each weight component of the w* matrix
requires 8 bits, the calculated weight values for all 96 features in the resultant
chromosome occupy 96 * 8 = 768 bits while the GA is running. As with the Iris
example the training set must be a subset of the original set, but due to the high number
of features and the low number of exemplars all of the original exemplars were selected and
run through the GA/KNN classifier. Due to the number of features this was only run once.
The resultant best w* transformation matrix was used to calculate the reduced set;
this was applied to a random selection of one tenth of the original set and run no more than
10 times. The accuracy results are summarised in Table 12.14, which depicts the error rates
of the results.
On inspection of the resultant dataset for this example it was found that 65 of the
original features were disregarded for being 0-weighted. This effectively provided a
67% reduction in the search space. Significantly, the results show that on the fuller
dataset the error rate was lower. This means that a more accurate classification can be
made based on the number of exemplars inspected, with the downside being
performance. The original feature vector and reduced set are saved in the data mart.
Table 12.14: Error rate of the 2, 4-D dataset

Total patterns | K-Means | GA/KNN
23             | 8.3452% | 1.9965%
232            | 6.6785% | 0.82567%
12.11 Classification conclusion
It can be concluded that the GA/KNN classification approach does outperform the KNN approach in terms of both accuracy and the reduction of the dataset. The major problem arises as the number of features increases, since the computation time required increases as well.
12.12 The data mining mart
As discussed in Chapter 8, the purpose of the overall process is to store the centroid vectors and reduced feature vectors for the analysed data structures. The reason is that, as long as the data structures do not change, the addition of a small percentage of new data to the set does not require the set to be reprocessed: the same reduced feature vectors or cluster centroids can be used by the data mining algorithms to produce rules.
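The reuse described above can be sketched as a cache keyed on the structure's schema. The class and key scheme here are hypothetical illustrations, not the EPS4DM implementation.

```python
class DataMiningMart:
    """Minimal sketch: cache cluster centroids and reduced feature
    vectors per data structure, keyed by a schema fingerprint, so that
    unchanged structures need not be reprocessed."""

    def __init__(self):
        self._store = {}

    def schema_key(self, table_name, columns):
        # The fingerprint changes whenever the structure changes.
        return (table_name, tuple(sorted(columns)))

    def save(self, key, centroids=None, reduced_vectors=None):
        self._store[key] = {"centroids": centroids,
                            "reduced_vectors": reduced_vectors}

    def lookup(self, key):
        # Cached results, or None if the structure is new or has changed.
        return self._store.get(key)

mart = DataMiningMart()
key = mart.schema_key("iris", ["sepal_l", "sepal_w", "petal_l", "petal_w"])
mart.save(key, centroids=[[1.5, 0.3], [4.9, 1.7]])
```

A structural change produces a new key, so the stale centroids are simply never found and the pre-processing tasks run again.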
12.13 Conclusion
This chapter has presented the results of the PSO-based clustering and GA-based classification algorithms. The results show that both methods are more effective than their classical counterparts in performing the pre-processing tasks.

The feature vector construction approach used to denormalise relational structures proved to yield near-identical clustering results to processing the raw datasets. It was therefore accepted as a method of denormalising relational structures for clustering.
The GA/KNN classifier constructs a single feature vector from all the required attributes. Thus the more complex and higher-dimensional the relational structure, the more processing time is required to produce the reduced dataset.

The major concern in this chapter was the computational performance on high-dimensional datasets, as well as the correlation between the number of examples in the training set and the accuracy with which the set can ultimately be reduced.
Chapter 13 provides an overview of the dissertation, discusses whether the research
questions were answered and raises possible future work.
[Figure: chapter roadmap of the dissertation, showing Chapters 1 to 13 of the EPS4DM model]
Chapter 13
Conclusion and future work
13.1 Introduction
In this dissertation a data pre-processing model named EPS4DM was presented. This
model collated the required relational data as decided upon by a user, performed the
pre-processing tasks and stored the processed data in a data mining mart. This enables
the data mining rule extraction algorithms to use this mart as an input source for their process.
The main idea behind the research was to give an overview of the current state of production OLTP databases and data warehousing structures as the inputs to the data mining process. A further goal was to investigate, and inform the reader of, the current data mining pre-processing tasks and the way in which computational intelligence (CI) based algorithms coupled with multi-agent technology can manage these tasks.

Another goal of the dissertation was to investigate the viability of implementing agents embedded with CI to orchestrate the pre-processing tasks in order to maintain a data mining mart. Chapter 12 presented concrete results showing that, once a dataset has been chosen, the system can run both the clustering and classification tasks to completion and store their results in the data mart without user intervention. The GA/KNN classifier's performance on complex data types is discussed while answering the research questions.
13.2 Evaluation
The denormalisation of the chosen database structures for the clustering process was of primary importance. The data transformation component, named feature vector construction, was therefore analysed by comparing the accuracy of clustering with both the raw data and the feature vectors as input. The results indicated that the approach is an acceptable manner in which to automate the denormalisation of the required relational structures.

The clustering performance of the PSO-based algorithm was assessed on both generated and known real-world datasets, using the patterns produced by the transformation process. It compared favourably with the K-Means algorithm. The automation of the clustering process is also sound, as the cluster centres for the input parameters were generated and stored in the data mining mart.
The classification performance of the GA/KNN algorithm was likewise assessed on generated and high-dimensional real-world datasets. It reduced the feature space well, but revealed a correlation between dimensionality and runtime performance, which is addressed in the following section. The answers to the research questions posed in Chapter 1 are discussed in Section 13.3.
13.3 Research-driven questions that had to be answered by the dissertation
Question 1: Why are OLTP system databases structured the way they are, how is
this data stored and in what way does data warehousing differ from
data mining?
Chapters 2 and 3 introduced data storage concepts from their first inception to the current state-of-the-art approaches. Early single programs used files to persist whatever data they required. As systems became more complex and multiple programs began to interact with each other using the same data, the need arose for a database: a single storage place serving multiple programs. When the need arose for increased access to these databases, the concept of database clustering was introduced to allow access to multiple instances of databases.

The most profound change was the introduction of the relational database, which forms the foundation of the modern OLTP system. Transaction processing is at the core of the modern organisation, and databases are structured to allow optimal insertion into, and access to, the data in their tables so that systems can interact with each other. The data in such a database is usually stored in third normal form and most commonly represents the latest up-to-date data.
Decision support systems were the first implementations of what are known as data warehouses today. Their purpose is to allow businesses to make decisions based on the data they hold. A data warehouse differs from an OLTP database implementation in that it is a subject-oriented, integrated, non-volatile and time-variant collection of data.
Data mining can be seen as a tool that is part of the Knowledge Discovery process and
can be defined as the process of discovering hidden information from existing data
repositories. It is a collection of processes specifically designed to derive highly valuable
information, which when viewed in the context of the domain it was mined in, provides
knowledge and insight to the user.
Question 2: What are the current data mining pre-processing tasks and how are
they performed?
The data mining pre-processing tasks are as follows:
Data clean-up – This part of the process is still very much a manual one; missing data, inconsistencies and duplicate data need to be resolved at a domain level.
Data denormalisation – In order to get relational data into a format usable by the algorithms it must be transformed, and this is still very much a user-driven process.
Clustering – This is the process of detecting subsets of similar data and grouping
them together. There is a multitude of algorithms available to cluster data and
the traditional mathematical approaches were introduced.
Classification – This is the process of arranging data into a pre-defined class.
The K-nearest neighbour approach was introduced.
Dimensionality reduction – This is the process of reducing the dimension of
the dataset in order to improve classification performance. The principal
component analysis and its mathematical fundamentals were introduced.
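The K-nearest-neighbour baseline named in the classification task above can be sketched in a few lines. This is the textbook algorithm, not the prototype's implementation, and the example data is invented.

```python
import math
from collections import Counter

def knn_classify(train, labels, query, k=3):
    """Classic K-nearest-neighbour: majority vote among the k closest
    exemplars by Euclidean distance."""
    dists = sorted((math.dist(x, query), y) for x, y in zip(train, labels))
    votes = Counter(y for _, y in dists[:k])
    return votes.most_common(1)[0][0]

train = [[1.0, 1.0], [1.2, 0.9], [8.0, 8.0], [7.5, 8.2]]
labels = ["a", "a", "b", "b"]
result = knn_classify(train, labels, [1.1, 1.0])   # nearest exemplars are "a"
```

Every query is compared against all exemplars, which is why reducing the feature space (and the exemplar set) matters so much for classification performance.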
Question 3: How can multi-agent technology embedded with CI orchestrate and
execute the pre-processing tasks?
The pre-processing system that was introduced caters for four of the five pre-
processing tasks defined. The data clean-up task remains one for the domain user to
undertake, as no computer system (to date) is capable of fixing domain-related
problems in datasets.
The other four tasks were orchestrated and automated by making use of multi-agent
technology and are discussed herewith:
Data denormalisation – Once the user has selected the target table structure, the agents can either perform the feature vector process that produces the pattern set required for clustering, or construct a single feature vector for the classification process to run. The results of the selection are stored in a data mining mart.
Clustering – Once the pattern set has been constructed, the agents can execute the clustering of the data stored in the mart and produce a set of cluster centres for the patterns. The centres are then also stored in the data mart.
Classification as a means of dimensionality reduction – When the feature vectors have been constructed, the agents can execute the classification process, which results in the storage of both a transformation matrix and a reduced dataset.
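The orchestration in the three points above can be sketched as a simple pipeline. The agent functions here are illustrative stand-ins (a mean centre instead of the PSO clusterer, zero-variance pruning instead of GA/KNN reduction), not the prototype's actual implementations.

```python
def denormalise_agent(tables):
    """Feature vector construction: join matching parent and child rows
    into one flat pattern per entity."""
    parent, child = tables
    return [p + c for p, c in zip(parent, child)]

def clustering_agent(patterns):
    """Stand-in for the PSO clusterer: return the mean pattern as a
    single cluster centre."""
    n = len(patterns)
    return [[sum(col) / n for col in zip(*patterns)]]

def classification_agent(patterns):
    """Stand-in for GA/KNN reduction: drop zero-variance features."""
    cols = list(zip(*patterns))
    keep = [i for i, c in enumerate(cols) if len(set(c)) > 1]
    return [[row[i] for i in keep] for row in patterns]

def run_preprocessing(tables, mart):
    """Orchestrate the automated tasks and persist each result to the mart."""
    patterns = mart["patterns"] = denormalise_agent(tables)
    mart["centroids"] = clustering_agent(patterns)
    mart["reduced"] = classification_agent(patterns)
    return mart

mart = run_preprocessing(([[1.0, 2.0], [3.0, 2.0]],   # parent rows
                          [[0.5], [0.7]]), {})        # child rows
```

Each stage reads only the previous stage's stored output, which is what allows the agents to run to completion without user intervention once the structure is chosen.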
Question 4: How viable is it to maintain a data mining mart with pre-processed
relational data?
The results of the PSO and GA based clustering and classification algorithms produced
the following conclusions.
The process of clustering is not as complex as classification, but the data transformations required to perform clustering are. The converse is true for classification: the process is much more complex, but the data transformations are not.
The clustering process yielded good results and relatively acceptable runtime performance, because the major part of the time used is for the transformation of the data; this is done in the first phase and stored for the selected structures. The classification process led to two distinct conclusions: firstly, that the curse of dimensionality plays a role and processing time increases dramatically with the number of features; and secondly, that the training set is of the utmost importance, since the more data there is, the more accurately the reduced set can be classified. The trade-off is that as the amount of data increases, so too does the processing time.
The viability of maintaining a data mart consisting of patterns, vectors, cluster centres and reduced datasets ultimately comes down to system performance. However, as was seen in the classification results, once the feature transformation matrix is built it reduces the number of features, which ultimately speeds up the actual classification process on the real data. The author is thus faced with a trade-off between accuracy and time: if the algorithm can produce an acceptable result in one hour but takes 12 hours to refine the answer, is the improvement worth 12 times the effort? There is, however, a solution to this dilemma, which is posed in the section on future work.
13.4 Future work
High-dimensional datasets contain many features which, combined with the sheer volume of data often found in such datasets, can lead to performance problems on single-CPU implementations. The aim of this research was to ascertain whether a data mining mart can be maintained from relational data sources, and the answer is yes: even on a single dual-core system, as used in the test cases, it is viable if performance is not a key concern.

The choice of multi-agent technology and CI-based algorithms helps to alleviate these performance limitations, as the components can be parallelised on distributed systems or even ported to GP-GPU implementations. Furthermore, enhancements to the PSO and GA to improve diversity and stability need to be investigated.
13.5 Conclusion
In this dissertation an overview of the history of data storage was provided in order to explain the reasons behind the structure of modern data storage constructs. Data warehousing systems were discussed and their role within the process of knowledge discovery became clear. An overview of the process of data mining and its pre-processing tasks was also given.
A data pre-processing system was presented which allows for the automated orchestration of almost all the major data mining pre-processing tasks. Once the user has selected the required structure, the system is able to denormalise the selected database tables, calculate cluster centres and produce reduced datasets.
The future work plays a major role in the success of implementing this system in a corporate environment, owing to the performance constraints posed by high-dimensional data. However, the advantages of having pre-processed data in the form of cluster centres and reduced datasets for rule extraction algorithms to run on are clear. Apart from data clean-up, the system is capable of performing these tasks without user intervention; implemented correctly, it could therefore aid data mining applications.
The trade-off between accuracy and time will always exist, but improvements in computer hardware and in the parallelisation of algorithms mean that a system like this, once set up for the chosen database tables, could run in the background and generate up-to-date pre-processed data without users needing to oversee the process.
194
References
[Ada10] Adamson, C (2010). Star Schema The Complete Reference ISBN 0071744320
[Agg13] Aggarwal, C.C., Reddy, K.C. (2013) Data Clustering: Algorithms and Applications (Chapman & Hall/CRC Data Mining and Knowledge Discovery Series) ISBN 1466558210
[Agg12] Aggarwal, C.C., Zhai, C. (2012) Mining Text Data ISBN 1461432227
[Ant09] Anthes, G. (2009). Deep Data Dives Discover Natural Laws. Communications of the ACM, Volume 52(11) pages 13-14. (Accessed 23/03/2010).
[And06] Anderson, B.J., Gross, D.S., Musicant, D.R., Ritz, A.M., Smith, T.G. & Steinberg, L.E. (2006). Adapting K-Medians to Generate Normalized Cluster Centers (Abstract). Proceedings of the sixth SIAM international conference on Data Mining held in Bethesda, Maryland, April 20-22, 2006.SIAM. (Accessed 10/05/2012).
[Aff09] Affenzeller, M. et al. (2009) Genetic Algorithms and Genetic Programming: Modern Concepts and Practical Applications (Numerical Insights) ISBN 1584886293
[Atr] Atramentov, A & Honavar, A. (n.d). Speeding Up Multi-Relational Data Mining. Available from: citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.5 (Accessed 26/10/2013).
[Alf07] Alfred, A & Kazkov, D. (2007). Aggregating Multiple Instances in Relational Database Using Semi-Supervised Genetic Algorithm-based Clustering Technique (Abstract). Ioannidis,Y., Novikov. & Rachev, B. (Eds.) (© 2007). Local Proceedings of ADBIS 2007, pages 136-147. Technical University of Varna Available from: www.adbis.org/docs/lp/13.pdf (Accessed 10/05/2012).
[Ana06] Analoui, M. & Amiri, M.F. (2006). Feature Reduction of Nearest Neighbour Classifiers using Genetic Algorithm, Proceedings of the World Academy of Science, Engineering and Technology. Available from:M Analoui, MF Amiri - World Academy of Science, Engineering and …, 2006 - Citeseer (Accessed 10/05/2012).
195
[Ban05] Bandyopadhyay, S., Maulik, U., Holder, L.B. & Cook, D.J. (eds). (2005). Advanced Methods for Knowledge Discovery from Complex Data, Springer, Advanced Information and Knowledge Processing, volume XVIII
[Bie03] Biere, M. (2003). Business Intelligence for the Enterprise, IBM Press, ISBN 0131413031
[Bla02] Blackwell, T.M. & Bentley, P.J. (2002) Dynamic Search with Charged Swarms, Proceedings of the Genetic and Evolutionary Computation Conference, pages 19–26. ISBN: 1-55860-878-8.
[Blo03] Blockeel, H. & Sebag, M. (2003). Scalability and efficiency in multi-relational data mining (Abstract). SIGKDD Explorations, 5:17-30. Available from: citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.9.5511 (Accessed 10/05/2012).
[Boz03]
Bozdogan, H. (2003) Statistical Data Mining and Knowledge Discovery ISBN 1584883448
[Bra13] Bramer, M. (2013) Principles of Data Mining (Undergraduate Topics in Computer Science) ISBN 1447148835
[Bul12] Bulusu, L. (2012) Open Source Data Warehousing and Business Intelligence ISBN 1439816409
[Bur10] Burd, S.D. (2010) Systems Architecture [Paperback] ISBN 0538475331
[Cun08] Cunningham, P. (2007). Dimensionality Reduction: A Comparative Review. University College Dublin, Technical Report UCD-CSI-2007-7 August 8th, 2007. Available from: www.bradblock.com/Dimension_Reduction.pdf (Accessed 10/05/2012).
[Can06] Cano, J.R., Herrera, F. & Lozano, M. (2006). On the combination of evolutionary algorithms and stratified strategies for training set selection in data mining. Applied Soft Computing, 6(3):323–332. Available from: http://www.sciencedirect.com/science/article/B6W86-4G24X8J-1/2/bde2d2186e16a70cdcdb38480f7f217e) (Accessed 15/03/2010).
[Car12a] Carnegie mellow University (2012) Multi-agent systems Available from: http://www.cs.cmu.edu/~softagents/multi.html
196
[Car12b] Carreira-Perpiñán, M Á. (n.d.). Dimensionality Reduction by Unsupervised regression, Lecture slide, EECS, UC Merced, California. Available from: faculty.ucmerced.edu/mcarreira-perpinan/.../cvpr08b-slides.pdf (Accessed 10/05/2012).
[Cel12] Celco, J. (2012) Joe Celko's Trees and Hierarchies in SQL for Smarties, Second Edition (The Morgan Kaufmann Series in Data Management Systems) ISBN 0123877334
[Che04] Chen, C. & Ye, F. (2004). Particle Swarm Optimization Algorithm and Its Application to Clustering Analysis. Proceedings of the 2004 IEEE International Conference on Networking, Sensing Control held in Taiwi, Taiwan, March 21-23, 2004. Available form: dsp.szu.edu.cn/.../Clustering/02....
[Chi05] Di Chio, C., Poli, R. & Langdon, W.B. (2005). Evolution of Force-Generating Equations for PSO using GP, Article available from the Department of Computer Science, University of Essex, UK. Available from: http://cswww.essex.ac.uk/staff/poli/papers/gsice2005.pdf (Accessed 20/07/2013).
[Chu05] Hugh A., Chipman, H.A. & Gu, H. (2005). Interpretable Dimension Reduction, 32(9):969-987. Available from: www.mscs.dal.ca/~hgu/CJAS116847.pdf (Accessed 12/11/2012).
[Cor12] Coronel C, et al H. (2012). Database Systems: Design, Implementation, and Management ISBN 1111969604
[Dar06] Darmont, J., & Boussaïd, O. (eds). (2006).Processing and Managing Complex Data for Decision Support, IGI Global, ISBN 1591406552
[Dat05] Data Mining Preprocessing. (2005). Lecture 3, MATH3346 -Data Mining, S2/2005.Available from: medecine.forumed.org/pdflike.php?search=Data%20Mining (Accessed 10/05/2012).
[Dat08] Date, C.J. (2008). The Relational Database Dictionary, Extended Edition, firstPress, ISBN 1430210419
[Dat12] Date, C.J. (2012). Database design and Relational theory: normal forms and all that Jazz (Theory in Practice) ISBN 1449328016
197
[Dav08] Davidson, L., Kline, K., Klein S. & Windisch, K. (2008). Pro SQL Server 2008. Relational Database Design and Implementation, Apress, ISBN 143020866X
[Deb06] Dehuri, S., Ghosh, A. & Mall, R. (2006). Generic Algorithms for Multi-Criterion Classifications and Clustering in Data Mining. Journal of Computing and Information Sciences,4 (3):143-154. Available from: www.lania.mx/~ccoello/EMOO/dehuri06.pdf.gz (Accessed 10/05/2012).
[Deh08] Dehuri, S., Patnaik, S., Ghosh, A. & Mall, R. (January 2009). Application of elitist multi-objective genetic algorithm for classification rule generation. Applied Soft Computing, 8(1):477–487. Available from: http://www.sciencedirect.com/science/article/B6W86-4NB2WF0-2/2/b637ae128f8a5bb36ddf0f4e93a8ad44 (Accessed 15/03/2010).
[Doh03] Doherty, M. (2003).Database Cluster for e-Science, Proceedings of UK e-Science All Hands Meeting 2003. Available from: http://www.eminerals.org/papers/AH_2003_databasecluster.pdf (Accessed 01/06/2010).
[Dou08] Dou,W., Hu, J., Hirasawa, K. & Wu, G. (2008). Distributed Multi-Relational Data Mining Based on Genetic Algorithm (Abstract). 2008 IEEE Congress on Evolutionary Computation (CEC 2008). Available from: www.hflab.ips.waseda.ac.jp/~jinglu/Publics/cec08Dou.pdf
[Eng07] Engelbrecht, A. P. (2007). Computational Intelligence : An Introduction, 2nd edition, John Wiley &Sons. ISBN 1-843-34355-X
[Eve11] Everit, S.B. et al (2011) Cluster Analysis (Wiley Series in Probability and Statistics) ISBN 0470749911 2011
[Fon09] Fonesca, C.M. et al. Evolutionary Multi-Criterion Optimization: 5th International Conference, EMO 2009, Nantes, France, April 7-10, 2009, Proceedings (Lecture Notes in ... Computer Science and General Issues) ISBN 3642010199
[Far07] Faraoun, K.M. & Rabhi, A. (2007). Data dimensionality reduction based on genetic selection of feature subsets(Abstract). DOI: 10.1145/314403.314414. Available from: thescipub.com/pdf/10.3844/jcssp.2011.707.714 - United States
198
[Fel07] Feldman, R. & Sanger, J. (2007).The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data, Cambridge University Press, ISBN 0521836573
[Fer13] Fernandez, A et al. An Overview on the Structure and Applications for Business Intelligence and Data Mining in Cloud Computing 7th International Conference on Knowledge Management in Organizations: Service and Cloud Computing Advances in Intelligent Systems and Computing Volume 172, 2013, pp 559-570 DOI 10.1007/978-3-642-30867-3_50
[Fra96] Franklin, S.; Graesser, A. (1996). "Is it an Agent, or just a Program?: A Taxonomy for Autonomous \ Agents University of Memphis, Institute for Intelligent Systems
[Fre05] Freitas, A. A. (2005). Evolutionary Algorithms for Data Mining. Chapter 20 in Data Mining and Knowledge Discovery Handbook, pages 435– 467, US: Springer.
[Gan07] Gan, G. et al (2007) Data Clustering: Theory, Algorithms, and Applications (ASA-SIAM Series on Statistics and Applied Probability) ISBN 0898716233
[Gho04] Ghosh, A. & Nath, B. (June 2004). Multi-objective rule mining using genetic algorithms. Information Sciences, 163(1-3):123–133, Soft Computing and Data Mining. Available from: http://www.sciencedirect.com/science/article/B6V0C-4B3JHTS-1/2/03b327519650d278df6aeed05059393c (Accessed 15/03/2010).
[Gho06] Ghosh, J. Proceedings of the Sixth SIAM International Conference on Data Mining (2006) Page 166. (Accessed 25/10/2013).
[Giu12] Giusti, A. et al. (2012) Classification and Data Mining (Studies in Classification, Data Analysis, and Knowledge Organization) ISBN 3642288936
[Gra12] Graham, A. (2012) The Enterprise Data Model: A framework for enterprise data architecture, 2nd edition ISBN 0956582915
[Gup10] Gupta, R.V. (2010). Whitepaper: Introduction to Data Warehousing, Published on www.system-services.com System Services Corporation. Available from: http://www.system-services.com/dwintro.asp (Accessed: 08/04/2010).
199
[Gup09]
Gupta, S.B. & Mittal, A. (2009).Introduction to Database Management System, Laxmi Publications, ISBN 9381159319
[Had13] Apache Hadoop project. http://hadoop.apache.org/ (2013)
[Han11] [Han, J et al. Data Mining: Concepts and Techniques, Third Edition (The Morgan Kaufmann Series in Data Management Systems) ISBN 0123814790
[Han10] Han, J. (September 21, 2010). Data Preprocessing, Chapter 2 adapted from Han, J. & Kamber, M. (2006) Data Mining: Concepts and Techniques. Available from: www.liacs.nl/~erwin/dbdm2010/03_dbdm2010_Data_Preprocessing.pdf (Accessed 10/05/2012).
[Hib13] JBoss Hibernate ORM for Java http://hibernate.org/ (2013)
[Hus02] Hussein, A., Sarker, R. A. & Newton, C. S. (eds). (2002). Data Mining: A Heuristic Approach, IGI Global. ISBN 1930708254
[Hau04] Haupt, R. Haupt, S.E. (2004) Practical Genetic Algorithms ISBN 0471455652
[Hal08] Halpin, T. (©2008). Information Modeling and Relational Databases, Second Edition. Morgan Kaufmann ublishers, Books24x7. (Electronic Version). ISBN 0123735688
[Har08] Harizopoulos, S., Abadi, D.J., Madden, S. & Stonebraker, M. (2008). OLTP through the looking glass, and what we found there. Proceedings of the 2008 ACM SIGMOD international Conference on Management of Data held in Vancouver, Canada, SIGMOD '08. ISBN 978-1-60558-102-6
[Iba13]
MyBatis Project for Java http://blog.mybatis.org (2013)
[Inm05] Inmon, W. H. (2005). Building the Data Warehouse, Fourth Edition, John Wiley & Sons, ISBN 978-0-7645-9944-6
[IBM07] IBM. (2007) The IBM mainframe, Building on the past, defining the future, Published on www-07.ibm.com. Available from: http://www-07.ibm.com/systems/my/z/about/timeline/ (Accessed: 12/04/2010).
200
[Kai08]
Kain, F. (May 2008). How far is Pluto from Earth? , (Electronic Version). Published on www.universetoday.com Available from: http://www.universetoday.com/guide-to-space/pluto/how-far-is-pluto-from-earth/ (Accessed 08/04/2010)
[Kar08] Kari, L. & Rozenberg, G. (2008). The many facets of natural computing. Communications of the ACM, 52(10):72-83.
[Kay10] Kaya, M. (April 2010). Autonomous classifiers with understandable rule using multi-objective genetic algorithms, Expert Systems with Applications, 37(4):3489–3494. Available from:http://www.sciencedirect.com/science/article/B6V034XFY0F0- 10/2/6bd705a14d2061aaed9a70b5549219b3 (Accessed 15/03/2010).
[Kep10] Kepliz, B. Verbrugge, R. (2010) Teamwork in Multi-Agent Systems: A Formal Approach (Wiley Series in Agent Technology) ISBN 0470699884
[Khu08] Khushaba, R.N., Al-Ani, A. & Al-Jumaily, A.(2008). Differential Evolution based Feature Subset Selection. 19th International Conference on Pattern recognition, ICPR 2008, pages 1-4, IEEE. Available from: RN Khushaba, A Al-Ani… - Pattern Recognition, 2008. …, 2008 - ieeexplore.ieee.org (Accessed 10/05/2012).
[Kim08] Kimball, R. (2008).The Data Warehouse Lifecycle Toolkit, Second Edition, John Wiley & Sons. ISBN 978-0-470-14977-5
[Kim13] Kimball, R. (2013).The Data Warehouse Toolkit: The definitive guide to dimensional modelling, Third Edition, ISBN 1118530802 (Accessed 16/08/2013).
[Kir13] Kiranyaz, S et al. Multidimensional Particle Swarm Optimization for Machine Learning and Pattern Recognition (Adaptation, Learning, and Optimization) ISBN 3642378455
[Kog13] Kogan, J (2013) High Dimensional Introduction to Clustering Large and High-Dimensional Data ISBN 0521617936
[Kom11] Komarasamy, G & Amitabh, W Improving the cluster performance by combining PSO and K-Means algorithm. Department of Computer Science and Engineering Bannari Amman Institute of Technology (http://ictactjournals.in/ArticleDetails.aspx?id=405)
201
[Kuo09] Kuo, R.J., Chao, C.M. & Chiu, Y.T. (November 2009). Application of particle swarm optimization to association rule mining. Applied Soft Computing. Available from: http://www.sciencedirect.com/science/article/B6W86-4XSJVHK-7/2/5d3748225826351222be084cbd91b141 (Accessed 15/03/2010).
[Lab11] Laberge, R The Data Warehouse Mentor: Practical Data Warehouse and Business Intelligence Insights ISBN 0071745327
[Lan07] Langit, L. (2007). Foundations of SQL Server 2005 Business Intelligence, Apress. ISBN 1-59059-834-2
[Led13] Ledolter, J. (2013) Data Mining and Business Analytics with R ISBN 111844714X
[Lig07] Lightstone, S., Teorey, T. & Nadeau, T. (2007). Physical Database Design: The Database Professional's Guide to Exploiting Indexes, Views, Storage, and More, Morgan Kaufmann Publishers, ISBN 9780123693891
[Lob09] Lobel, L., Andrew J.B. & Forte, S. (2009). Programming Microsoft SQL Server 2008, ISBN 9780735626027
[Lon09] Longbing, C., Gorodetsky, V. & Mitkas, P.A. (May-June 2009). Agent Mining: The Synergy of Agents and Data Mining. Intelligent Systems, Journal IEEE Intelligent Systems, May/June 2009 (vol. 24 no. 3), May, 2009, 64-72.
[Mai08] Maimon, O. & Rokach, L. (2008). Soft Computing for Knowledge Discovery and DataMining. Introduction to Soft Computing for Knowledge Discovery and DataMining, pages 1–13. US: Springer.
[Man05] Manolopoulos, Y., Papadopoulos, A. N. & Vassilakopoulos. M Gr. (eds). (2005). Spatial Databases: Technologies, Techniques and Trends, IGI Global, ISBN 1591403871
[Mic02] Corporation, Microsoft. Microsoft Computer Dictionary, (2002). Fifth Edition, Microsoft Press, ISBN 735614954
202
[Mur06] [Nos13]
Murphy, C. (2006). Effects of Swarm Size on Attractive-Repulsive Particle Swarm Optimisation. Available from: http://ncra.ucd.ie/COMP40580/crc2006/murphy.pdf (Accessed 18 February 2012). NoSQL database. http://nosql-database.org/
[Ols08] Olson, D.L. & Delen, D. (2008) Genetic Algorithm support to Data Mining. Chapter 8 in Advanced Data Mining Techniques, pages 125–135. Berlin Heidelberg: Springer.
[Ora11] Oracle® Database Concepts 11g Release 1 (11.1) B28318-06 (2011) Available from:http://docs.oracle.com/cd/B28359_01/server.111/b28318/schema.htm (Accessed 20/10/2013)
[Ols12] Olsson A,E. Particle Swarm Optimization: Theory, Techniques and Applications (Engineering Tools, Techniques and Tables) (Editor) ISBN 1616685271
[Oxf13] The Oxford English Dictionary: Intelligence Available from: http://www.oxforddictionaries.com/definition/english/intelligence?q=intelligence (Accessed 20/10/2013)
[Pas07] Pascalau, E. (February 2007).Smart Agent Based on Rules for Web Knowledge Mining. Digital EcoSystems and Technologies Conference,held in Cairns,DEST '07, Inaugural IEEE-IES, pages 458–461. ISBN 1-4244-0470-3
[Pei98]
Pei, M, Goodman, E, Punch, W.F. (1998) Feature Extraction using Genetic Algorithms. Case centre for Computer-aided engineering and manufacturing, Michigan university.
[Pow06]
Powell, G. (2006). Beginning Database Design and Implementation, Wrox Press, ISBN 0764574906
[Pow03] Power, D.J. (May 31, 2003). A Brief History of Decision Support Systems. Published on DSSResources.COM, World Wide Web, version 2.8. Available from: http://DSSResources.COM/history/dsshistory.html (Accessed 08/04/2010).
203
[Qu10] QU, J., Shao, Z. &, Xiyu LIU, X. (June 2010). Mixed PSO Clustering Algorithm Using Point Symmetry Distance (Abstract). Journal of Computational Information Systems 6(6): 2027-2035. Available from: www.jofcis.com/publishedpapers/2010_6_6_2027_2035.pdf (Accessed 24/10/2013).
[Raj11] Rajaraman, A. Ullman, D,J. (2011) Rule Mining Mining of Massive Datasets ISBN 1107015359
[Ray10] Rayner, A. (2010). Summarizing Relational Data Using Semi-Supervised Genetic Algorithm-Based Clustering Techniques. Journal of Computer Science 6 (7): 775-784. Available from: thescipub.com/abstract/10.../jcssp.2010.775.784 - United States (Accessed 17/09/2013).
[Red12]
Redmond, E . Wilson, J.R Seven(2012) Databases in Seven Weeks: A Guide to Modern Databases and the NoSQL Movement ISBN – 13-978-1-93435-692-0
[Rob07] Rob, P. & Coronel, C. (2007). Database Systems – Design, Implementation
and Management, 7th Edition, Course Technology.
[Rob12] Rob, P. & Coronel, C. (2012). Database Systems – Design, Implementation and Management, 10th Edition, Course Technology. ISBN 1111969604
[Rui09] Xu, R. & Wunsch, D. (2009). Don. Clustering, IEEE Press,
[Rus10]
Russel, S. & Norvig, P. (2010). Artificial Intelligence: A modern approach, Pearson Education Inc.
[Sch10] Schrader, M. (2012). Oracle Essbase & Oracle OLAP: The Guide to Oracle's Multidimensional Solution, Oracle Press, ISBN 0071621822
[Sho08]
Shoham, Y. Leyton-Brown, K. (2008) Multiagent Systems: Algorithmic, Game-Theoretic, and Logical Foundations ISBN 0521899435
[Siv08] Sivanandam, S.N. & Deepa, S.N. (2008). Genetic Programming. Chapter 6 in Introduction to Genetic Algorithms, pages 131–163, Berlin Heidelberg: Springer. ISBN 978-3-540-73189-4
204
[Sik07] Sikora, R. & Piramuthu, S. (July 2007). Framework for efficient feature selection in genetic algorithm based data mining. European Journal of Operational Research, 180(2):723–737. Available from: http://www.sciencedirect.com/science/article/B6VCT-4K1X5KR-6/2/4140bd52594e345ef0a13168fc1e0991 (Accessed 15/03/2010).
[Sil08] Silvers, F. (2008). Building and Maintaining a Data Warehouse. Auerbach Publications, ISBN 1420064622
[Sim13] Simon, D. (2013). Evolutionary Optimization Algorithms, John Wiley & Sons. ISBN 0470937416
[Sou04] Sousa, T., Silva, A. & Neves, A. (May-June 2004). Particle Swarm based Data Mining Algorithms for classification tasks. Parallel Computing, 30(5-6): 767–783, Parallel and nature-inspired computational paradigms and applications. Available from: http://www.sciencedirect.com/science/article/B6V12-4CDJKX0-1/2/79dae1807d77bbaafc5d624822e17ddb (Accessed 15/03/2010).
[Sta07] Stackowiak, R., Rayman, J. & Greenwald, R. (2007). Oracle Data Warehousing and Business Intelligence Solutions, John Wiley & Sons. ISBN 0471919217
[Ste09] Stephens, R. (2009). Beginning Database Design Solutions, Wrox Press, ISBN 1861004966
[Tab09] Tabatabaei, N.M. & Mortezaeei, S.R. (2009). Review of Multi-Agent Systems (MAS), a New Tool for the Control and Management of Modern Power Systems. International Journal on Technical and Physical Problems of Engineering (IJTPE), Transaction on Power Engineering.
[Tan08] Taniar, D. (ed). (2008). Data Mining and Knowledge Discovery Technologies, IGI Global ISBN 1599049600
[Tan09] Tan, K.C., Teoh, E.J., Yu, Q. & Goh, K.C. (May 2009). A hybrid evolutionary algorithm for attribute selection in data mining. Expert Systems with Applications, 36(4):8616–8630. Available from: http://www.sciencedirect.com/science/article/B6V03-4TTMJP3-5/2/72c347f15102bbe142ac69aad1fab7ec (Accessed 15/03/2010).
[Tra09] Trautmann, H., Wagner, T., Naujoks, B., Preuss, M. & Mehnen, J. (Winter 2009). Statistical methods for convergence detection of multi-objective evolutionary algorithms. Evolutionary Computation, 17(4): 493–509.
[Uma13] Mishra, U. (August 8, 2013). Introduction to Management Information System. Available at SSRN: http://ssrn.com/abstract=2307474 or http://dx.doi.org/10.2139/ssrn.2307474
[Uhr09] Uhrmacher, A.M. & Weyns, D. (2009). Multi-Agent Systems: Simulation and Applications (Computational Analysis, Synthesis, and Design of Dynamic Systems), CRC Press. ISBN 1420070231
[Van12] Van der Lans, R. (2012). Data Virtualization for Business Intelligence Systems: Revolutionizing Data Integration for Data Warehouses (The Morgan Kaufmann Series on Business Intelligence), Morgan Kaufmann. ISBN 0123944252
[Vog06] Voges, K.E. & Pope, N.K.Ll. (eds). (2006). Business Applications and Computational Intelligence, IGI Global. ISBN 9781591407027
[Wal07] Walkenbach, J. (2007). "Excel 2007—Where It Came From". Chapter 1 in Excel 2007 Power Programming with VBA, John Wiley & Sons. ISBN 978-0-470-04401-8
[Wan05] Wang, L. & Xiuju, F. (2005). Data Mining with Computational Intelligence, Chapter 1: Introduction, pages 1-5, Springer-Verlag Berlin Heidelberg.
[Wei00] Weiss, G. (ed). (2000). Multiagent Systems: A Modern Approach to Distributed Artificial Intelligence, MIT Press. ISBN 0262731312
[Wit05] Witten, I. H. & Frank, E. (2005). Data Mining – Practical Machine Learning Tools and Techniques, Second Edition, Elsevier Inc.
[Won02] Wong, M. L. & Leung, K.S. (2002). Introduction. Chapter 1 in Data Mining using grammar based genetic programming and applications, Kluwer Academic Publishers.
[Woo09] Wooldridge, M. & Jennings, N.R. (2009). Intelligent Agents: Theory and Practice.
[Wra09] Wray, R. (May 2009). Internet data heads for 500bn gigabytes (Electronic Version). Published on www.guardian.co.uk. Available from: http://www.guardian.co.uk/business/2009/may/18/digital-content-expansion (Accessed: 27/03/2010).
[Wre07] Wrembel, R. & Koncilia, C. (2007). Data Warehouses and OLAP: Concepts, Architectures and Solutions, IGI Global. ISBN 1599043645
[Yan11] Yang, C., Hsiao, C. & Chuang, L. (2011). Accelerated Chaotic Particle Swarm Optimization for Data Clustering. 2009 International Conference on Machine Learning and Computing, IPCSIT vol. 3 (2011), IACSIT Press, Singapore. Available from: www.ipcsit.com/vol3/045L015.pdf (Accessed 10/05/2012).
[Ye05] Ye, F. & Chen, C. (2005). Alternative KPSO-Clustering Algorithm (Abstract). Tamkang Journal of Science and Engineering, 8(2): 165-174. Available from: www2.tku.edu.tw/~tkjse/8-2/8-2-9.pdf (Accessed 10/05/2012).
[Zam08] Zamalloa, M., Rodriguez-Fuentes, L.J., Penagarikano, M., Bordel, G. & Uribe, J.P. (2008). Feature Dimensionality Reduction through Genetic Algorithms for Faster Speaker Recognition (Abstract). 16th European Signal Processing Conference (EUSIPCO 2008), Lausanne, Switzerland, August 25-29, 2008. Available from: gtts.ehu.es/gtts/NT/fulltext/ZamalloaEUSIPCO08b.pdf (Accessed 10/05/2012).
[Zan99] Zaïane, O.R. (1999). Introduction to Data Mining, Chapter I in CMPUT690 Principles of Knowledge Discovery in Databases. Available from: webdocs.cs.ualberta.ca/~zaiane/courses/.../Chapter1/index.html (Accessed 10/05/2012).
[Zon05] Ma, Z. (ed). (2005). Advances in Fuzzy Object-Oriented Databases: Modeling and Applications, IGI Global, ISBN 1591403855
Addendum A
Raw Iris Dataset sample
The Iris dataset is a multivariate dataset consisting of data that quantify the
morphological variation of Iris flowers of three related species: I. setosa, I. versicolor
and I. virginica. The dataset consists of 150 examples, 50 of each species, and has four
attributes, measured in centimetres, as shown in Table A.1.
Table A.1: The Iris dataset
Sepal length Sepal width Petal length Petal width Species
5.1 3.5 1.4 0.2 I. setosa
4.9 3.0 1.4 0.2 I. setosa
4.7 3.2 1.3 0.2 I. setosa
4.6 3.1 1.5 0.2 I. setosa
5.0 3.6 1.4 0.2 I. setosa
5.4 3.9 1.7 0.4 I. setosa
4.6 3.4 1.4 0.3 I. setosa
5.0 3.4 1.5 0.2 I. setosa
4.4 2.9 1.4 0.2 I. setosa
4.9 3.1 1.5 0.1 I. setosa
5.4 3.7 1.5 0.2 I. setosa
4.8 3.4 1.6 0.2 I. setosa
4.8 3.0 1.4 0.1 I. setosa
4.3 3.0 1.1 0.1 I. setosa
5.8 4.0 1.2 0.2 I. setosa
5.7 4.4 1.5 0.4 I. setosa
5.4 3.9 1.3 0.4 I. setosa
5.1 3.5 1.4 0.3 I. setosa
5.7 3.8 1.7 0.3 I. setosa
5.1 3.8 1.5 0.3 I. setosa
5.4 3.4 1.7 0.2 I. setosa
5.1 3.7 1.5 0.4 I. setosa
4.6 3.6 1.0 0.2 I. setosa
5.1 3.3 1.7 0.5 I. setosa
4.8 3.4 1.9 0.2 I. setosa
5.0 3.0 1.6 0.2 I. setosa
5.1 3.8 1.6 0.2 I. setosa
4.6 3.2 1.4 0.2 I. setosa
5.3 3.7 1.5 0.2 I. setosa
5.0 3.3 1.4 0.2 I. setosa
7.0 3.2 4.7 1.4 I. versicolor
6.4 3.2 4.5 1.5 I. versicolor
6.9 3.1 4.9 1.5 I. versicolor
5.5 2.3 4.0 1.3 I. versicolor
6.5 2.8 4.6 1.5 I. versicolor
5.7 2.8 4.5 1.3 I. versicolor
6.3 3.3 4.7 1.6 I. versicolor
5.7 2.8 4.1 1.3 I. versicolor
6.3 3.3 6.0 2.5 I. virginica
5.8 2.7 5.1 1.9 I. virginica
7.1 3.0 5.9 2.1 I. virginica
6.3 2.9 5.6 1.8 I. virginica
6.5 3.0 5.8 2.2 I. virginica
7.6 3.0 6.6 2.1 I. virginica
4.9 2.5 4.5 1.7 I. virginica
7.3 2.9 6.3 1.8 I. virginica
6.7 2.5 5.8 1.8 I. virginica
6.5 3.0 5.2 2.0 I. virginica
6.2 3.4 5.4 2.3 I. virginica
5.9 3.0 5.1 1.8 I. virginica
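As an illustration of how examples such as those in Table A.1 can be grouped by class label, the following sketch (illustrative only, not part of the EPS4DM prototype) represents a few rows as tuples and computes the mean petal length per species:

```python
from collections import defaultdict

# A few rows from Table A.1, as (sepal_len, sepal_wid, petal_len, petal_wid, species).
rows = [
    (5.1, 3.5, 1.4, 0.2, "I. setosa"),
    (4.9, 3.0, 1.4, 0.2, "I. setosa"),
    (7.0, 3.2, 4.7, 1.4, "I. versicolor"),
    (6.4, 3.2, 4.5, 1.5, "I. versicolor"),
    (6.3, 3.3, 6.0, 2.5, "I. virginica"),
    (5.8, 2.7, 5.1, 1.9, "I. virginica"),
]

# Group petal lengths by the species label and compute the mean per class.
petal_lengths = defaultdict(list)
for sl, sw, pl, pw, species in rows:
    petal_lengths[species].append(pl)

means = {s: sum(v) / len(v) for s, v in petal_lengths.items()}
```

Even on this small sample, the per-class means separate I. setosa clearly from the other two species, which is why the Iris dataset is a standard benchmark for classification and clustering algorithms.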
210
Raw Glass Dataset sample
The glass dataset contains the data of six types of glass, defined in terms of their oxide
content (e.g. Na, Fe, K). The types are:
building_windows_float_processed,
building_windows_non_float_processed,
vehicle_windows_float_processed,
containers, tableware and headlamps.
The dataset consists of 214 examples with 10 attributes and the following is a sample:
1, 1.52101,13.64,4.49,1.10,71.78,0.06,8.75, 0.00,0.00,1
2, 1.51761,13.89,3.60,1.36,72.73,0.48,7.83, 0.00,0.00,1
3, 1.51618,13.53,3.55,1.54,72.99,0.39,7.78, 0.00,0.00,1
101,1.51655,12.75,2.85,1.44,73.27,0.57,8.79, 0.11,0.22,2
102,1.51730,12.35,2.72,1.63,72.87,0.70,9.23, 0.00,0.00,2
103,1.51820,12.62,2.76,0.83,73.81,0.35,9.42, 0.00,0.20,2
152,1.52127,14.32,3.90,0.83,71.50,0.00,9.49, 0.00,0.00,3
153,1.51779,13.64,3.65,0.65,73.00,0.06,8.93, 0.00,0.00,3
154,1.51610,13.42,3.40,1.22,72.69,0.59,8.32, 0.00,0.00,3
164,1.51514,14.01,2.68,3.50,69.89,1.68,5.87, 2.20,0.00,5
165,1.51915,12.73,1.85,1.86,72.69,0.60,10.09,0.00,0.00,5
166,1.52171,11.56,1.88,1.56,72.86,0.47,11.41,0.00,0.00,5
183,1.51916,14.15,0.00,2.09,72.74,0.00,10.88,0.00,0.00,6
184,1.51969,14.56,0.00,0.56,73.48,0.00,11.22,0.00,0.00,6
185,1.51115,17.38,0.00,0.34,75.41,0.00,6.65, 0.00,0.00,6
200,1.51609,15.01,0.00,2.51,73.05,0.05,8.83, 0.53,0.00,7
201,1.51508,15.15,0.00,2.25,73.50,0.00,8.34, 0.63,0.00,7
202,1.51653,11.95,0.00,1.19,75.18,2.70,8.93, 0.00,0.00,7
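Each comma-separated record above holds an example number, the refractive index, eight oxide weight percentages and an integer class label for the glass type. A minimal parsing sketch (the function name is illustrative, not part of the prototype) could look as follows:

```python
def parse_glass_record(line):
    """Split one glass-dataset record into its identifier,
    its nine numeric attributes (refractive index plus eight
    oxide percentages) and its glass-type class label."""
    fields = [f.strip() for f in line.split(",")]
    example_id = int(fields[0])
    attributes = [float(f) for f in fields[1:10]]
    glass_type = int(fields[10])
    return example_id, attributes, glass_type

# First record from the sample above.
record = "1, 1.52101,13.64,4.49,1.10,71.78,0.06,8.75, 0.00,0.00,1"
example_id, attributes, glass_type = parse_glass_record(record)
```

Stripping the example number before clustering is important: it is a surrogate key that carries no information about the oxide content and would otherwise dominate any distance measure.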
The 2,4-D Dataset
Soil samples were collected from a site that is contaminated with a pesticide known as
2,4-D (2,4-dichlorophenoxyacetic acid).
There are three classes, based on three genetically similar microbial isolates, which
show the ability to degrade 2,4-D. The dataset consists of 241 samples with 96
attributes.
Iris Dataset Relational structures
Figure A.1 Simple Iris relational structure
Figure A.1 represents the Iris dataset in a simple relational structure with the Flower
species as a foreign key.
Figure A.2 Complex Iris relational structure
Figure A.2 presents a more complex representation of the Iris dataset whereby each
attribute is a foreign key.