INTELLIGENT PRE-PROCESSING FOR DATA MINING
by
LUDWIG DE BRUIN
DISSERTATION
submitted in fulfilment of the requirements for the degree of
MASTER OF SCIENCE
in
INFORMATION TECHNOLOGY
at the
UNIVERSITY OF JOHANNESBURG
SUPERVISOR: PROF. E.M. EHLERS
CO-SUPERVISOR: DR. G.J. VAN NIEKERK
2013
Keywords
Data mining, data pre-processing, classification, clustering, computational intelligence,
intelligent agents
Abstract
Data is generated at an ever-increasing rate and it has become difficult to process or
analyse it in its raw form. Most data is generated by processes or measuring
equipment, resulting in very large volumes of data per time unit.
Companies and corporations rely on their Management Information Systems (MIS)
teams to perform daily Extract, Transform and Load (ETL) operations to data
warehouses in order to provide them with reports. Data mining is a Business
Intelligence (BI) tool and can be defined as the process of discovering hidden
information in existing data repositories. The successful operation of data mining
algorithms requires data to be pre-processed so that algorithms can derive IF-THEN rules.
This dissertation presents a data pre-processing model that transforms data in an
intelligent manner to enhance its suitability for data mining operations. The Extract,
Pre-Process and Save for Data Mining (EPS4DM) model is proposed. This model
performs the required pre-processing tasks on a chosen dataset and transforms the
dataset into the formats required. The transformed data can then be accessed by data
mining algorithms from a data mining mart whenever needed.
The proof-of-concept prototype features agent-based Computational Intelligence (CI)
algorithms, which allow the pre-processing tasks of classification and clustering
to be performed as a means of dimensionality reduction. The task of clustering requires
the denormalisation of relational structures and is automated using a feature-vector
approach. A Particle Swarm Optimisation (PSO) algorithm is run on the patterns to find
cluster centres based on Euclidean distances. The task of classification takes a
feature vector as input and makes use of a Genetic Algorithm (GA) to produce a
transformation matrix that reduces the number of significant features in the dataset. The
results of both the classification and clustering processes are stored in the data mart.
Table of Contents
Keywords 2
Abstract 3
List of Figures 8
List of Tables 10
List of Algorithms 11
List of Abbreviations 12
Chapter 1: Introduction 14
1.1 Background 14
1.2 Purpose 15
1.3 Research questions 15
1.4 Methodology 16
1.5 Dissertation structure 16
1.6 Conclusion 20
Chapter 2: Data storage evolution 22
2.1 Introduction 22
2.2 Background and context 22
2.2.1 The Relational Data Model 26
2.2.2 Segregated environment 27
2.2.3 Viewing of Data 30
2.3 Conclusion 31
Chapter 3: Database models 33
3.1 Introduction 33
3.2 Database models 33
3.2.1 The file system models 33
3.2.2 Disadvantages of Traditional File System Models 37
3.2.3 The Relational Model 38
3.2.4 Object-oriented model 43
3.2.5 Object-relational model 46
3.2.6 XML data model 47
3.2.7 Data manipulation 48
3.3 ANSI/SPARC architecture 49
3.4 Important database concepts 49
3.4.1 Schema 50
3.4.2 Virtual Table (View) 50
3.4.3 Metadata 51
3.4.4 Data dictionary 51
3.4.5 System catalog 52
3.4.6 Indexing 52
3.4.7 Transactions and concurrency control 53
3.5 Specialised database models 55
3.5.1 Document-oriented model 55
3.5.2 Temporal model 56
3.5.3 Spatial model 55
3.5.4 Fuzzy object model 56
3.5.5 Dimensional model 57
3.5.6 NoSQL databases 57
3.6 Conclusion 57
Chapter 4: Data warehousing 60
4.1 Introduction 60
4.2 Business Intelligence 60
4.3 Data warehouse fundamentals 61
4.3.1 Subject oriented 61
4.3.2 Integrated 62
4.3.3 Non-volatility 65
4.3.4 Time variant 66
4.4 Creating analytical data 66
4.4.1 Selecting data sources 66
4.4.2 Integrating the data 68
4.4.3 Loading operational data 69
4.5 Data warehouse and data models 71
4.5.1 Inmon’s top-down approach 71
4.5.2 Kimball’s bottom-up approach 72
4.6 OLAP and data warehouse 75
4.6.1 Data analysis example 75
4.6.2 OLAP overview 78
4.7 Data Warehouse structures and NoSQL databases 79
4.8 Conclusion 79
Chapter 5: Data Mining 82
5.1 Introduction 82
5.2 Knowledge Discovery in Databases (KDD) 83
5.3 Overview of the data mining process 84
5.4 Data pre-processing tasks 90
5.4.1 Data denormalisation 90
5.4.2 Clustering 91
5.4.3 Classification 94
5.4.4 Dimensionality Reduction 95
5.5 Dimensionality Reduction techniques 99
5.6 Rule Extraction 101
5.7 Conclusion 102
Chapter 6: Intelligent Agents 105
6.1 Introduction 105
6.2 Artificial Intelligence 105
6.3 Intelligent Agent 106
6.4 Multi-agent systems 110
6.5 Conclusion 112
Chapter 7: Computational Intelligence 115
7.1 Introduction 115
7.2 Computational Swarm Intelligence 115
7.3 Evolutionary Computation 121
7.4 Conclusion 124
Chapter 8: The EPS4DM model 126
8.1 Introduction 126
8.2 Traditional ETL compared to the EPS4DM model 126
8.3 Clustering 128
8.4 Classification 132
8.5 Overview of the data mining mart 135
8.6 Conclusion 136
Chapter 9: The EPS4DM’s data extraction process 139
9.1 Introduction 139
9.2 Mapping relational structures 139
9.3 Process overview 143
9.4 Conclusion 144
Chapter 10: EPS4DM's Clustering process 146
10.1 Introduction 146
10.2 Determining cluster centres 146
10.3 PSO clustering algorithm 148
10.4 Measuring cluster quality 153
10.5 Process overview 154
10.6 Conclusion 154
Chapter 11: EPS4DM's Classification Process 157
11.1 Introduction 157
11.2 The GA Feature extractor 157
11.3 Chromosome mapping 160
11.4 Crossover operator 160
11.5 Mutation operator 161
11.6 Selection strategy 162
11.7 KNN rule and training data 163
11.8 Fitness function 163
11.9 GA parameters 165
11.10 Conclusion 165
Chapter 12: Implementation and Results 168
12.1 Introduction 168
12.2 Java based implementation 168
12.3 The Test setup 172
12.4 PSO-based clustering results 173
12.5 Clustering conclusion 180
12.6 GA-based classification results 180
12.7 GA algorithm parameters 180
12.8 Randomised datasets 180
12.9 UCI-standard datasets 183
12.10 High-dimensional datasets 184
12.11 Classification conclusion 184
12.12 The data mining mart 185
12.13 Conclusion 185
Chapter 13: Conclusion and Future work 188
13.1 Introduction 188
13.2 Evaluation 188
13.3 Research-driven questions that had to be answered by the dissertation 189
13.4 Future work 192
13.5 Conclusion 192
References 194
Addendum A 207
List of figures
Figure 2-1 Visualisation of extract programs creating child databases 25
Figure 2-2 Visualisation of a shared nothing cluster 28
Figure 2-3 Visualisation of a shared disk cluster 29
Figure 2-4 The segregated database environment 29
Figure 3-1 Example hierarchy of organisation 35
Figure 3-2 Example record structure of an employee 36
Figure 3-3 Example graph of cities showing distances 36
Figure 3-4 Example use of foreign key 44
Figure 3-5 Example of a normalised database structure 45
Figure 3-6 Example OO data model 46
Figure 3-7 ANSI/SPARC architecture 50
Figure 3-8 Example of an indexing tree 53
Figure 4-1 Subject orientation 62
Figure 4-2 Encoding of gender data 63
Figure 4-3 Semantic translation of data 63
Figure 4-4 Conversion of data format 64
Figure 4-5 Time-key hierarchical differences 65
Figure 4-6 The non-volatility of data warehouse data 66
Figure 4-7 Inmon’s data warehouse architecture 72
Figure 4-8 Dimensional modelling – star schema 74
Figure 4-9 Star schema showing derived dimension tables 74
Figure 4-10 Kimball’s data warehouse architecture 75
Figure 5-1 KDP in the production environment 83
Figure 5-2 The data mining process 84
Figure 5-3 Data selection 86
Figure 5-4 Visual representation of outliers 88
Figure 5-5 Linear task-oriented overview of the data mining process 89
Figure 5-6 Clustering example 92
Figure 5-7 Visual representation of hill-climbing clustering 93
Figure 5-8 Divisive clustering 94
Figure 5-9 Visual depiction of K-Nearest Neighbour classification 95
Figure 5-10 Class identification in one dimension 97
Figure 5-11 Class identification in two dimensions 97
Figure 5-12 Class identification in three dimensions 98
Figure 6-1 Architecture for a simple reflex agent 107
Figure 7-1 Particles moving towards the new global best 118
Figure 7-2 Charged particles being repelled 120
Figure 7-3 Crossover operation of two Chromosomes 122
Figure 8-1 Traditional ETL compared to the EPS4DM model 127
Figure 8-2 The clustering process as used in the EPS4DM model 129
Figure 8-3 Associativity between target records [Alf07, Ray10] 130
Figure 8-4 Deriving feature vectors from relational data 130
Figure 8-5 The data clustering process 131
Figure 8-6 Feature Vector normalisation 133
Figure 8-7 Building the normalised feature vector 133
Figure 8-8 Building the transformation vectors [Far07] 134
Figure 8-9 The classification process 135
Figure 8-10 The data mining mart 136
Figure 9-1 Associativity between target records combined as feature vectors [Alf07] 140
Figure 9-2 The effect of feature weights to expose outliers [Ray00] 141
Figure 9-3 Example feature vector with indexes to complex types 142
Figure 9-4 Example of a feature pattern being built 142
Figure 9-5 Data extraction process 144
Figure 10-1 PSO particle representing a solution in two dimensions with four clusters 150
Figure 11-1 GA/KNN feature extractor 159
Figure 11-2 Bit mask used for chromosome mapping [Ray00] 160
Figure 11-3 Feature weight used as chromosome mapping 160
Figure 11-4 Crossover operation of mapped feature vectors [Eng07] 161
Figure 11-5 Mutation operator of a single feature vector [Eng07] 162
Figure 12-1 Helper classes within the EPS4DM prototype 170
Figure 12-2 PSO clustering classes within the EPS4DM prototype 171
Figure 12-3 KNN classifier and GA chromosome classes used within the EPS4DM prototype 173
Figure 12-4 GA algorithm and result classes used within the EPS4DM prototype 174
Figure 12-5 K-Means and PSO cluster centre analysis 175
Figure 12-6 K-Means and PSO cluster centre analysis with error rate feedback 176
Figure A.1 Simple Iris relational structure 211
Figure A.2 Complex Iris relational structure 211
List of tables
Table 3-1 Example flat file showing login times of employees 34
Table 3-2 Naming conventions 41
Table 4-1 OLTP Invoice Table 76
Table 4-2 OLTP Invoice Line Table 77
Table 4-3 OLAP Dimensional Table 77
Table 11.1 GA Parameters 165
Table 12.1 Clustering data sample from Iris dataset 169
Table 12.2 Classification data sample from Iris dataset 169
Table 12.3 Intra-cluster distances on the Iris dataset 177
Table 12.4 Error rates on the Iris dataset 177
Table 12.5 Intra-cluster distances on the Iris dataset for simple relational mapping 178
Table 12.6 Error rates on the Iris dataset for simple relational mapping 178
Table 12.7 Intra-cluster distances on the Iris dataset for complex relational mapping 178
Table 12.8 Error rates on the Iris dataset for complex relational mapping 178
Table 12.9 Fitness values and cluster centres for three sets of random data 179
Table 12.10 GA-based algorithm parameters 181
Table 12.11 Generated Feature matrix 182
Table 12.12 Transformation matrix w* values 182
Table 12.13 Iris and Glass datasets 184
Table 12.14 Error rate of 2, 4-D dataset 185
Table A.1 The Iris dataset 207
List of algorithms
General K-Means clustering 91
Hill climbing clustering 93
K-Nearest neighbour classification 95
Principal component analysis 99
Particle Swarm Optimisation (PSO) 116
Genetic algorithm 123
Traditional K-Means to find cluster centres 146
PSO clustering 148
PSO algorithm to calculate cluster centres 151
List of abbreviations
ACID Atomicity Consistency Isolation Durability
AI Artificial Intelligence
ANSI American National Standards Institute
ASCII American Standard Code for Information Interchange
BI Business Intelligence
CI Computational Intelligence
CPSO Charged Particle Swarm Optimisation
CRUD Create Read Update Delete
DBMS Database Management System
DDL Data Definition Language
DML Data Manipulation Language
EBCDIC Extended Binary Coded Decimal Interchange Code
ERD Entity Relationship Diagram
ETL Extract Transform Load
GA Genetic Algorithm
KDD Knowledge Discovery in Databases
KDP Knowledge Discovery Process
KIF Knowledge Interchange Format
KNN K-Nearest Neighbour
KQML Knowledge Query and Manipulation Language
MAS Multi-Agent System
MIS Management Information System
OLAP Online Analytical Processing
OLTP Online Transactional Processing
OO Object Orientation
OODBMS Object-Oriented Database Management System
O/RDBMS Object-Relational Database Management System
ORM Object Relational Mapping
PSO Particle Swarm Optimisation
RDBMS Relational Database Management System
RSS Rich Site Summary
SPARC Standards Planning and Requirements Committee
SQL Structured Query Language
TSP Travelling Salesperson Problem
XML eXtensible Mark-up Language
Chapter 1
Introduction
1.1 Background
With the advent of inexpensive multi-gigabyte storage devices and broadband network
connectivity options, organisations, businesses and individuals are collecting data at an
alarming rate. The sheer volume of data has become so large that it is practically
impossible to visualise the available data, let alone analyse it.
Human-generated data (i.e. documents, letters and records) constitutes only a small
fraction of the overall body of data. Electronic systems such as online searches and
banking transactions are responsible for the bulk of the data. This data is accumulated
in the hope that it might be of use at a later stage, but according to Witten there is an
ever-growing gap between data and the understanding thereof [Wit05].
The underlying principle is that collated data is verbose in nature, as it needs to
describe transactions and maintain state so that Online Transaction Processing (OLTP)
systems can meet business needs. The normal data mining process dictates that an
Extract, Transform and Load (ETL) process from either a relational database or a data
warehouse provides the input for the rule extraction algorithms. This implies that all
data from a relational database or data warehouse is subject to pre-processing.
Organisational rules dictate the frequency of ETL processes; they usually run once
nightly, which means that data mining pre-processing tasks operate on data that is
already dated.
The objective of this dissertation is to propose a data mining pre-processing system that
manages the data mining ETL pre-processing tasks and stores the pre-processed data
in a data mining mart. The system maintains the data in this mart, and the data mining
algorithms can use it as their input. Because the mart always holds the most up-to-date
pre-processed data, the algorithms can effectively be run whenever the need arises,
without costly time being spent on pre-processing.
1.2 Purpose
The purpose of this dissertation is to provide an overview of the current state of
production OLTP databases and data warehousing structures as the input to the data
mining process. It also investigates the current data mining pre-processing tasks and
how Computational Intelligence (CI) based algorithms, coupled with multi-agent
technology, can manage these tasks.
A data pre-processing model, the EPS4DM model, is proposed which will collate the
required relational data as decided upon by a user, perform the pre-processing tasks
and store the processed data in a data mining mart. This enables the data mining rule
extraction algorithms to use this mart as input source for their process.
The dissertation investigates the viability of implementing agents, embedded with CI, to
orchestrate the pre-processing tasks in order to maintain a data mining mart. To direct
the contributions of this dissertation, four research questions have been devised.
1.3 Research Questions
To enhance the success of this dissertation and to support the goal of designing a pre-
processing system, the following questions are posed to guide the research:
Question 1: Why are OLTP system databases structured the way they are, how is this
data stored and in what way does data warehousing differ from data
mining?
Question 2: What are the current data mining pre-processing tasks and how are they
performed?
Question 3: How can multi-agent technology embedded with CI orchestrate and
execute the pre-processing tasks?
Question 4: How viable is it to maintain a data mining mart with pre-processed
relational data?
The following section provides an overview of the research methodology.
1.4 Research Methodology
The research process is derived from the motivation to model a system to maintain a
data mining mart of pre-processed data. To reach this goal, background to the problem
is given and the motivation formulated. Research questions aimed at answering the key
aspects of the goal of the dissertation are also established.
The second step is to conduct an in-depth literature study to provide insight into
current production database structures, data warehousing, the data mining process and
the pre-processing steps. The literature study addresses research question 1. A process
methodology is required when there needs to be an understanding of the processes
needed to accomplish tasks in a computer science application. An investigation into and
explanation of the data mining processes and the techniques used to perform the pre-
processing tasks addresses research question 2. The study also incorporates the
mathematical foundation of the processes involved [Ama07].
The third step is to propose a model for orchestrating the tasks required by questions 3
and 4. This model encompasses the mathematical, algorithmic and architectural aspects
related to each sub-task of the global process. The experimental methodology is utilised
when solutions to problems within computer science need to be evaluated. This will
help to answer the mentioned questions. The proposed model is then implemented as a
prototype system [Ama07]. The fourth and final step is to provide the empirical results
of the prototype, the viability of maintaining a data mart and any future improvements
of the work.
The next section provides an overview of the structure of the dissertation.
1.5 Dissertation structure
The structure of the dissertation is summarised in Figure 1.1 and each chapter’s content
is discussed in this section.
Chapter 1 provides the background to the problem and introduces the research
questions and dissertation structure.
1.5.1 Background
Chapter 2 gives a brief historic overview of how data storage has evolved since the early
days of computing. The purpose of the chapter is to introduce legacy data storage
infrastructures, the way in which they evolved into the now widely adopted relational
model of data, and why the need arose for a segregated data environment.
Chapter 3 provides an overview of the different database models (past and present) and
the shortcomings associated with them; this is done to justify why production
environments use the models they do. The purpose of this chapter is to describe the
constraints and rules associated with a production transaction management database
(an OLTP system), which must address the data needs of an organisation. Some of the
inherent problems of the relational database can be addressed by making use of a
specialised database model.
Chapter 4 gives an overview of data warehousing: why it is needed, how it relates to the
relational database and in what way data is stored in a warehouse. The purpose of the
chapter is to explain why data warehouses exist, how they support Online Analytical
Processing (OLAP) and to introduce the concepts behind the inner workings of the
multi-dimensional data structure.
Chapter 5 provides an overview of the data mining process and the way in which it
forms part of the larger process of Knowledge Discovery in Databases (KDD) followed
by an introduction to the important pre-processing tasks and the core mathematical
fundamentals behind them. The purpose of this chapter is to provide the reader with an
understanding of the differences between data mining and data warehousing, introduce
the problems associated with pre-processing and the current techniques used to
accomplish the tasks of clustering and classification as dimensionality reduction
techniques.
Chapter 6 gives an academic overview of intelligent agents, artificial intelligence and
multi-agent technology. The aim of the chapter is to acquaint the reader with the
concepts of agents and the way in which they are employed in systems to accomplish
tasks. This is followed by an introduction to Artificial Intelligence (AI) as a precursor to
Computational Intelligence (CI).
Chapter 7 provides an overview of computational intelligence and identifies it as a
branch of AI. The purpose of the chapter is to introduce two methods within CI, namely
swarm intelligence and genetic algorithms, and to provide the reader with the core
concepts behind each algorithm. The proposed model draws upon these mathematical
fundamentals.
1.5.2 Model and Implementation
Chapters 6 and 7 introduce the tools required to construct the pre-processing system,
which is defined in the chapters discussed in this section.
Chapter 8 gives an architectural overview of the EPS4DM model describing how multi-
agent technology is used to orchestrate the processes of extracting the relational data
into formats required by the classification and clustering processes and in what way the
results of these sub-processes are used to maintain the data mart.
Chapter 9 provides the proposed model for data extraction and denormalisation. The
purpose of the chapter is to introduce the concept of feature vectors and the way in
which the extraction sub-process will construct the vectors as required in the clustering
process.
Chapter 10 provides the proposed model for the clustering process and the necessary
architectural, algorithmic and mathematical models. The purpose of the chapter is to
give a holistic overview of the sub-processes required, namely: a PSO-based cluster
centre calculation, the method in which cluster quality is measured and the persistence
of the clustered data.
Figure 1.1: Dissertation structure
[Diagram: Chapters 1–13, with Chapters 2–7 providing the background (data storage evolution, database models, data warehousing, data mining, intelligent agents, computational intelligence), Chapters 8–11 forming the EPS4DM model (data pre-processing system, data extraction and denormalisation, clustering, classification) and Chapters 12–13 presenting the implementation, results and conclusion.]
Chapter 11 provides the proposed model for the classification process which does the
dimensionality reduction and provides the architectural, algorithmic and mathematical
models required. The purpose of the chapter is to give a holistic overview of the sub-
processes required, namely: feature vector creation, chromosome mapping, feature
extraction via a GA, the calculation of the reduced set via a classifier and the persistence
of the reduced dataset.
1.5.3 Results and Conclusion
Chapter 12 discusses the results of the implementation described in Chapters 8 to 11.
The purpose of the chapter is to provide insight into the performance and accuracy of
the CI-based methods used in the prototype, as well as to assess the overall feasibility of
maintaining an up-to-date mart of pre-processed data for the datasets. By way of
conclusion, Chapter 13 summarises the dissertation, answers the research questions
posed and discusses future work.
1.6 Conclusion
Chapter 1 introduced the purpose of the dissertation by providing a summarised
background and motivation for the study. The chapter also formulated the key research
questions that will direct the research in the dissertation.
The chapter provided an overview of the structure of the dissertation by discussing the
relevance of each chapter to follow. The goal of this dissertation is to propose a data
mining pre-processing system which will maintain a mart of pre-processed data for the
selected dataset.
Chapter 2 provides an insight into what data organisations require in order to function
as well as a brief overview of the history of database technology as it has evolved over
the past half-century.
Chapter 2
Data storage evolution
2.1 Introduction
With the advent of inexpensive multi-gigabyte storage devices and broadband network
connectivity options, organisations, businesses and individuals are collecting data at an
alarming rate. The sheer volume of data has become so large that it is practically
impossible to visualise the available data, let alone analyse it.
Human-generated data (i.e. documents, letters and records) constitutes only a small
fraction of the overall body of data. Electronic systems such as online searches and
banking transactions are responsible for most of the data. This data is accumulated in
the hope that it might be of use at a later stage, but according to Witten there is an
ever-growing gap between data and the understanding thereof [Wit05].
There is an estimated 473 billion gigabytes of data on the Internet (correct at time of
publishing) [Wra09]. If this data were bound into books and the books stacked one
upon another, the stack would stretch from Earth to Pluto ten times. To put this into
perspective, the distance between Earth and Pluto (depending on where Earth finds
itself on its elliptical orbit) is anywhere between 4.2 and 7.5 billion km [Kai08]. Clearly
the industry is faced with astronomical amounts of data, with no tangible means of
analysing or using it all.
2.2 Background and context
Looking at any profit-based business, one can conclude that the key driving factor is to
increase operating revenue and profit. The only viable method of achieving this goal is
to have information relating to the operation of the business readily available.
Organisational data constitutes raw facts and is often uncorrelated and difficult to
understand or read. Information is the result of processing these raw facts, thus
producing data that has meaning and can be understood.
Most companies and organisations are well equipped to collect and store vast quantities
of data relating to the daily operation of their businesses. The difficulty lies in the
analysis of this data and the conversion of the masses of data into useful information,
because this is pivotal to the decision support structures of the entity.
The first question to ask in this regard is: “Where and how is the data stored?” In order
to understand the Information Systems of today, it is necessary to examine the
Information Systems of yesteryear in more detail. The origins of Decision Support
Systems (DSS) – the name given to systems that aid business decisions within an
organisation – and data warehousing date back to the early 1960s, when Information
Technology was still in its infancy [Inm05].
Gupta states that a remarkable amount of organisations’ business data still resides on
legacy mainframe infrastructure. The main reason for this is that these systems provide
the core day-to-day operation of the business and that the risk of moving the data over
to modern infrastructure simply outweighs the perceived gain [Gup10].
In the 1960s the computing environment consisted primarily of standalone programs
that made use of a so-called master file. This master file was a flat file stored on
magnetic tape and accessed by a program created using punch cards. The data on the
tape-based storage medium had to be accessed sequentially, meaning it had to be read
from beginning to end to access data [Inm05].
Around 1965, the sheer number of master files in use by a typical organisation, and the
amount of tape storage required to store these files, became impractical. Access times of
20 to 30 minutes per tape were not uncommon [Inm05].
The hardware requirements that were needed to house and access the tape media, the
slow access speed of the medium, the inherent complexities of creating and maintaining
programs and the creation of redundant data became significant problems of this era
and alternative solutions were sought to improve the manageability of data [Inm05].
The systems of this era and those following it were mainly dubbed Management
Information Systems (MIS), as these were systems that provided managers with reports
relating to the accounting and transactional data of the organisation [Pow03, Lau10,
Uma13].
The 1970s saw the advent of magnetic disk storage, on which files could be accessed by
record number – meaning that it was no longer necessary to access data sequentially.
The user could access a specific record of data quickly and easily by “jumping” to the
address of the record on the disk. This reduced the access time of data from several
minutes to a few seconds. Programmers also created the first Relational Database
Management System (RDBMS) during this time. The essence of this piece of software
was that it could manage the storage of data on the disk medium via indexes [Inm05,
Rob12, Gra12].
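The shift from sequential tape access to direct record addressing can be sketched in Java (the language later used for the prototype). This is an illustrative sketch, not code from any system of the era; the fixed record size and file layout are assumptions made for the example:

```java
import java.io.IOException;
import java.io.RandomAccessFile;

public class DirectRecordAccess {
    // Illustrative fixed-width record size in bytes (an assumption for this sketch).
    static final int RECORD_SIZE = 32;

    // Reads record n by "jumping" straight to its byte address on disk,
    // instead of scanning from the beginning as tape storage required.
    static byte[] readRecord(RandomAccessFile file, long n) throws IOException {
        byte[] record = new byte[RECORD_SIZE];
        file.seek(n * RECORD_SIZE); // direct positioning: address = record number * record size
        file.readFully(record);
        return record;
    }
}
```

An index, in this view, is essentially a structure that maps key values to such record addresses, which is what allowed the early RDBMS software to manage disk storage efficiently.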
The majority of Information Systems of this time made use of IBM's mainframe
architecture. Most programs were written in the Common Business-Oriented Language
(COBOL) programming language, a widely used programming language at the time,
while IBM’s DB2 was used to maintain data [Gup10]. The name given to IBM
mainframes from that era was the System/370™. This system built on the successes of
the System/360™ architecture, which was used during the 1960s by several
organisations; and this was also the architecture used by NASA’s Mission Control during
the 1969 Apollo 11 landing [IBM07].
The advent of the Database Management System (DBMS) resulted in the first real
database. The DBMS was able to provide a single, consolidated source of data for all
processing by applications – which no longer required the use of master files [Inm05].
Another move forward was the creation of the Online Transaction Processing (OLTP)
system. This system provided advanced database features such as concurrency control
(multiple access to the same data by various sources) and multi-threading
(independently controlled sections of instructions that can potentially be executed in
parallel). The OLTP system’s main function was to provide enhanced transaction
processing capabilities for organisations [Har08].
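The two features mentioned above can be illustrated with a minimal Java sketch. The class and method names are hypothetical and demonstrate only the concepts, not the workings of any particular OLTP product:

```java
public class Account {
    private int balance = 0;

    // Concurrency control: 'synchronized' serialises access to the shared
    // balance, so two simultaneous transactions cannot interleave their
    // read-modify-write steps and lose an update.
    synchronized void deposit(int amount) {
        balance += amount;
    }

    synchronized int balance() {
        return balance;
    }
}
```

Running two threads that each deposit into the same Account demonstrates multi-threading: both instruction streams execute independently (potentially in parallel), while the lock guarantees that every deposit is recorded.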
The 1980s saw the introduction of the first Personal Computer (PC), which in terms of
data management allowed the end user to manage and own data electronically. By this
time, organisations had created massive OLTP systems that managed large quantities of
data. The need quickly arose for some means of extracting specific information from
these systems, and so programmers created the so-called extract programs. These
programs would search the database for data that matched criteria provided by the
user. This data would then be moved to another source, such as a client-side database
or file, allowing users to retrieve the relevant data they required [Inm05].
As shown in Figure 2.1, the child databases represent copies of relevant data from the
parent databases, while the arrows show the flow of data through the extraction
programs.
Figure 2.1: Visualisation of extract programs creating child databases [Inm05]
This duplication of data to other sources is fundamental to a data warehouse, and will
be discussed later. It is worth mentioning that these systems work on the principle of
data residing in a flat file or table format. This implies that no dimensional relation
exists between data. The extraction programs have to search for relevant links in order
to obtain useful information from the system.
The most significant progress relating to databases and data warehousing was however
realised in the last decade of the previous century. It was during this time that the
foundation of modern database systems was laid, the most prominent being:
the underlying models of data,
the structural environment of databases,
the way data was viewed, and
metadata (data about data) [Inm05].
The next section will discuss the relational database model that revolutionised the way
in which data was modelled, viewed and processed.
2.2.1 The Relational Data Model
The most defining change in the modelling, processing and viewing of data was the
move to the relational data model. A cornerstone concept of the relational model was to
view data as entities. In 1976, Peter Chen introduced the concept of an Entity-
Relationship Model (ER Model) which models entities, their attributes and the
relationships they participate in. Chen was describing the concepts formulated by Codd
in an earlier publication (which will be discussed later) regarding the relational model
of data [Hal08].
Since its inception, many different notations have been used to depict the relationships
that exist between entities. However, the basic semantics remain the same and a simple
ER Model will be used to illustrate how data is stored in databases.
The classical example of students who attend classes offered by lecturers will be used.
Students attend individual classes that are taught by lecturers, who may have one or
more classes to present. Likewise, each student has one or more courses consisting of
classes, which that student has chosen. This is a typical definition of a university
domain.
It can be deduced that the following entities will be required: student, lecturer, class
and course. They have specific attributes and are related in some way. An example of
these types of relationships will be depicted in Chapter 3 by means of an Entity-
Relationship Diagram (ERD).
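By way of a sketch, the entities of this university domain might be declared as relational tables (the table and column names here are illustrative only; the actual ERD appears in Chapter 3):

```python
import sqlite3

# In-memory database holding the four entities and their relationships.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE lecturer (lecturer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE course   (course_id   INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE class    (class_id    INTEGER PRIMARY KEY,
                       course_id   INTEGER REFERENCES course(course_id),
                       lecturer_id INTEGER REFERENCES lecturer(lecturer_id));
CREATE TABLE student  (student_id  INTEGER PRIMARY KEY, name TEXT);
-- a student attends many classes and a class has many students:
CREATE TABLE attends  (student_id  INTEGER REFERENCES student(student_id),
                       class_id    INTEGER REFERENCES class(class_id));
""")
tables = [r[0] for r in con.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(tables)  # ['attends', 'class', 'course', 'lecturer', 'student']
```

Note how the many-to-many relationship between students and classes already requires a fifth, linking table; this point is taken up again under cardinality in Section 3.2.3.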
2.2.2 Segregated environment
The data storage mechanisms mentioned thus far primarily contained the operational
data of organisations (transactional data). One refers to them as operational databases
and performance is fundamental to their design.
They are however not well suited for providing useful and correlated information to
end users. Programmers developed On-Line Analytical Processing (OLAP) tools to
address this limitation. OLAP tools are specifically designed to analyse data in ways that
running SQL queries on relational models cannot. OLAP is a tool suite that allows data to
be modelled and analysed in the same way the user would think about it. Data is not
broken up into different tables; but dimensions of data are grouped [Har08]. However,
these tools tend to be performance intensive and it is therefore not always a viable
option to analyse the operational data of the organisation. Chapter 4 presents an in-
depth discussion of this.
In terms of database architecture, there was also a shift from mainframe driven
architectures to client-server based approaches, which made the use of multiple PCs for
data management within the organisation more attractive [Har08].
2.2.2.1 Database clustering
A database cluster is a collection of hardware and software, collectively referred to
as servers, that works together as a single system. Although the physical hardware
may reside in various locations, the system serves client requests for data as if it
were a single physical hardware-based system. The building
blocks of a cluster are nodes, an interconnected medium and a shared storage medium.
Each node is a standalone system, with its own processor, memory and operating
system. All the nodes connect to and share the data provided by the shared storage
[Doh03].
There are two distinct types of database clusters, namely:
i. Shared nothing clusters
This architecture allows database files to be segregated and distributed on each node.
Thus, the database as a whole is broken down and each node is responsible for only a
small sub-section of the data. The obvious disadvantage of this is that if one or more
nodes fail, only a sub-set of the entire database will be available [Doh03].
Figure 2.2: Visualisation of a shared nothing cluster [Doh03]
ii. Shared disk clusters
This architecture allows database files to be logically shared among a collection of
interconnected nodes, where each node has access to all data. As seen in Figure 2.3 this
can be accomplished by physically linking the database hardware in a network and
allowing servers access to all these data sources. Alternatively an operating system
layer can be used to abstract the viewing of all data sources. The obvious advantage is
high fault tolerance, as each node has access to all data [Doh03].
The modern database environment has shifted from a DBMS (which managed single
records on a disk) to a segregated environment. Figure 2.4 shows the segregated
database environment consisting of four main components: The operational database,
the data warehouse, the data mart and the individual layer (or temporary database as
depicted in Figure 2.3).
Figure 2.3: Visualisation of a shared disk cluster [Doh03]
Figure 2.4: The segregated database environment [Inm05]
As seen in Figure 2.4, there is a linear order to the flow of data between the
components, owing to the nature of the data being stored after it has been processed.
The arrows indicate that data is derived from one type of database in order to store it
in one more suited to its purpose. The data mart and the temporary database are
therefore smaller subsets of the operational database and the data warehouse. Each of
these components has specific characteristics governing what type of data can be stored
in order to address a specific need.
The operational level, as mentioned, contains the operational database, which has to be
a high performance system that allows transaction processing for many users. Typically,
the database stores primitive application data (data at the working level, such as a
client’s current account status).
The data warehouse level mostly contains “integrated, historical primitive data that
cannot be updated”. Because this data is time-based, it can reveal trends and patterns
(for example, a client’s payment and purchase history over a specified period, which
might be used to produce a credit rating) [Inm05].
The data mart level contains data that is specific to a department within the
organisation and is tailored to their needs (for example, the Product department might
want to determine which products have the highest sales over a specified period in
order to determine which are successful and which are not).
There exists a fourth layer as well, namely the individual level, which contains
temporary, ad hoc data that end users require to perform an analysis on the data. For
example, the end user might want to know what trends he/she can foresee in the
different types of products that produce the most sales.
2.2.3 Viewing of Data
As mentioned previously, the segregated environment allows transactional data to be
kept separate from analytical data. The fundamental difference between the two is the
workload that they support. Operational databases (that process transactions) make
use of the classical OLTP system and handle the organisation’s core data. The sales
division of a department store will for example keep client account details and a
telecommunications company will need to keep track of client bill records and call data
records. In order to make sense of the data, it has to be compared and analysed
[Wre07].
Thus, business data needs to be organised in such a manner as to make analysis easy
and for this OLAP is used. It stores data in optimised structures commonly following a
multidimensional approach. Data is related in many ways, for example, sales are linked
to a customer and to a product. In OLAP, a structure like this will be compared across
each dimension. These complex structures will be explained in Chapter 4 [Sch10].
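The multidimensional idea can be illustrated with a toy aggregation over two dimensions (the sales facts below are invented; real OLAP engines precompute such roll-ups over cube structures):

```python
from collections import defaultdict

# Toy sales facts: each fact relates a measure (amount) to two dimensions.
sales = [
    {"customer": "Acme", "product": "Widget", "amount": 100},
    {"customer": "Acme", "product": "Gadget", "amount": 50},
    {"customer": "Bolt", "product": "Widget", "amount": 70},
]

def roll_up(facts, dimension):
    """Aggregate the amount measure along one chosen dimension."""
    totals = defaultdict(int)
    for fact in facts:
        totals[fact[dimension]] += fact["amount"]
    return dict(totals)

# The same facts viewed along either dimension:
print(roll_up(sales, "customer"))  # {'Acme': 150, 'Bolt': 70}
print(roll_up(sales, "product"))   # {'Widget': 170, 'Gadget': 50}
```

The point is that the data is not broken up into separate tables per view: one set of facts supports analysis along each dimension.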
2.3 Conclusion
Database systems have evolved considerably since their inception almost half a century
ago. Data remains a core asset to any organisation, business or individual. Without data,
the IT industry would not exist, nor would any other industry be able to operate
effectively.
Many technical and architectural refinements have been made over the past decades;
as a result, large amounts of data can be managed effectively by computing
systems.
The fundamental problem faced by organisations is the sheer volume of data that they
generate and house. They need all this data in order to make strategic decisions, but
there is simply too much to make sense of it. Thus, the analytical components were
moved to OLAP systems that represent data in multidimensional structures, making the
analysis of data easier.
Chapter 3
Database models
3.1 Introduction
Chapter 2 introduced the data storage techniques employed in some of the very first
information systems. These techniques mostly involved flat files that were tightly
coupled with a specific programming code to extract the required data.
In order to fully come to terms with data warehouses, data marts and other modern
storage facilities used within the data mining environment, some core database aspects
must be examined. This chapter provides an overview of database models and reveals
some of the problems that still face data storage today.
3.2 Database models
A relational database has to abide by the Information Principle. The Information
Principle, or, more descriptively, the Principle of Uniform Representation, states that
all information within the relational database is represented as, and only as, relations
[Dat08, Dat12].
To truly understand what a relation is and why it is needed, an examination of the
history of database modelling is required. A database model can be regarded as the
blueprint for a database. It defines the structure and organisation of data within the
database [Pow06].
3.2.1 The file system models
This section provides a summary of four file system models, namely master files, the
hierarchical model, the network model and spreadsheets.
i. Master files
In Chapter 2 it was stated that the earliest form of data storage was the so-called master
or flat file. The term flat file means that all data is kept within a simple file that has no
topological structure. These are often text files, but they need not be. Earlier
implementations used binary files, which took less storage space [Pow06].
The difference between binary and text data is essentially only the way in which it is
encoded, i.e. the format used to create the file. A text file is a file that is usually encoded
with the American Standard Code for Information Interchange (ASCII) format. ASCII
assigns 7-bit (binary digit) numeric values to characters, including the common
alphabet, punctuation marks and numbers. The character ‘a’, for example, is
represented by the numeric value 97, which has the binary value of 1100001. Other text
file encodings include UTF-8 and on older mainframes the Extended Binary Coded
Decimal Interchange Code (EBCDIC) format was used [Mic02]. Thus a text file, which
can be viewed as plain text on a computer, is not encoded as the text itself but as
mapped numeric values.
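This mapping can be verified directly in Python:

```python
# The character 'a' maps to the numeric value 97, whose 7-bit binary
# representation is 1100001:
code = ord("a")
print(code)                          # 97
print(format(code, "07b"))           # 1100001

# A text file is stored as such numeric values, not as the glyphs:
print(list("abc".encode("ascii")))   # [97, 98, 99]
```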
A binary file uses full bytes for each ‘unit’ of information, as opposed to the 7-bit
‘unit’ of ASCII. Each program typically created its own binary file format. These files
were often unique to that
particular program. When text editors view files of this nature, they attempt to map the
numeric values to characters (by using a default character encoding), often ending up
displaying a multitude of illegible system characters. Binary files also often contain
header information, which defines the data found in the file [Mic02]. Powell states that
this kind of database is often called a file system database, as it consists of many files
that are stored on disk drives. Any topology between data in the files (i.e. relations) has
to be searched for programmatically by the applications that use them [Pow06].
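The distinction between a structured binary record and its text rendering can be sketched with Python’s `struct` module (the record layout and header below are hypothetical, not the format of any real program):

```python
import struct

# Hypothetical binary record format: a 4-byte header holding the record
# count, then per record a 2-byte id and an 8-byte fixed-width name field.
RECORD = "<H8s"
header = struct.pack("<I", 1)                       # header: one record
record = struct.pack(RECORD, 1, b"JoeSmith")
blob = header + record                              # 14 bytes in total

# A text editor would render these bytes as illegible characters; a
# program that knows the layout can decode them exactly:
(count,) = struct.unpack("<I", blob[:4])
rec_id, name = struct.unpack(RECORD, blob[4:])
print(count, rec_id, name.decode("ascii"))          # 1 1 JoeSmith
```

Only an application that knows the layout (the role played by the header information mentioned above) can interpret such a file correctly.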
Table 3.1: Example flat file showing login times of employees
Record  Name           Gender  Department  Last Login
0001    Joe Smith      Male    Finances    March 12, 2010 08:30:45
0002    Susan Milian   Female  HR          March 12, 2010 08:31:33
0001    Joe Smith      Male    Finances    March 13, 2010 07:33:41
0001    Joe Smith      Male    Finances    March 13, 2010 12:45:35
0002    Susan Millian  Female  HR          March 14, 2010 09:30:32
In Table 3.1 a sample flat file is presented and it is clear that the file contains data that
has been duplicated.
Authors refer to this phenomenon as Data Redundancy and it will be discussed in
Section 3.2.2 [Rob07].
ii. Hierarchical model
The hierarchical model is a tree-like structure which contains logically inclusive data.
The tree structure is inverted with the root node at the top, and child nodes branching
from it. It is important to note that each child node can have only one parent node, but
each parent node may have multiple child nodes. The one-to-many relationship between
parents and children is introduced here [Pow06, Ste09, Dat12].
Figure 3.1: Example hierarchy of organisation [Pow06].
A typical example of a hierarchy is the organisational organogram shown in Figure 3.1.
In computing, the file structure on disk drives is another example of a hierarchy. It
typically contains directories and folders that logically group related files
together. Another good example is the Microsoft Windows Registry, which is composed
of so-called hives of data that store system and application related data in a hierarchical
structure [Ste09].
Halpin states that data stored in child nodes are groupings known as records. A record
is a collection of fields (or facts) that group an entity together. Figure 3.2 depicts a
typical record structure containing the data types for each field of the employee
information shown in Table 3.1 [Hal08].
iii. Network Model
The network model is essentially a hierarchical data model with one exception to the
parent-child rule: the parent–child relationship also allows for many-to-many
relationships. This implies that a parent may have multiple child nodes and a child
multiple parent nodes [Pow06].
Figure 3.2: Example record structure of an employee [Hal08]
According to Stephens, the network model can also be seen as a directed graph as the
existence of the many-to-many relationship allows for complex structures to be created.
Network database models are rarely used today and a more natural use for this data
model is internal memory representations for applications which examine problems
relating to graphs. One such example is the Travelling Salesperson Problem
(TSP), which aims to find the shortest path in a graph connecting towns as seen in
Figure 3.3 [Ste09].
Figure 3.3: Example graph of cities showing distances [Ste09]
[Figure 3.2 content: record fields integer pers_nr, string emp_name, char gender,
string dept, timestamp login_time; example row: 0001, Joe Smith, M, Finances,
12/03/2010 08:30:34]
[Figure 3.3 content: five city nodes A–E connected by edges with distances between
9 and 16]
iv. Spreadsheets
Although spreadsheets are not databases per se, they do form a fundamental part of the
background of computer data storage and manipulation. A spreadsheet contains tabular
data in the form of rows and columns, a similar format to a database table, but stored as
a binary or ASCII flat file. This file is often application specific. Spreadsheet software
allows the end user to manipulate the data by creating formulas and charts, and to
export the data to other formats. It also supports analysis by means of statistical
functions and solution-finding tools [Ste09].
Spreadsheets are a common tool of choice for financially oriented users. Before the
late 1970s, people had to rely on mainframe computers and hand-held calculators to
perform financial calculations. The need for a single tool to perform these
calculations drove the spreadsheet’s creation. In 1978, Dan Bricklin and Bob
Frankston created the first electronic spreadsheet, named VisiCalc. Many of their
original ideas, such as the row-and-column layout and the syntax used to build
formulas, formed the basis of modern spreadsheets [Wal07].
A user can still download the original VisiCalc and, after more than 30 years, run it
in an MS-DOS window on a modern Microsoft® operating system such as Windows XP. Other
major spreadsheets of note are Lotus 1-2-3, Quattro Pro and Microsoft® Excel. The latter
product, Excel, was designed for the Macintosh and was released in 1985 as a successor
to MultiPlan – Microsoft’s first spreadsheet that was launched in 1982 on the MS-DOS
platform [Wal07].
3.2.2 Disadvantages of Traditional File System Models
The data models described to this point rely on the traditional file system paradigm,
where groups of separate files are managed independently, regardless of the data
model employed [Gup09].
The most notable disadvantages of traditional file systems are [Gup09, Bur10]:
i. Data Redundancy – This arises when data has to be replicated across separate
files to allow the applications using them to function properly. It increases
storage requirements and demands more complex programming to keep all relevant files
up to date. Customer details, for example, might need to be saved in the customer
file as well as the sales file.
ii. Data Inconsistency – Data redundancy inevitably leads to data inconsistency,
which occurs when data relating to a specific entity is not updated in a consistent
manner. One file then holds data that conflicts with another. Over time, this can
lead to serious degradation of the quality of data.
iii. Data Integration – When queries span data residing in multiple files, complex
programming is needed to extract and compile the required information.
iv. Data and Program Dependence – The physical location of the files that contain
data needs to be programmatically assigned, thus the application is dependent
on its data source. The files are also program dependent in the sense that the
application relies on the structure and organisation of data within a file.
v. Data Control and Sharing – Since multiple files contain data about the same
elements, and these files reside in different places for each application’s use, the
system is decentralised. Due to this decentralised nature, the sharing of data
between different applications once again requires the creation of specialised
programs which compile the data required.
This list is by no means complete, but only serves to show the problems that needed to
be solved by future models.
3.2.3 The Relational Model
i. The father of relational databases
It was the pioneering work of E.F. Codd, an IBM researcher, that in 1970 successfully
addressed the issues posed by traditional file system databases. The original idea
proposed by Codd in his revolutionary paper “A Relational Model of Data for Large
Shared Data Banks” was to build smaller repositories of data from a larger set, and link
them via relations. He proposed that subset data be stored in a relation, with fields and
entities [Pow06].
Codd posed 12 rules by which a relational database must abide. As stated in the
opening paragraphs of this chapter, the Information Principle is in fact Codd’s first
rule (also restated below).
Davidson et al. present a practical overview of Codd’s 12 rules as implemented by
modern RDBMS (Relational Database Management System) vendors, the major players being
Microsoft® SQL Server and Oracle 11g. The Information Principle and the Guaranteed
Access Rule are regarded as the fundamental rules governing modern RDBMSs. These two
concepts are discussed next [Dav08, Dat12].
a. Information Principle – All data stored within a relational database must be
represented in the same way, namely as values in the columns of tables. Data
may only be accessed by referring to the column name of the table in which it
resides.
b. Guaranteed Access Rule – Any atomic value of data will be accessible by
referring to the table name in which it resides, the primary key of its record entry
and then the column name. This implies that there is no order on either the
columns or rows of a table. Columns can be shifted around within a table and not
affect how it is retrieved. Rows are uniquely identified through their primary
keys and their order within the table is therefore arbitrary.
ii. Database relational theory
Before an overview is given of how modern relational databases work, the fundamentals
of database relational theory need to be covered.
a. Heading
A heading is a collection of attributes, where an attribute consists of a name, A, and a
data type, T. The rules governing a heading are:
Each attribute in the heading has a distinct name A.
Each attribute is associated with exactly one data type T.
The number of attributes in the heading is referred to as the degree of the
heading [Dat08].
An example heading for a table storing a student’s data might be:
{<ID, int>; <Name, string>; <Surname, string>; <CourseID, int>}
b. Tuple
A tuple can be defined as an ordered collection of elements. When the tuple has exactly
n elements it is referred to as an n-tuple, and mathematically it can be represented as
(x₁, …, xₙ). Continuing the earlier example of the student table, an n-tuple of length 4
is, for example: (1, John, Doe, 22)
Let s₁ and s₂ be sets. The Cartesian product of s₁ and s₂ is defined as:
s₁ × s₂ = {(x, y) | x ∈ s₁ and y ∈ s₂}
Hence the Cartesian product is the set of all ordered n-tuples where each element comes
from the corresponding set [Dat08]. Thus, for the collection of sets s₁, …, sₙ:
s₁ × … × sₙ = {(x₁, …, xₙ) | for all i in (1 ≤ i ≤ n), xᵢ ∈ sᵢ}
c. Relation
A relation R exists if it is a subset of the Cartesian product s₁ × … × sₙ; R is hence
an n-ary relation over s₁ × … × sₙ. Where n = 1 it is referred to as a unary relation,
n = 2 as binary and n = 3 as a ternary relation [Dat08].
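These definitions can be made concrete with small finite sets (the values are illustrative):

```python
from itertools import product

s1 = {1, 2}
s2 = {"John", "Jane"}

# The Cartesian product s1 x s2: every ordered pair (x, y)
# with x from s1 and y from s2.
cartesian = set(product(s1, s2))
print(len(cartesian))  # 4

# A relation R is any subset of that product, e.g. "student 1 is John,
# student 2 is Jane":
R = {(1, "John"), (2, "Jane")}
print(R <= cartesian)  # True: R is a binary relation over s1 x s2
```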
iii. Modern relational databases
Now that a fundamental understanding of relations has been established, its application
within modern relational databases can be examined.
Table 3.2 compares important modern database terminology at the design and
implementation level to the original terminology. Reference will be made to modern
terminology within this section [Dav08].
Table 3.2: Naming conventions
Relational Theory   Design Model              Implementation Model
Relation            Entity                    Table
Tuple               Instance                  Row
Heading1            (Collection of) Columns   (Collection of) Columns
Heading entry       Attribute                 Field
In light of the naming conventions introduced in Table 3.2, the remainder of this
dissertation will make use of the Implementation Model convention.
d. Candidate Keys
It has been established that a table may not have duplicate rows. The way to implement
this within a database is to ensure that each table defines a candidate key as one of its
attributes [Dat08, Dav08, Dat12].
A candidate key is an attribute (or combination of attributes) that can uniquely and
unambiguously identify each instance of an entity. To enforce the Guaranteed Access
Rule a candidate key is needed to avoid duplicate instances, and to correctly locate an
instance within the database, as the ordering of elements in a table is arbitrary. There is
no upper limit to the number of keys that a specific table element may need in order to
maintain its uniqueness [Dav08].
e. Primary Keys
The primary key is the candidate key which is chosen to uniquely identify each row. As
mentioned, a candidate key may be a combination of attributes that uniquely identifies
the row. Any candidate key chosen beyond the primary key is referred to as an alternate
key.
1 Heading information for modern databases is found within the System Catalog.
f. Foreign Keys
Foreign keys are used within a relational database to ensure that:
1) duplicate values are not present within tables; and
2) links exist between entities.
A foreign key is a candidate key of an entity which establishes a link between that
entity and one or more other entities. This ensures that when one entity requires data
from another, it only requires the key to access this information, eliminating
duplication. The foreign key most commonly links to the primary key of another entity,
but it may, alternatively, link to an alternate key, in which case it is called an
optional foreign key [Dav08].
It is worth mentioning that according to Codd’s 10th rule, the Integrity Independence
Rule, every non-optional foreign key must be accompanied by an existing primary key in
the entity that it is related to. This means that the foreign key cannot enforce referential
integrity if the primary key it relates to does not yet exist. The row, once linked, may
not be deleted without deleting all the rows referencing the primary key first [Dav08].
Furthermore, modern relational databases may not require the creation of a primary
key for the data within a table, thus not explicitly enforcing Codd’s rules. This is
useful in situations where data needs to be imported from external sources, such as a
text file or a spreadsheet file, that do not contain any topological structure. The
raw data can be loaded into a database table and manipulated there in order to
transfer it to a relational table; the DBMS will treat the entire row as a primary key
for the imported data [Dav08].
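Primary keys, foreign keys and referential integrity can be demonstrated with an in-memory SQLite database (a sketch with illustrative table names; note that SQLite enforces foreign keys only when the pragma is enabled):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only on request
con.executescript("""
CREATE TABLE job_description (job_desc_id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE personnel (pers_nr INTEGER PRIMARY KEY, name TEXT,
                        job_desc_id INTEGER NOT NULL
                            REFERENCES job_description(job_desc_id));
INSERT INTO job_description VALUES (1, 'Accountant');
INSERT INTO personnel VALUES (1, 'Joe Smith', 1);
""")

# A non-optional foreign key must point at an existing primary key:
try:
    con.execute("INSERT INTO personnel VALUES (2, 'Jane Doe', 99)")
except sqlite3.IntegrityError as e:
    print("insert rejected:", e)

# ...and a referenced row cannot be deleted while rows still reference it:
try:
    con.execute("DELETE FROM job_description WHERE job_desc_id = 1")
except sqlite3.IntegrityError as e:
    print("delete rejected:", e)
```

Both statements fail with a foreign-key constraint violation, which is exactly the behaviour required of a non-optional foreign key as described above.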
g. Cardinality
The use of foreign keys within a database environment introduces another concept,
namely cardinality. Cardinality simply indicates how many instances of one
entity relate to another [Dav08].
i. One-to-Many (1:M) – These are the most common relations found within a database
environment: one instance of an entity may link to any number of instances of
another entity. An example is the use of one company vehicle by many employees
[Rob07, Red12].
ii. One-to-Exactly-N (1:N) – Less commonly used, these rely on constraints being
set up, such as one employee having exactly 3 email addresses in the company.
The most common special case is the one-to-one relation, which relates exactly one
instance of an entity to only one other instance of an entity [Rob07, Red12].
iii. Many-to-Many (M:N) – These are real-life relations, which link more than one
instance of an entity to many other instances. By way of example: in an organisation,
employees have many job skills, and these job skills can in turn be learnt by many
employees. Such relations cannot be implemented directly in a relational data model;
they require the use of a bridging table. The basic idea is that instead of the
candidate keys (a foreign key in one table linking to the primary key of another)
linking the tables directly, both are placed in a new table, which contains these
candidate keys and sometimes additional summary information [Rob07, Dav08, Red12].
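The employee/skill example can be sketched as a bridging table in SQLite (the names and data are illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE employee (emp_id   INTEGER PRIMARY KEY, name  TEXT);
CREATE TABLE skill    (skill_id INTEGER PRIMARY KEY, label TEXT);
-- The bridging table holds both candidate keys (plus, optionally,
-- summary columns such as a proficiency level):
CREATE TABLE employee_skill (
    emp_id   INTEGER REFERENCES employee(emp_id),
    skill_id INTEGER REFERENCES skill(skill_id),
    PRIMARY KEY (emp_id, skill_id));
INSERT INTO employee VALUES (1, 'Joe'), (2, 'Susan');
INSERT INTO skill    VALUES (1, 'SQL'), (2, 'COBOL');
INSERT INTO employee_skill VALUES (1, 1), (1, 2), (2, 1);
""")

# Which employees have the SQL skill?
rows = con.execute("""
    SELECT e.name FROM employee e
    JOIN employee_skill es ON es.emp_id = e.emp_id
    JOIN skill s ON s.skill_id = es.skill_id
    WHERE s.label = 'SQL' ORDER BY e.name""").fetchall()
print([r[0] for r in rows])  # ['Joe', 'Susan']
```

The M:N relationship is thus decomposed into two 1:M relationships, each anchored on the bridging table.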
The process of normalisation eliminates duplicate entries in data and establishes the
correct relations between the different entities. This process will specifically separate
the data into the correct number of tables, create the correct bridging tables and
introduce the required candidate keys [Rob07, Red12]. Chen’s Entity-Relationship (ER)
model allows the modelling of entities and the relationship which they have with each
other. An entity is defined as something that is capable of existing on its own within a
system and could be a real-world physical object such as a person. An entity can be
thought of as a noun [Hal08].
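A minimal sketch of what normalisation achieves, applied to the flat file of Table 3.1 (the structure is a deliberate simplification): the repeated employee details move into one table, and each login event becomes a row that references the employee by key.

```python
# Rows of the flat file: (record, name, gender, department, last login).
flat = [
    ("0001", "Joe Smith",    "Male",   "Finances", "2010-03-12 08:30:45"),
    ("0002", "Susan Milian", "Female", "HR",       "2010-03-12 08:31:33"),
    ("0001", "Joe Smith",    "Male",   "Finances", "2010-03-13 07:33:41"),
]

employees, logins = {}, []
for emp_id, name, gender, dept, login in flat:
    employees[emp_id] = (name, gender, dept)   # stored once per employee
    logins.append((emp_id, login))             # foreign key + event only

print(len(employees), len(logins))  # 2 3
```

The redundancy disappears: two employee rows instead of three duplicated ones, with the login events related to them by key.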
Figure 3.4 shows the use of a foreign key. The employee Joe Smith is linked to his job
description (JD) by referring to the JD primary key in the personnel table. (Job_Desc_ID
is the primary key of the job description table, and is used as a foreign key in the
personnel table.)
3.2.4 Object-oriented model
The fundamental idea behind a computer object is to model the real-world environment.
In the object-oriented (OO) data model, an object represents an entity, much as a
relational entity does, and contains the data required to describe it. However, it
also contains information about the object’s relationships to other objects.
Figure 3.4: Example use of foreign key
The data contained in the object is given semantics (a description of its data, or
metadata); hence this model is commonly known as a semantic model of data [Rob07].
A computer object is said to be an encapsulation of both its data and the operations that
can be executed on that data. This has been extended to object-oriented databases,
therefore an object in the database also contains valid operations which can be executed
on it. These operations are called methods. This model supports the concept of
inheritance, whereby a child object may inherit attributes and methods from its parent
object, which ensures greater data integrity [Rob07].
Figure 3.6 depicts the personnel data that was presented earlier, encapsulated as an
employee object.
The first block represents its attributes, the second block its relations (the link to the job
title as shown in Figure 3.4) and the last blocks, the operations that can be performed
on the object. An inheritance structure is also depicted showing the different types of
the base object Employee.
Figure 3.5: Example of a normalised database structure
The use of this type of database is only recommended if it is used in conjunction with a
modern object-oriented programming language. The object-oriented database is also
managed by an Object-Oriented Database Management System (OODBMS) [Ste09].
Figure 3.5 depicts the employee from Figure 3.4 as a lecturer who gives courses that
many students can take.
3.2.5 Object-relational model
Object-oriented databases face a great challenge, as they require a mind shift away
from the traditional view of data, which introduces problems of its own.
Figure 3.6: Example OO data model
The idea behind the object-relational model is to combine the best traits of both data
models: features of the object-oriented data model are incorporated into the
object-relational model, while the familiar relational structure is still maintained.
An Object-Relational Database Management System (O/RDBMS) manages the object-
relational database [Rob07, Red12].
A programming technique known as object-relational mapping (ORM) is
gaining ground in modern systems development. It is used to map between program-
specific object-based data and the relational data within a traditional relational
database. It is essentially a layer between the program code and the data in the
database. The main motivation behind this is that it simplifies and speeds up
development. It leaves complex SQL queries to be constructed by the ORM layer –
freeing up the developer from having to perform complex join operations to extract
data. Frequently used ORM frameworks in the Java programming environment include
Hibernate and iBATIS [Ste09, Hib13, Iba13].
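The idea can be sketched with a minimal, hand-rolled mapping layer (an illustrative Python example using the built-in sqlite3 module; this is not how Hibernate or iBATIS are implemented, and all table, column and function names here are assumptions mirroring the earlier personnel examples):

```python
import sqlite3

# A plain object on the program side of the ORM layer.
class Employee:
    def __init__(self, personnel_number, name, job_desc_id):
        self.personnel_number = personnel_number
        self.name = name
        self.job_desc_id = job_desc_id

def save(conn, emp):
    # The "ORM layer": it constructs the SQL so application code never has to.
    conn.execute(
        "INSERT INTO Employee (Personnel_Number, Name, Job_Desc_ID) VALUES (?, ?, ?)",
        (emp.personnel_number, emp.name, emp.job_desc_id),
    )

def find(conn, personnel_number):
    # Maps a relational row back to a program-specific object.
    row = conn.execute(
        "SELECT Personnel_Number, Name, Job_Desc_ID FROM Employee "
        "WHERE Personnel_Number = ?",
        (personnel_number,),
    ).fetchone()
    return Employee(*row) if row else None

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employee (Personnel_Number TEXT, Name TEXT, Job_Desc_ID TEXT)"
)
save(conn, Employee("0001", "Joe Smith", "001"))
print(find(conn, "0001").name)  # Joe Smith
```

The application code above manipulates only objects; the SQL lives entirely inside the mapping layer, which is the simplification ORM frameworks provide at scale.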
3.2.6 XML data model
XML (Extensible Markup Language) is a language designed specifically for representing
data. It was fundamentally designed to store hierarchical data, but it is also used to
store data in a tabular format [Ste09].
An example of the personnel data presented can be represented in XML in the following
way:
<Employee JobTitle="Accountant">
<PersonnelNumber>0001</PersonnelNumber>
<Name>Joe Smith</Name>
<Job_Desc_ID>001</Job_Desc_ID>
</Employee>
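Such a record can be consumed programmatically; for example (an illustrative Python sketch using the standard xml.etree.ElementTree parser):

```python
import xml.etree.ElementTree as ET

# The employee record from the example above, as a string.
record = """
<Employee JobTitle="Accountant">
  <PersonnelNumber>0001</PersonnelNumber>
  <Name>Joe Smith</Name>
  <Job_Desc_ID>001</Job_Desc_ID>
</Employee>
"""

emp = ET.fromstring(record)
print(emp.get("JobTitle"))    # Accountant  (an attribute)
print(emp.findtext("Name"))   # Joe Smith   (a child element)
```

The hierarchical structure maps naturally to a tree of elements, which is why XML suits both hierarchical and simple tabular data.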
XML is widely used as a protocol for transferring data between networked and
internet-based systems. Modern databases, such as Oracle 10g, incorporate support for
XML in order to store and manage “unstructured” data of this type; unstructured data is
data for which no explicit relations exist [Fel07]. Native XML databases also exist,
which store and manage only XML-based data. In the section on the modern relational
database, it was mentioned that text files and spreadsheets are used to import data into
relational databases. XML provides an alternative, whereby XML data can be imported
and managed separately or translated to relational data [Rob07].
3.2.7 Data manipulation
The language needed to access the data within the database is an important aspect of
database theory. According to Codd’s 5th rule, a language that can perform and enforce
“data definition, view definition, data manipulation (interactive and by program),
integrity constraints, and transaction boundaries (begin, commit, and rollback) is
needed” [Dav08].
The Structured Query Language (SQL) – pronounced “sequel” – consists of a Data
Definition Language (DDL) and a Data Manipulation Language (DML). A DDL is used for
operations such as defining the structure of tables and their relations to each other, for
instance Create Table or Alter Table. A DML is used for all operations on data in a
database. It has four fundamental statements, namely: Insert, Select, Update and Delete.
In database communities, these are informally referred to as CRUD – Create, Read,
Update, and Delete [Dav08, Red12, Cel12].
The following SQL statements will retrieve Employee 0001’s Name from the Employee
table and update his job title, as depicted in the previous examples:
Select Name
from Employee
where Personnel_Number = '0001';

Update Employee
Set JobTitle = 'Manager'
where Personnel_Number = '0001';
3.3 ANSI/SPARC architecture
In the 1970’s, the American National Standards Institute (ANSI) Standards Planning and
Requirements Committee (SPARC) defined in what way data should be viewed as layers
of abstraction. They essentially defined four views (or abstractions) of data, namely, the
External model, the Conceptual model, the Internal model and the Physical model as
depicted in Figure 3.7 [Rob07].
External model – This refers to how the end users of the systems see the data;
these are often business users.
Conceptual model – This is a global graphical depiction of the database
structure, and is usually an Entity Relationship Diagram (ERD).
Internal model – The conceptual model needs to map to the internal model used
by the specified DBMS. For example, an RDBMS uses relations and an OODBMS
uses objects.
Physical model – This is the lowest level, which defines how data is to be stored
on physical mediums such as disk drives. It is “a software and hardware-
dependent” layer that has to conform to the standards used by the DBMS, the
native operating system and the storage hardware provided by the system
architecture [Rob07, Gra12].
3.4 Important database concepts
This section will discuss the physical layer of the ANSI/SPARC architecture. Each sub-
section introduces a concept used within a DBMS to manage and store its data.
Figure 3.7: ANSI/SPARC architecture
3.4.1 Schema
A database schema can be seen as the first level of security. A database schema is a
grouping of related database objects (such as tables, indexes and views) which only
authorised users can access and view. An entire organisation’s database tables can be
contained in one physical database and not every user may have access to view all the
data in all the tables. It should be mentioned that the nomenclature referring to the
concepts described here is vendor-specific: some DBMSs use the term schema (for
example Oracle®) and others make use of the term users (for example MySQL and
PostgreSQL) [Rob07, Ora11].
3.4.2 Virtual Table (View)
A database view is a stored query that references a set of tables; the query is
typically a frequently used select statement. This query is stored in the database as a
view, which defines a virtual table for the data. Its output changes each time the
referenced tables’ data is updated. Native Insert, Update and Delete operations on
columns referenced by a view are not supported; only a Read operation is allowed.
However, a database trigger could be set up to handle Insert, Update and Delete
operations on a view [Dat08, Rob08, Ora11, Dat12].
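The behaviour of a view can be illustrated as follows (a Python sketch using the built-in sqlite3 module; the table and view names are hypothetical). SQLite, consistent with the description above, treats views as read-only unless an INSTEAD OF trigger is defined:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (Personnel_Number TEXT, Name TEXT, JobTitle TEXT)")
conn.execute("INSERT INTO Employee VALUES ('0001', 'Joe Smith', 'Accountant')")

# A view is a stored query defining a virtual table over Employee.
conn.execute(
    "CREATE VIEW Accountants AS "
    "SELECT Personnel_Number, Name FROM Employee WHERE JobTitle = 'Accountant'"
)
print(conn.execute("SELECT Name FROM Accountants").fetchall())  # [('Joe Smith',)]

# The view's output tracks the base table: after this update the row disappears.
conn.execute("UPDATE Employee SET JobTitle = 'Manager' WHERE Personnel_Number = '0001'")
print(conn.execute("SELECT Name FROM Accountants").fetchall())  # []
```

No data is stored in the view itself; each read re-evaluates the stored query against the current state of the referenced table.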
3.4.3 Metadata
Although the database environment is not the only place where metadata might occur, it
is of vital importance in database systems2. The usual definition of metadata – data
about data – never describes its real use or purpose, and the term often creates
confusion in the computing domain. There are three metadata classifications, namely:
Descriptive metadata – This is identification data, which describes how other
data is to be discovered and searched for. An example is a library catalogue
system, used to index and locate books.
Structural metadata – Objects are often super-objects made up of other objects.
This describes how each object is put together using the other objects.
Administrative metadata – Data that almost always needs to be managed, thus
this will provide information describing technical details, access rights, file
creation data and so forth [Gup09].
3.4.4 Data dictionary
The data dictionary stores metadata about the database and is often automatically
managed by the DBMS. The following types of metadata are often stored:
Physical database design – This describes technical details such as file names,
access paths and which storage data structures are used. It also provides a listing
of all tables found within the database.
Database schema – A complete description of the schemas that can be found in
the database is maintained, together with valid users and their associated access
rights and views [Gup09].
2 Metadata such as the schema, data dictionary and system catalogue is often stored within the RDBMS as relational tables.
3.4.5 System catalog
The system catalog can be regarded as a more detailed data dictionary. It describes all
the objects found within the database in detail, thus for a given table, it will describe the
table header, data types, column counts, creation times, and last update information to
name a few [Rob07].
As a result of the interwoven nature of the data in the data dictionary and system
catalog, modern DBMSs often produce only a consolidated system catalog. The terms
system catalog and data dictionary may be used interchangeably in modern database
terminology and both form part of the ANSI/SPARC conceptual and internal layers
[Rob07].
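As an illustration, SQLite exposes its consolidated catalog as the sqlite_master table (a minimal Python sketch; other DBMSs expose similar information through, for example, an information schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (Personnel_Number TEXT, Name TEXT)")
conn.execute("CREATE INDEX idx_name ON Employee (Name)")

# sqlite_master lists every object in the database: tables, indexes, views, etc.
for type_, name in conn.execute("SELECT type, name FROM sqlite_master"):
    print(type_, name)
```

Querying the catalog like any other table is possible precisely because the metadata is itself stored relationally, as noted in the footnote above.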
3.4.6 Indexing
The concept of indexing has been in existence for centuries: books have indexes and
page numbers, and filing systems often have tabs to help users find what they are
searching for. Indexes consist of (key, pointer) pairs and are often stored in tree data
structures [Lig07].
A B+ tree is often the data structure of choice for modern DBMS indexing, and will be
described briefly:
Non-leaf index nodes contain p tree pointers and p − 1 keys. The reason is that,
when a search is done, keys are compared by being either smaller than, equal to,
or greater than one another. Suppose a search is done for key k by comparing
k to a key kᵢ in the node: if k is smaller than kᵢ, the pointer logically
located to the left of kᵢ will be followed. Therefore, there is a pointer to the left
and right of every key that can point to a new non-leaf or leaf node, which can
reduce the search space considerably [Lig07].
Leaf index nodes differ in the sense that they consist of p (key, data pointer)
entries and a pointer to the next leaf node. Leaf nodes are ordered in such a way
that they sequentially reference each row of data, and a data pointer is more
properly called a Record Identifier (RID). Thus, if k matches a key kᵢ, the RID at
position i will be followed to access the row of data required. If none of the keys
match, it means that a top-level index routed the search to the wrong node and a
linear search of the subsequent leaf nodes will be performed to find the key. The
main reason for this ordering is, however, not fault tolerance in finding a row,
but the quicker retrieval of an ordered sequence of rows [Lig07].
As depicted in Figure 3.8, it is the intermediary non-leaf nodes that allow the data
indexes to be found quickly without any disk I/O overhead occurring. Indexing is part of
the ANSI/SPARC physical layer.
Figure 3.8: Example of an indexing tree [Red12]
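The search procedure described above can be sketched as follows (a minimal, illustrative Python structure with hand-built nodes and example keys; a real DBMS index is disk-based and far more elaborate):

```python
from bisect import bisect_right

# A non-leaf node holds p - 1 keys and p child pointers; a leaf holds
# (key, RID) entries. The tree below is built by hand for illustration.
leaf1 = {"leaf": True, "entries": [(5, "RID-5"), (8, "RID-8")]}
leaf2 = {"leaf": True, "entries": [(10, "RID-10"), (17, "RID-17")]}
leaf3 = {"leaf": True, "entries": [(20, "RID-20"), (31, "RID-31")]}
root = {"leaf": False, "keys": [10, 20], "children": [leaf1, leaf2, leaf3]}

def search(node, k):
    while not node["leaf"]:
        # Follow the pointer to the left of the first key greater than k.
        i = bisect_right(node["keys"], k)
        node = node["children"][i]
    for key, rid in node["entries"]:
        if key == k:
            return rid
    return None  # key not present in the index

print(search(root, 17))  # RID-17
```

Each comparison in a non-leaf node discards a whole subtree, which is how the intermediary nodes reduce the search space before any row of data is touched.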
3.4.7 Transactions and concurrency control
A transaction is a real-world unit of work: it is a grouping of statements that can be
executed on the database. Capturing a sale, for example, will require entries in more
than one table and might cause updates to other data in related tables. Most
operational (or, more aptly in this instance, transactional) databases perform
transactional processing.
Transaction management is subject to the ACID (atomicity, consistency, isolation,
durability) principle, whereby a transaction either completes all its tasks or none at
all. The ACID principles will be explained in more detail later in the chapter [Gup09, Wya13].
Multiple applications will try to perform transactional processing simultaneously; and
might even work on the same data at the same time [Rob07]. Possible problems that
may arise, according to Gupta, are:
Lost updates – Consider two transactions working on the same data. At the
same time, transactions T₁ and T₂ read the value of field mark-up. Transaction T₁
uses the data in some calculation and updates its value. Transaction T₂ also
processes the data and commits its changes. The update made by T₁ is lost as a
result.
Uncommitted dependency – Consider transaction T₁, which changes the value of
variable mark-up at time t₁. At time t₂, transaction T₂ reads mark-up to calculate
actual cost. However, at time t₃, transaction T₁ is rolled back as there was a
failure; thus the value read by T₂ is now invalid [Gup09].
Inconsistent analysis – Assume transaction T₁ is busy updating a set of data,
and updates entry e₁ at time t₁. At time t₂, transaction T₂ starts to perform an
aggregate calculation over the dataset updated by T₁ and uses entries e₁ and e₂.
At time t₃, transaction T₁ completes by updating e₂; thus the aggregate
calculated by T₂ is based upon inconsistent data [Rob07].
Concurrency control is the technique used to control the concurrent execution of
transactions. It makes use of locking techniques3, where one transaction is in a locked
state until another has successfully completed and all involved statements have been
committed [Gup09].
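The lost-update problem and its locking-based remedy can be sketched as follows (an illustrative Python example using threads; the field name mark-up follows the scenario above, and the values are arbitrary):

```python
import threading

mark_up = {"value": 100}
lock = threading.Lock()

def add(amount):
    # Without the lock, two threads could both read the same starting value,
    # and the later write would silently overwrite (lose) the earlier update.
    with lock:
        current = mark_up["value"]
        mark_up["value"] = current + amount

threads = [threading.Thread(target=add, args=(10,)) for _ in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(mark_up["value"])  # 1100: all 100 updates of 10 were preserved
```

The lock plays the role of the DBMS's concurrency control: each transaction's read-modify-write becomes mutually exclusive, so no update can be lost.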
For this reason, the management of all transactions is subject to the ACID (atomicity,
consistency, isolation, durability) principle. The main concepts are briefly summarised:
a) Atomicity – A single transaction can be seen as an atomic unit: either the entire
transaction is processed correctly or nothing is processed. If a failure at any level
(hardware, software, etc.) should occur, the state of the data must remain
unchanged.
3 Database deadlock can occur when one transaction is waiting for data which is still locked by another requesting transaction. For example, database session A requires table Y to operate and hence acquires a lock on it. Session B locks table Z, but also requires table Y to perform an update operation on it. Until session A releases the lock on table Y, session B will need to wait. If session A were to require table Z for an update operation, but session B has it locked and is still waiting for session A’s lock, both sessions are in deadlock, waiting for each other. All-or-nothing resource allocation should be implemented so that session B cannot acquire locks until all the resources it needs are available.
b) Consistency – At the very least, the DBMS must be able to determine whether a
transaction will violate a consistency rule4 and either decide not to process the
transaction or take some action set up by the database designer. An example is
submitting an insert statement that omits a mandatory non-null field, which
could either result in the transaction failing or in the DBMS inserting a default
value.
c) Isolation – A fundamental data access rule which ensures mutually exclusive
access to the data being modified: only the process performing the update may
access it. No other transaction may execute a read operation on data in the
process of being altered.
d) Durability – This is a guarantee by the DBMS that any transaction for which a
success message was sent to a client is indeed committed and will not be lost in
the event of any system failure. Transactions are usually written to a transaction
log, which can be used to re-create the state of the database as it was before the
failure occurred [Ste09, Wya13].
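Atomicity and consistency can be demonstrated with a small, illustrative example (assuming Python's built-in sqlite3 module; the Account table, its values and the transfer scenario are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Account (Name TEXT, Balance INTEGER NOT NULL)")
conn.execute("INSERT INTO Account VALUES ('A', 100), ('B', 100)")
conn.commit()

try:
    with conn:  # one atomic transaction: commits on success, rolls back on error
        conn.execute("UPDATE Account SET Balance = Balance - 50 WHERE Name = 'A'")
        # Violates the NOT NULL consistency rule, so the whole unit fails.
        conn.execute("INSERT INTO Account VALUES ('C', NULL)")
except sqlite3.IntegrityError:
    pass

# The debit of account A was rolled back together with the failed insert.
print(conn.execute("SELECT Balance FROM Account WHERE Name = 'A'").fetchone()[0])  # 100
```

Although the update itself was valid, it belonged to the same atomic unit as the failing insert, so the DBMS left the data unchanged, exactly the all-or-nothing behaviour described above.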
3.5 Specialised database models
The following section presents a few other data models, but this is by no means an
exhaustive listing:
3.5.1 Document-oriented model
The document-oriented database5 is used in instances where users interact with
physical documents – they might want to view them or make amendments to them. A
normal file system does not provide the adequate searching, sorting and concurrency
control that is provided by proper databases. This database might store the contents of
the documents themselves, or merely hold links to physical files on a disk [Ste09].
4 Changes to the database should always move it from one consistent state to another consistent state.
5 MongoDB and CouchDB are common document-oriented database implementations [Nos13].
3.5.2 Temporal model
A temporal database is one that explicitly keeps track of the validity of data: each
entry will have a validity period associated with it. A good example is changing stock
prices that depend on global events, for instance the 2011 Thailand floods that caused
computer hard drive prices to soar [Ste09, Itw11].
3.5.3 Spatial model
A spatial database consists of spatial data that describe the real world. In the real world,
there are either entities or continuous fields. An entity will represent a physical
occurrence such as a river, which has geographical coordinates and boundaries. A
continuous field corresponds to a phenomenon that occurs within the world, such as
temperature changes or pollution levels within a defined boundary. A field’s data is
variable and will be updated from sensor readings or from experiential data [Man05,
Dar06].
3.5.4 Fuzzy object model
This model makes use of Zadeh’s fuzzy set theory in order to represent incomplete and
unclear values embedded in object attributes. Most real-world data are incomplete and
are often difficult to take into account; fuzzy set theory provides a means by which
this can be done [Zon05].
In fuzzy set theory, unlike binary logic, which defines truths as either true or false,
there is no exact truth. The state of an object is described relative to other states;
thus network speed might be measured in terms of slow, not slow, not fast, very fast,
and so forth. A set of IF-THEN rules is then followed to deduce actions for each state [Zon05].
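A minimal sketch of the idea (illustrative Python; the membership cut-off values and the example rule are arbitrary assumptions, not part of any cited model):

```python
# Fuzzy membership for network speed: truth is a degree between 0 and 1.
def slow(mbps):
    # Fully "slow" at or below 1 Mbps, not at all at or above 10 Mbps,
    # linear in between.
    if mbps <= 1:
        return 1.0
    if mbps >= 10:
        return 0.0
    return (10 - mbps) / 9

def fast(mbps):
    # "Fast" here is simply the complement of "slow".
    return 1.0 - slow(mbps)

# An IF-THEN rule fires to the degree that its condition is true:
# IF speed is slow THEN reduce video quality.
speed = 4.0
rule_strength = slow(speed)
print(round(rule_strength, 2))  # 0.67
```

A speed of 4 Mbps is thus neither wholly slow nor wholly fast; the rule fires partially, which is exactly the graded notion of truth that the binary model cannot express.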
3.5.5 Dimensional model
The dimensional data model deviates from the traditional relational model by
representing data in multiple dimensions of abstraction and will be discussed in
Chapter 4.
3.5.6 NoSQL model
The notion of NoSQL databases (often referred to as “not only SQL”) is a collective
term for many different database implementation types. MongoDB and CouchDB are
document-oriented database implementations. Many other instances, such as graph
databases or Google’s wide-column BigTable, also form part of the NoSQL
movement [Nos13].
3.6 Conclusion
Chapter 3 presented an overview of the relational database as the database of choice
for many application programmers when creating enterprise software. By making use of
object-relational mapping (ORM), the use of relational databases has been made
substantially easier, allowing application developers to easily create database-driven
applications.
The use of specialised database implementations and the latest technologies to solve
the problems inherent to the relational database does not solve all business problems.
This chapter has covered the modern approach to storing enterprise data, be it
transactional, document-based or geographical in nature. The data needs of
organisations change continuously, and merely storing transactional and client-related
data is not sufficient any more. Section 3.5 therefore presented specialised databases
that allow more complex data to be stored as part of operational systems. Chapter 3
has shown that the modern enterprise environment consists of operational systems that
need to cater for a myriad of real-time transaction processing tasks, ranging from
flight bookings, credit card payments and call data record processing to the
maintenance of client-related information at an insurance firm or bank. The nature of
this data is complex and much of it is generated by sub-systems that allow the
enterprise to function. The operational systems are entirely dependent on this data in
order to service any user requests that arise; without it, the organisation would not
be able to operate.
There is, however, a problem regarding the segregation of this data and the systems
that use it: the data is very closely coupled to the business unit which requires it,
and thus it often benefits only the smaller group of people who use the application
software reliant on it. Each business unit will have its own applications and data
required to function, and there is often only a limited global view of the data
maintained by all the business units.
Chapter 4 will present an overview of how the operational data that is used by various
business units, stored in their operational databases and operational systems, can be
put to use in the organisation as a whole in order to direct business decision making.
Chapter 4
Data warehousing
4.1 Introduction
Organisations today rely heavily on the use of relational databases but, as Chapter 4
will show, these are not well suited to the analysis of data. Relational databases are
structured to allow OLTP systems to operate effectively in serving user requests.
The organisational business units will often require information, in the form of
reports, pertaining to the transactional data. Chapter 4 will provide an
overview of Online Analytical Processing (OLAP) that makes use of data warehouses
and related data marts in order to analyse the transactional data. Data warehouses are
not intended to replace operational databases such as relational databases, document-
based databases or other NoSQL databases described in Chapter 3, but rather to make
the data they store available for managerial decision making purposes.
Chapter 4 starts with an overview of Business Intelligence (BI), introduces the data
warehouse and its related concepts and concludes with an overview of the real-time
analytical capabilities of an OLAP system.
4.2 Business Intelligence
Business Intelligence (BI) is “having the right access to the right data at the right time”
in order to make informed business decisions. Data can be raw facts or information that
resulted from some kind of an analysis. In essence, this will allow management to make
decisions based on factual operational data within the organisation [Sta07, Fer13].
More formally, IBM defines BI as “the conscious, methodical transformation of data
from any and all data sources into new forms to provide information that is business
driven and results oriented” [Bie03]. The fundamental premise of BI is that all decisions
that management makes should be based on facts and not subjective opinions or
educated guesses. BI aims to provide the most accurate, up-to-date and relevant data at
all times.
This data is available to any person in the organisation who wishes to view it, not
just to management [Lob09].
The database solutions discussed thus far are called Transactional or Operational
Databases and are often referred to as Production Systems. They most commonly (but
need not only) contain the transactional data of the organisation. It might for example
contain the latest sales and customer data. It would be impossible to simply keep all the
data ever captured by the organisation in the same database, hence older (or historical)
data is often removed from the production system after some time. Historical data is
however often the data required to make informed decisions [Sta07].
4.3 Data warehouse fundamentals
A data warehouse is a subject-oriented, integrated, non-volatile and time-variant
collection of data that is used to support managerial decisions. These aspects are
fundamental to data warehouses. The differences between these aspects and those of an
operational database are presented in the following sections. Formally, a data
warehouse as defined by William Inmon (known as the father of the data warehouse) is
an Analytical Database [Inm05]. The fundamental characteristics that make up the data
warehouse will be described in detail in the following sections, with references made
to the relational models as defined in Chapter 3.
4.3.1 Subject-oriented
Data in operational databases is categorised according to physical use: it reflects
the way applications use and require the data, and thus the physical manipulation of
operational data. Another way of looking at it might be to say that it follows the
business processes within the organisation. Figure 4.1 gives an example of the
subject-oriented data in an organisation. It is clear from Figure 4.1 that the data
warehouse contains data related to subjects of the business domain, whereas the
operational database follows a process-based data approach [Sil08, Lab11].
The data warehouse is designed to answer questions from the disparate departments
within the organisation and is designed around the data required by these departments.
Thus, in Figure 4.1 the Sales data from the operational database might be stored as sales
per Customer in the data warehouse so that the sales department can quickly access the
data [Rob07].
Figure 4.1: Subject orientation [Sil08]
4.3.2 Integrated
The operational database is largely application-oriented, thus the data that it uses is
stored in formats required by the applications that use them. Multiple operational (or
sometimes other) databases feed data into the data warehouse, each of which contains
data in its own format. To avoid data translations and possible data lookups while
querying the data warehouse, the data needs to be represented in a uniform manner.
Data is therefore translated beforehand and the uniform data is then integrated into the
data warehouse [Inm05, Sil08, Kim13].
Data integration occurs on three levels, namely:
i. Form
This refers to the representation used to express the data; it refers to both the
syntactic and physical representation. These different expressions are translated
consistently into a single format for use within the data warehouse. Figure 4.2
demonstrates the encoding of gender. Ultimately, this will allow data queries across
all stored subjects [Sil08].
The reason for the representational inconsistencies is that application designers and
developers are usually free to decide which representation their application will use.
Figure 4.2: Encoding of gender data [Sil08]
It does not matter which final representation the designers or developers choose for
integration, as long as it is applied consistently across all data integrated into the
data warehouse [Inm05].
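The translation of form can be sketched as follows (an illustrative Python example; the source-system names and target representation are hypothetical choices, while the gender codes mirror Figure 4.2):

```python
# Each source system encodes gender differently; the data warehouse needs
# one consistent representation ("M"/"F" is chosen here arbitrarily).
SOURCE_ENCODINGS = {
    "hr_system":     {"m": "M", "f": "F"},
    "sales_system":  {"1": "M", "0": "F"},
    "legacy_system": {"b": "M", "g": "F"},
}

def integrate(source, code):
    # Translate a source-specific code into the warehouse's uniform format.
    return SOURCE_ENCODINGS[source][code]

print(integrate("sales_system", "1"))   # M
print(integrate("legacy_system", "g"))  # F
```

Because every incoming record passes through the same translation before loading, queries against the warehouse never need to know which system a record originally came from.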
ii. Function
Function refers to the meaning of data. This is best described by considering cryptic
codes used within disparate units of an organisation. The IT division might, for
example, use the code er05 to indicate that there is a serious problem with
connectivity to a server, whereas the procurement division might use that same code as
an internal tracking number. Figure 4.3 shows how different data descriptions are
translated to a common meaning describing the function. Data of this nature will need
to be integrated into a single data unit that is able to describe the different
meanings [Sil08].
Figure 4.3: Semantic translation of data [Sil08]
iii. Grain
Grain refers to the following two concepts pertaining to data:
a. Unit of measurement
This is simply the unit of measurement of the expressed data that needs to be
converted and consistently integrated into the data warehouse. Figure 4.4
depicts the conversion of data measured in different units [Sil08].
Figure 4.4: Conversion of data format [Sil08]
b. Hierarchical representation
Disparate departments in the organisation often maintain different
hierarchical representations of data, depending on the level of detail of
data they have available or require. When this data is integrated into the
data warehouse, these hierarchical groupings will be combined and thus
form a complete view of the data [Sil08].
Figure 4.5 depicts two different hierarchical representations for a time-
key instance based on month and year. The first example shows how one
department will group time by month and then by year. The second
example depicts how the months and years are grouped on the same level
in the hierarchy.
Figure 4.5: Time-key hierarchical differences [Sil08]
4.3.3 Non-volatility
Operational data is manipulated, record-by-record, at any time and by any authorised
application or end user. This ensures that it contains (in theory at least) the most up-to-
date, accurate data to reflect the current state of the organisation. A data warehouse
dramatically differs from this in the sense that [Inm05, Lab11, Kim13]:
Data is, usually, loaded on a large-scale batch process. This is referred to as
loading a snapshot of the current operational data into the data warehouse.
Generally, once data is loaded, it will not be updated or deleted.
Instead, if updated data is found in the operational database, a new snapshot of
the data is loaded.
The methods listed above ensure that the data warehouse contains a historical record of
the organisation over time. Figure 4.6 compares the use of an operational database to
that of a data warehouse.
Figure 4.6: The non-volatility of data warehouse data [Inm05]
4.3.4 Time variant
Due to the non-volatile nature of the data in the data warehouse, a resulting
characteristic is the time variance of data. This means that the data in the data
warehouse is accurate at some point in time. Time is an important concept in a data
warehouse and can be captured either by embedding a timestamp into each record or by
associating a date with every snapshot. Time horizon refers to the length of time that
data is represented within the chosen data environment [Inm05, Lab11].
Projected data that is generated through some statistical modelling might also form
part of the data warehouse. The model uses the data it has available at a point in
time to generate projections of possible future data; this projected data is also
time-variant in the sense that it was generated from data that was relevant at some
point in time [Rob07].
4.4 Creating analytical data
This section gives an overview of how operational data is transformed into analytical
data for use within a data warehouse environment.
4.4.1 Selecting data sources
The first order of business is to decide what data to use. There are essentially four
sources of data that can be integrated into the data warehouse:
i. Operational data
The operational data of the organisation resides in many disparate database systems,
most of which are tailored to meet the needs of the applications and their users. These
database systems might include relational databases and other operational databases
such as those mentioned in Section 3.5, but they can also include legacy systems.
Ponniah states that the queries to data in an operational database are usually narrow in
the sense that only specific information is retrieved and viewed or manipulated at any
given time [Pon01].
ii. Archived data
Once operational data reaches the end of its time horizon, it needs to be removed in
order to ensure that only current and consistent data is kept in the operational
environment. The time horizon of data is not a set measure and will vary per business
requirement. The removal of data is referred to as archiving, and occurs periodically,
as dictated by organisational policies.
Many different levels of archiving exist, namely [Van12]:
Moving expired data to an archive database that is still accessible online.
Moving data from the archive database to flat file storage on mass data storage
mediums such as disks and tapes.
Moving the disk-stored data to external storage and possibly keeping it off site.
It needs to be determined beforehand what archived and historical data will be loaded
into the data warehouse in order to give enough information, but not to overload the
warehouse [Pon01].
iii. Internal data
Most departments within the organisation will keep track of the data that they, and only
they, use and produce. The organisation as a whole might not even be aware of the
existence of this data, but it could possibly be useful in the data warehouse
environment.
These are often stored in:
Documents
Spreadsheets
Departmental Databases
A departmental database is often referred to as a data mart, and will be discussed later.
The data in documents and spreadsheets presents the biggest challenge, as it needs to
be integrated by making use of applications to convert the data formats or, worse
still, manually [Pon01, Van12].
iv. External data
From a management perspective, an organisation’s overall position in the market can
only be determined when it is compared to its competitors or other influential forces in
its market environment. It is therefore essential to integrate data from external
sources into the data warehouse environment. These data sources are volatile, in the
sense that they can be updated hourly (stock markets), monthly (car sales) or annually
(financial results). Depending on when and where the data comes from, the loading of
external data needs to be planned carefully [Pon01, Van12].
4.4.2 Integrating the data
As mentioned in the previous section, the largest problem faced in building a data
warehouse is that the data from disparate sources is largely unintegrated. The first
problem is the matter of how to interface with all the data sources: if there are legacy
systems, new interface and translation applications might need to be developed. If the
operational data comes from multiple technology platforms, all these platforms will
need to be bridged in order to access and translate the data.
The second problem is integrating the data itself. As mentioned in the previous section,
data needs to be integrated across form, function and grain. Once a better
understanding of the data sources has been established, matters such as semantic data
translation can be addressed.
The third problem, according to Inmon, is to determine how to efficiently access all the
data sources and import the data to the data warehouse. The following is a suggestion of
how the data should be loaded [Inm05, Van12]:
Archival data – This is not always loaded, as organisations might decide that it is
of no use. If it is loaded, however, this happens only once, when the data
warehouse is created.
Internal data – If internal data from other sources needs to be loaded, special
arrangements must be made; this might be a once-off load or occur on a
continual basis.
Operational data – The loading of operational data occurs in two stages. The
first is to load the operational data environment (only that deemed usable for
data warehouse purposes) as it is when the data warehouse is created. The
second stage is the continual loading of updated operational data, which will be
discussed next.
External data – If the organisation decides that external data is to be loaded, the
same approach as for operational data will be followed, the only exception
being that the uploads might happen on an ad hoc basis, as the data becomes
available and is deemed useful to incorporate.
It needs to be mentioned that data is usually not pulled directly into the data warehouse
from the sources, but is usually loaded into a separate data preparation environment,
where the refinements mentioned above are performed. Once these are successfully
completed, the data is loaded into the data warehouse environment [Van12].
4.4.3 Loading Operational data
Operational data changes on an ongoing basis. A major problem, according to Inmon, is
determining which data has changed in relation to the data that is currently in the data
warehouse. This can become a serious performance bottleneck, as data sources are
often very large, because organisations store vast amounts of data to support their
operations.
It should be clarified that no direct updates are performed on the data in the data
warehouse; it is merely amended with new data from operational sources, and the old
data remains intact. The following techniques can be used [Inm05, Bul12]:
Timestamp – As stated in Section 4.3.4, data stored in operational environments
usually has a timestamp or some element of time associated with it.
Delta file – Some applications that relate to transactional processing might
create a delta file that logs the latest changes made in the operational data since
the last time the file was cleared. Delta files are efficient, since only items logged
in the file will be considered.
Log file – Most transactional applications will create a log file, which summarises
the activities done in a particular timeframe. A log file differs from a delta file in
the sense that it contains more metadata as it is often used for recovery
processes.
The log file might also have a very different and compressed internal file
structure.
Modify applications – Current applications might be modified to enhance the
data selection process, for example by producing a delta file. This is, however, a
high-risk activity, as tampering with old application code might lead to
unanticipated and disastrous ramifications. A good idea is to ensure that all
newly developed applications comply with data extraction requirements6.
Snapshot comparisons – As a last resort, when data is loaded a snapshot of the
state of the operational data can be taken. Before performing a needed import, a
new snapshot is taken of the current operational data, and these two are
compared to each other to determine which transactions differ. This is by no
means an efficient process and should be considered only when all other
attempts have not produced the desired results.
6 The suggestion to ensure that new applications adhere to data extraction requirements will not solve the legacy application problem. The lessons learnt from legacy applications that cannot serve the required data should be applied to designing new applications with reporting and data extraction in mind.
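The snapshot-comparison technique described above can be sketched as follows. The record layout and key names are illustrative only, not taken from any system discussed in this dissertation:

```python
def compare_snapshots(old, new):
    """Compare two snapshots of operational data, keyed by record ID.

    Returns the inserted, updated and deleted record IDs, i.e. the
    changes that must be applied to the data warehouse.
    """
    old_keys, new_keys = set(old), set(new)
    inserted = new_keys - old_keys
    deleted = old_keys - new_keys
    updated = {k for k in old_keys & new_keys if old[k] != new[k]}
    return inserted, updated, deleted

# Snapshot taken at the previous load versus the current operational state:
previous = {1: ("MVW Engineering", 25000.00), 2: ("STL Systems", 16500.00)}
current = {1: ("MVW Engineering", 27500.00), 3: ("STL Systems", 10000.00)}
ins, upd, dele = compare_snapshots(previous, current)
# ins == {3}, upd == {1}, dele == {2}
```

The comparison touches every record in both snapshots, which illustrates why the technique scales poorly and is recommended only as a last resort.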
4.5 Data warehouse and data models
There are two architectures that guide modern data warehouse design and
implementation: Inmon’s top-down approach and Kimball’s bottom-up approach.
4.5.1 Inmon’s top-down approach
Inmon’s model is one of the predominant approaches and is based on the creation or
pre-existence of an enterprise data model that represents the enterprise as a whole.
This model is nothing other than a classical normalised relational model, as used
within OLTP and described in the previous chapter [Lob09].
The main data warehouse model is enterprise-wide and is thus not well suited to
handling queries from specific departments. When departmentally specific data
has to be compiled frequently, it is best to create data marts to serve those specific
requirements [Lob09].
i. Data mart
A data mart can be defined as a single-subject subset of a data warehouse that is used
within departments for their analytical purposes only [Rob07].
There are two types of data marts:
Independent data mart – An independent data mart is one which is created by
the users within a certain department. There is no need to think globally when
creating one [Inm05].
Dependent data mart – A dependent data mart is one that is created from the
data within a data warehouse. Some planning is required, as the data pool
contains data from various departments. The designers need knowledge of the
larger data warehouse to create one [Inm05].
ii. Inmon’s data warehouse
Figure 4.7 depicts Inmon’s approach: in this instance dependent data marts are created
to aid ad hoc query performance. This approach is referred to as top-down because of
the constraint that an enterprise-wide model must already exist, or be created, before
the data warehouse’s inception [Lob09].
Figure 4.7: Inmon’s data warehouse architecture
The approach discussed in Section 4.5.2 should not be confused with the notion of a
data warehouse constructed from a combination of independent data marts; that
process is also referred to as “bottom-up”, but is still related to Inmon’s approach.
4.5.2 Kimball’s bottom-up approach
The fundamental problem with Inmon’s approach is the requirement that an enterprise-
wide model must be created before the data warehouse itself. This is an enormous
undertaking and often not feasible. A more favoured approach is to build and expand
the data warehouse incrementally. Kimball proposed an alternative to Inmon’s
approach, consisting of four key points, which will be discussed next [Lob09].
i. Dimensional modelling
Dimensional modelling is an approach based on the way in which the end user sees
data. The end user will always see data as it relates to various dimensions. A sale might
for example be related to a customer and a product or service. Kimball states that, to
achieve this view, data is divided into what he calls “measurements” and “context”
[Kim08].
A measurement is a tangible value, usually a numeric value or date. These
measurements are the core data captured in transactional systems; they are referred to
as facts and are stored in a fact table in the dimensional model [Kim08, Kim13].
A fact on its own is meaningless; it needs to be put into context. Kimball states that
context is captured as dimensions in the dimensional model, describing the "who,
what, when, where, why, and how" of each fact. The dimensions are stored in
dimension tables. Multiple dimension tables link to a fact table in order to give the
stored facts meaning [Kim08, Kim13].
ii. Star schema
As mentioned in Chapter 3, the process of normalisation is applied to tables in the
relational model in order to remove the replication of data, for transactional processing
performance and storage efficiency purposes. In data warehousing, the main goal is to
ensure that the data is stored in such a way as to model the way users perceive the data.
Relational tables are often denormalised into fact and dimension tables that contain
replicated data [Kim08, Ada10, Kim13]. Figure 4.8 shows a typical dimensional
structure, aptly referred to as the star schema, because the dimension tables
surround the fact table in such a way that the resulting structure resembles a
star. The fact and dimension tables are stored in a normal RDBMS and are indeed
relational tables. When the dimensional model is used to store data in an OLAP
multidimensional data environment, the resulting structure is called a cube [Kim08,
Ada10].
Figure 4.8: Dimensional modelling – star schema [Kim08]
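The star schema above can be illustrated with a small in-memory SQLite database. All table and column names below are illustrative sketches, not drawn from the dissertation’s prototype:

```python
import sqlite3

# A minimal star schema: one fact table surrounded by dimension tables.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, day TEXT);
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id    INTEGER REFERENCES dim_date(date_id),
    quantity   INTEGER,
    total      REAL
);
INSERT INTO dim_product VALUES (1, 'RFID Tag'), (2, 'SSD 300GB');
INSERT INTO dim_date VALUES (1, '27/04/2008'), (2, '28/05/2008');
INSERT INTO fact_sales VALUES (1, 1, 500, 10000.0), (1, 1, 300, 6000.0),
                              (2, 1, 2, 10500.0), (2, 2, 2, 10000.0);
""")

# Joining the fact table to its dimension tables gives the bare facts
# their business meaning (sales per product per day):
rows = db.execute("""
    SELECT p.name, d.day, SUM(f.total)
    FROM fact_sales f
    JOIN dim_product p ON f.product_id = p.product_id
    JOIN dim_date d ON f.date_id = d.date_id
    GROUP BY p.name, d.day
""").fetchall()
```

The facts (quantity, total) are meaningless numbers until the joins to `dim_product` and `dim_date` supply the "what" and "when".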
iii. Snowflake schema
In the dimensional model, a dimension is related to one table. Sometimes a dimension
table can become very large; an attempt to address this is to split out attributes of
the dimension table that are seldom accessed [Lan07]. If the creation of tables in a star
schema can be called denormalisation, then splitting up the dimension tables of the
star schema is indeed an attempt at normalising it, as seen in Figure 4.9. This should
only be done if business needs require such structures, as the multitude of joins that
must be performed to access the data in the split-up tables will hamper performance
[Lan07, Ada10].
Figure 4.9: Star schema showing derived dimension tables [Kim08]
iv. Kimball’s data warehouse
Figure 4.10 depicts Kimball’s data warehouse: the collection of all the star schemas
stored within the relational database. No separate data marts are created, because the
structure can process queries originating from various departments [Lob09].
Figure 4.10: Kimball’s data warehouse architecture
This is by no means a complete summary of the processes involved in creating and
maintaining a data warehouse. Many more aspects need to be considered to implement
a data warehouse environment, but these aspects are beyond the scope of this
dissertation. A more detailed discussion can be found in [Lob09, Kim08, Rob07].
4.6 OLAP and data warehouse
OLTP systems used for operational data are designed for optimal speed and throughput
of data. They are designed to capture operational data quickly and effectively, but are
not at all suited to analysing that data.
4.6.1 Data analysis example
To illustrate the statement made above, consider the example below [Rob07]:
Consider sales made to customers: A sale is always related to a customer and a
particular product or service and usually forms part of an invoice. An invoice will also
consist of invoice line items depicting the individual items purchased. Table 4.1 contains
the summarised information relating to a sale, listing a customer and the total spent,
which is used for an invoice summary. Table 4.2 contains the detailed transaction
breakdown, containing the products purchased, quantities and totals per product, which
will be used as line items on an invoice. In fact, the structure depicted below is exactly
the way in which most OLTP systems store this type of transactional data: as invoice
and invoice line pairs.
As mentioned in the section on data warehousing, data needs to be associated with
other data for it to make sense. In a real world environment, business analysts will often
relate the sales data presented above to customers and to products (i.e. to other
dimensions).
In Table 4.3 a multidimensional view of the sales data is presented. For the sake of
simplicity, only two dimensions, namely product and time, have been compared. The
third dimension, customer, can be thought of as linking product and time.
The data from the linear OLTP tables has been combined into summaries related to
each dimension, as represented in the body of Table 4.3. The total sales, total
quantity and grand total figures are aggregations that summarise the given data.
Table 4.1: OLTP Invoice Table
Invoice_Number (integer)  Date (date)  Customer (varchar2)  Total (double)
0001  27/04/2008  MVW Engineering  25,000.00
0002  27/04/2008  STL Systems  16,500.00
0003  28/05/2008  STL Systems  10,000.00
Table 4.2: OLTP Invoice Line Table
Invoice_Number  Line  Product  Price  Qty  Total
0001 1 RFID Tag 20.00 500 10,000.00
0002 1 RFID Tag 20.00 300 6,000.00
0002 2 SSD 300GB 5,250.00 2 10,500.00
0003 1 SSD 300GB 5,000.00 2 10,000.00
By storing data in a dimensional data structure such as the one presented, it is possible
for business analysts to ask questions relating to the various dimensions, such as:
“What were the total sales of RFID tags for May 2008?”.
If the customer dimension were visible in Table 4.3, one would be able to see which
customer was the top buyer of RFID tags. By answering such questions, the
customer market that will most likely generate the largest revenue can be identified,
and appropriate deals or specials can be structured accordingly.
Table 4.3: OLAP Dimensional Table (product and time dimensions; the customer dimension forms the third axis)
Product  27/04/2008  28/05/2008  Total Sales  Total Quantities
RFID Tag  16,000.00  1,600.00  17,600.00  880
SSD 300GB  10,500.00  10,000.00  20,500.00  4
Wifi Controller  -  6,000.00  6,000.00  12
Grand Totals  26,500.00  17,600.00  44,100.00  896
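The pivot from the linear OLTP rows into a product-by-time summary of the kind shown in Table 4.3 can be sketched as follows. The sketch uses only the invoice lines listed in Table 4.2, so its totals cover only those transactions:

```python
from collections import defaultdict

# Invoice line items from Table 4.2, flattened to (date, product, total):
lines = [
    ("27/04/2008", "RFID Tag", 10000.00),
    ("27/04/2008", "RFID Tag", 6000.00),
    ("27/04/2008", "SSD 300GB", 10500.00),
    ("28/05/2008", "SSD 300GB", 10000.00),
]

# Pivot the linear OLTP rows into a product x time summary,
# the same shape as the dimensional view of Table 4.3.
pivot = defaultdict(lambda: defaultdict(float))
for date, product, total in lines:
    pivot[product][date] += total

# Aggregations corresponding to the "Total Sales" column:
total_sales = {product: sum(per_day.values())
               for product, per_day in pivot.items()}
```

Each cell of the dimensional view is simply an aggregation over the matching OLTP rows, which is exactly the pre-aggregation an OLAP system stores ahead of time.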
4.6.2 OLAP overview
Online Analytical Processing (OLAP) systems are designed to handle the analysis of data
as presented in Section 4.6.1. OLAP may be used in conjunction with normal OLTP
systems (such as relational databases) or as an integrated part of the data warehouse
implementation. The differences between the OLAP system and the OLTP system will
now be discussed in more detail.
There are five core benefits that an OLAP system has over an OLTP system in terms of
data analysis [Sch10, Ada10, Cor12]:
Business-focussed multidimensional data – The first thing to note is that data
in an OLAP system is stored in a multidimensional data structure known as a
data cube. A cube will commonly be constructed around dimensions of data,
where each dimension represents a specific business-related view of that data.
The data is organised in such a manner that business-oriented analysis is easier to
achieve.
Business-focussed calculations – The aggregation of data involves making
informed and useful summaries of data. A core feature of OLAP is that the data is
pre-aggregated and all useful summaries are stored in the metadata inside the
system itself. In business terms these summaries often take the form of
calculations and their results. Thus, the results of many calculations are computed and
stored within the system.
Trustworthy data and calculations – There are often many data sources within
the organisation; it might be individuals working on their own spreadsheets or
even various reporting systems with their own databases. Due to the disparity of
the data and the calculations based upon it, varying results are obtained and
incorrect reports might be produced. An OLAP system thus serves as a central
repository of data upon which all calculations are based.
Speed of thought analysis – This is more commonly known as ad hoc querying,
meaning that data analysts are able to pose questions to and search through the
data stored in the OLAP data source. Due to the multidimensional nature of the
data and its pre-aggregation, searches can be performed across any
business dimension very quickly. Classical relational database systems are
unable to respond to queries of this nature without complex select statements
being formulated to bring the data together, meaning it is up to the database
developer to retrieve this information.
Self-service reporting – The business users are most familiar with their specific
domain, and have the domain experience needed to analyse the data. OLAP
systems allow the end user to easily create reports and views of the data they
require. They often include comprehensive dashboard tools as well. This
eliminates the need for the development team to continually create
reports/views of data at the implementation level.
4.7 Data Warehouse structures and NoSQL databases
Section 3.5 introduced the notion of NoSQL databases, which comprise alternative
operational databases such as document-oriented databases and key/value stores,
with MongoDB being one of the popular alternatives to the relational database. Data
warehouses predominantly make use of a star schema, which is essentially a
denormalised relational data structure. This requires the denormalisation of relational
structures into the star or snowflake schemas mentioned in Section 4.5.2.
Hadoop is an open-source platform developed to process very large datasets. A new
approach to data warehousing that is gaining traction is to make use of Hadoop
MapReduce to process the data within a MongoDB installation [Had13].
4.8 Conclusion
Chapter 3 gave an overview of operational data and the way in which it is stored in the
modern data environment. Chapter 4 introduced BI and the data warehouse
environment, showing why operational databases (such as the relational database) are
often not well suited to performing complex analysis of data. The reason is rooted in
the fact that relational databases are specifically designed for high-speed, failsafe and
succinct data queries for transactional processing in an operational environment.
Kimball’s and Inmon’s two different approaches to constructing a data warehouse,
namely bottom-up and top-down, were compared. The fundamental problem with
Inmon’s approach is that an enterprise-wide model must be in place (which implies a
holistic overview of all business units) before the warehouse structure can be
envisioned, a task often not feasible in large corporations. The approach envisioned by
Kimball suggests that a warehouse is built piece by piece by means of star schemas,
with the entire warehouse being the set of all sub-star-schemas. This approach is more
feasible, as each departmental data mart can also be integrated into a larger whole.
The document oriented database MongoDB can be used as an alternative solution to the
traditional data warehouse by making use of the Hadoop platform to process the data.
Regardless of which approach is taken to construct the warehouse, it will always
employ highly optimised multidimensional data structures. It is often supplemented
with an OLAP system that allows easier access to the views of dimensionality.
Multidimensional data is depicted in a star or snowflake schema and stored in
multidimensional data cubes within the data warehouse. Each side of a cube represents
a dimension of the data; the more dimensions, the more complex the cube, and with the
increase in complexity comes an increase in the processing power required to analyse
the data.
Chapter 4 has provided an overview of the tools and methods used by organisations to
analyse the relational data that they so heavily rely upon to service member requests.
By making use of these tools, the business units can draw conclusions by way of reports
in order to make business decisions. The problem is that the ETL processes used to
maintain the data warehouses, which drive the OLAP systems, are often run daily and
nightly, which means the data is always at best one day old.
The reports generated are all still information; they are nothing more than highly
processed transactional data. This is where data mining is introduced. Chapter 5 will
investigate data mining, its relation to data warehousing and will explain why it is
incorrect to refer to an OLAP system coupled to a data warehouse as a data mining
system.
Chapter 5
Data Mining
5.1 Introduction
The previous chapters suggested that there are discrete levels of data. Firstly, as
discussed in Chapter 3, data (as raw facts) can be stored in a database, and by making
use of a query language, these facts can be retrieved as processed information. The data
in OLTP systems is often overwhelming and contains system-derived data, meaning it
has little to offer the user beyond pure transactional data.
This is where BI steps in: as defined in Chapter 4, it is a set of tools used to gain
more information from data. By making use of multidimensional data analysis, a
greater perspective on the data can be gained, the user can ask more in-depth questions,
and some predictions can be made from the data in the data warehouse.
OLAP systems have the purpose of providing the user with more tools to analyse the
data presented in OLTP and Relational systems. The warehouse data is essentially
aggregated in multidimensional transactional fact tables maintained for the purpose of
analysis.
However, this is all still information as it is nothing more than highly processed data.
This is where data mining is introduced. Data mining often forms part of the larger
process of Knowledge Discovery in Databases (KDD) and will be discussed in Section
5.2.
Data mining is also a BI tool, and can be defined as the process of discovering hidden
information7 from existing data repositories. It is a collection of processes specifically
designed to derive highly valuable information, which when viewed in the context of the
domain it was mined in, provides knowledge and insight to the user.
7 Hidden information can be referred to as information inside known repositories which requires processing in order to be extracted in a meaningful manner. Without the processing this information could have little value to the owners, however after it has been manipulated it could provide greater insight and hence be more valuable.
5.2 Knowledge Discovery in Databases (KDD)
The KDD process is used to derive knowledge from the existing data in databases. Data
mining should not be seen in isolation, as it is actually a sub-process of the larger
Knowledge Discovery Process (KDP) discussed in this section.
5.2.1 The enterprise KDP environment
In order to gain a clear perspective of the KDP, Figure 5.1 shows the typical layout of an
industry-based data mining environment.
Figure 5.1: KDP in the production environment [Bhus10]
Figure 5.1 provides an overview of the KDP within an organisation. As
mentioned in Chapter 4, there can be multiple data sources from which to compile
target data. The organisation cannot rely on its internal data alone; it will be the
responsibility of the BI team8 to decide how and which data can be included to produce
the most relevant results [Zan99, Vog06, Tan08, Mai08, Bhus10, Bra13].
5.2.2 The source data
The source data usually does not comply with a uniform standard, as the production
environment is often relational or object-oriented. The data warehouse environment is
multidimensional, and the other sources might be Rich Site Summary (RSS) feeds,
spreadsheets or raw text file data.
8 A BI team is a dedicated team within an organisation (usually part of the organisational MIS structure) whose purpose is to facilitate operations related to BI (see page 60) such as data mining.
Unprocessed data, as mentioned in the source above, often suffers from a number of
defects, namely incompleteness and noise. Data captured from several inconsistent
systems or sources could contain incorrect or inconsistent values, or even missing data.
The source data import process cannot always be automated entirely as it requires the
BI team to ensure that the operational and data warehouse data is accurate, complete,
consistent and interpretable. Methods such as sampling can aid in the detection of
invalid data, and quantitative analysis can be used to detect noisy data. A more
in-depth discussion is given in Section 5.3.2 [Kant10, Bra13].
5.2.3 The knowledge base
The knowledge base contains domain-specific data that guides the data mining engine
and provides insight into whether a result is worthwhile. This is, however, often only
available in mature environments where the BI team is well experienced [Bhus10].
5.3 Overview of the data mining process
Figure 5.1 provided a high-level overview of knowledge discovery in an enterprise
environment (Section 5.2 overviewed the process). This section introduces the actual
data mining process9 as depicted in Figure 5.2.
9 The data mining process is often controlled by the data mining engine.
Figure 5.2: The data mining process [Ode05]
5.3.1 Determining goals
The most important part of the data mining process is to first and foremost determine
the goals. This step is often overlooked as most BI teams just decide which data to
incorporate and run the mining process based on that. However, if proper planning is
done to determine what knowledge is to be gained, the data can be more appropriately
selected. For example, if the goal is to determine why different demographic regions
have different sales figures, a number of questions need to be asked [Ode05]:
Which products are being compared?
Which regions are being looked at?
Which factors are being considered (socio-economic strata, geographic locations,
etc.)?
Once this has been determined, more detailed decisions can be made on which data
needs to be included: sales data, customer data or financial indicators for the time
period. Sales and customer data might be readily available in the enterprise
environment, but financial indicators could be external information that needs to be
incorporated [Ode05, Bra13].
5.3.2 Data selection
As discussed, data can be gathered from various sources, and it will be up to the BI team
to decide which data should be included in the process. As seen in Figure 5.2, a good
approach is first to incorporate the operational data, which, in this context, is any
enterprise data, including the operational and data warehouse environments. The
phrase “garbage in, garbage out” has never had a clearer meaning than in the data
mining process. If the target data is of low quality, the entire data mining process is of
little or no use, and the knowledge gained might not add any value.
Figure 5.3 depicts the sources of data for the data mining process. The BI team might
decide to include data from the data warehouse sub-systems, the operational databases
and external sources. Figure 5.2 shows that external data is often incorporated only
after the enterprise data has been cleaned and corrected.
Figure 5.3: Data selection
Chapter 4 introduced the data warehouse and stated that one of its primary
characteristics is that it is an integrated environment. Depending on the organisational
needs or the requirements agreed upon in the goal selection process, most of the data
might be derived from the data warehouse environment. This has the advantage that
most of the data should (to a large extent) be clean and consistent. For this reason, it is
often the case that enterprise data mining systems are built on top of existing data
warehouse environments. Figure 5.1 (see page 83) suggests that the data mining engine
interacts with the pre-processed data and is often referred to as the data warehouse
server.
a. Cleaning data
Cleaning data from dispersed locations often comprises filling in missing values,
resolving inconsistencies, dealing with duplicate data, and smoothing noisy data and
outliers.
As stated in Section 5.2.2, there is no fully automated approach to data cleaning. When
the source data is dirty, human intervention will be required to clean the data
before it is incorporated into the target data. The BI team should use the following
approaches when the data is prepared:
i. Missing data
Missing data poses a unique problem, in the sense that it indicates a definite problem in
the data-capturing phase. The best approaches to take for data mining purposes are
[Meh03]:
Use an attribute median.
Use a class median (the problem here is that the attribute would need to have
been associated with a class already).
Use regression to determine the most likely value.
Derive the attribute from other data.
If none of these can be applied to the field, it is best to ignore the record as a whole.
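The attribute-median approach listed above can be sketched as follows, assuming records are simple dictionaries in which None marks a missing value (an illustrative representation, not the dissertation’s own):

```python
import statistics

def impute_attribute_median(records, attr):
    """Fill missing values of one attribute with the attribute median.

    `records` is a list of dicts; a missing value is stored as None.
    """
    observed = [r[attr] for r in records if r[attr] is not None]
    median = statistics.median(observed)
    for r in records:
        if r[attr] is None:
            r[attr] = median
    return records

rows = [{"age": 34}, {"age": None}, {"age": 40}, {"age": 28}]
impute_attribute_median(rows, "age")
# The missing age becomes the median of the observed values, 34.
```

The class-median and regression approaches follow the same pattern, except that the candidate values are restricted to records of the same class, or predicted from the other attributes.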
ii. Resolving inconsistencies
Chapter 3 stated that inconsistent data can quickly become a problem in database
systems, with a major factor being erroneous data entries. As seen in Chapter 4, data
integration also poses a problem, as inconsistent data can arise from disparate sources.
Thus, if proper database practices are maintained in the enterprise, with best-effort
attempts at reducing inconsistencies, the only inconsistent data at this stage should
stem from the inclusion of external data. This external data is best converted and
integrated by using the same principles as for the data warehouse, as discussed in
Section 4.3.2 [Ode05, Bra13].
iii. Duplicate data
The process of integrating data will often reveal duplicate data records. This can be the
result of records having many similar fields, with only one field differing. The
approach is to determine to what extent the records are similar (by comparing fields)
and then to merge the duplicate entries into one record. Inconsistent data will often
arise, as this is a form of integration, and this needs to be handled appropriately.
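A minimal sketch of this field-comparison approach follows. The similarity threshold is an illustrative tuning choice, and a real system would refer conflicting merges to a data expert rather than resolve them automatically:

```python
def field_similarity(a, b):
    """Fraction of fields on which two records agree."""
    keys = set(a) | set(b)
    matches = sum(1 for k in keys if a.get(k) == b.get(k))
    return matches / len(keys)

def merge(a, b):
    """Merge record b into record a, keeping a's value where both exist."""
    merged = dict(b)
    merged.update({k: v for k, v in a.items() if v is not None})
    return merged

r1 = {"name": "STL Systems", "city": "Johannesburg", "phone": None}
r2 = {"name": "STL Systems", "city": "Johannesburg", "phone": "011-555-0100"}
if field_similarity(r1, r2) >= 2 / 3:   # threshold is a tuning choice
    r1 = merge(r1, r2)
# r1 now carries the phone number from the duplicate record.
```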
Figure 5.4: Visual representation of outliers
iv. Outliers and noisy data
An outlier can be defined as a record that differs considerably from the other records
in its group or class, and is classified as noisy data. The approach followed with outliers
is first to detect them and then to have a data expert determine whether the value is
wrong. A wrong value needs to be removed or adjusted, unless it actually contains
valuable information that must be kept [Agg12, Bra13].
As seen in Figure 5.4, the outliers are detected by determining that some data does not
form part of groups. This implies that outlier detection most commonly involves data
clustering. This concept will be explained in Section 5.4.
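As a simple stand-in for the cluster-based detection described above, a distance-based sketch flags records with too few nearby neighbours as outliers; the radius and neighbour thresholds are illustrative assumptions:

```python
import math

def distance_outliers(points, radius, min_neighbours):
    """Flag points with too few neighbours within `radius` as outliers.

    Points that fall inside a dense group have many close neighbours;
    points far from every group, as in Figure 5.4, have almost none.
    """
    outliers = []
    for i, p in enumerate(points):
        neighbours = sum(
            1 for j, q in enumerate(points)
            if i != j and math.dist(p, q) <= radius
        )
        if neighbours < min_neighbours:
            outliers.append(p)
    return outliers

group = [(1, 1), (1, 2), (2, 1), (2, 2)]   # a dense cluster of records
points = group + [(10, 10)]                # one far-away record
outliers = distance_outliers(points, radius=2.0, min_neighbours=2)
# outliers == [(10, 10)]
```

A clustering algorithm refines this idea by first finding the groups themselves and then flagging the records that belong to none of them.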
5.3.3 Data pre-processing
Once data has been cleaned and integrated into the target data for the data
mining process, it still needs to be transformed in order for the data mining algorithm
to process it effectively. Typical pre-processing tasks include classification, clustering
and dimensionality reduction. These tasks will be explained in Section 5.4.
Figure 5.5: Linear task-oriented overview of the data mining process
5.3.4 Data mining
Once the data has been transformed into the appropriate format, the data mining
algorithm can be used to extract patterns and derive rules. Section 5.6 provides some
detail on how these rules are derived.
5.3.5 Using the knowledge
Once patterns have been established, post-processing is applied to produce meaningful
rules and reports. The rules and reports provide new knowledge for the end users to
help them with their business decision making.
5.4 Data pre-processing tasks
As mentioned in Section 5.3, data is firstly cleaned and translated into the proposed
target data for input into the data mining process [Dat05]. The author wants to stress
that target data is often large, necessitating the data interaction layer (as depicted in
Figure 5.1 on page 83). If the warehouse data is already well formatted, this middle
layer will be used as on-line access to that data. This
implies that only data that really needs to be cleaned and corrected will be transferred
to the temporary data preparation environment. If a data warehouse environment
exists, the data standards used in the transfer process will be the norm for all other data
required for the data mining process [Han10, Agg12, Bra13].
An important standpoint taken in this dissertation is that the task of cleaning data,
although seen as a part of pre-processing by other authors [Dat05, Fre05, Gup10], is
part of acquiring the target data and not of the pre-processing of the data to conform to
data mining requirements. The pre-processing task is expensive and thus the BI team
should firstly investigate whether or not the target data is suitable and clean before
proceeding. The following section summarises the techniques used to refine the data
for this purpose:
5.4.1 Data denormalisation
In Chapter 3, the relational database model with normalisation as a key concept was
discussed. A table was transformed into 1st, 2nd and 3rd normal form during the
normalisation process. Data warehousing, covered in Chapter 4, referred to fact tables,
a table of data that denormalised relational data from multiple tables into a single table.
Denormalisation is thus the intentional process of introducing redundancies into the
data in order to speed up performance for data mining tasks. Denormalised structures
are faster because they avoid the complex table joins that must be performed in a
relational database. Data in the data warehouse environment should already be
denormalised and ready for use. Relational and external data that were transferred to
tables in the mining data staging environment should be subjected to denormalisation.
A clear understanding of the data is required, and this is most commonly still a
user-driven process [Agg12, Bra13].
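The flattening of normalised tables into a single join-free table can be sketched as follows; the tables, field names and values are invented for illustration.

```python
# Hypothetical normalised (3NF) tables: customers and their orders.
customers = {
    1: {"name": "Acme Ltd", "region": "Gauteng"},
    2: {"name": "Beta CC",  "region": "KwaZulu-Natal"},
}
orders = [
    {"order_id": 10, "customer_id": 1, "amount": 500.0},
    {"order_id": 11, "customer_id": 1, "amount": 120.0},
    {"order_id": 12, "customer_id": 2, "amount": 900.0},
]

# Denormalise: repeat the customer attributes on every order row, so the
# mining algorithm reads one flat table and no join is needed at mining time.
flat = [
    {**{"order_id": o["order_id"], "amount": o["amount"]},
     **customers[o["customer_id"]]}
    for o in orders
]
```

The customer attributes are deliberately duplicated across rows: that redundancy is the point of denormalisation.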
5.4.2 Clustering
Clustering is the process of detecting subsets of similar data. The exact nature (or
structure) of the data is not known, therefore we do not know how to classify the data
as part of a specific class, and hence the aim is to group it into clusters which have
similar characteristics [Wan05, Rui09, Eve11, Kog13]. There are three general methods
of producing clusters, namely distortion-based clustering, density-based clustering and
divisive and agglomerative clustering:
a. Distortion-based clustering
In distortion-based clustering, K-Means is the predominant algorithm. In this approach
both the number of potential clusters, c, and the sample data set, D, are given. The goal is
to group the elements of D into the cluster vectors γ_1, …, γ_c. The aim is then to minimise
the squared Euclidean distance10 between a cluster mean and the potential cluster
elements, producing cluster vectors which group elements [Hul04, Gan07, And06,
Eve11, Agg13].
i. General K-Means algorithm
The basic K-Means clustering algorithm can be performed in the following
manner:
1. initialise γ_1, …, γ_c
2. do for each ϑ ∈ D
3. calculate argmin_i ‖ϑ − γ_i‖, i = 1 … c
4. add ϑ to γ_i if the minimum distance < ε
5. recalculate ∀i: γ_i ← avg(all samples ∈ cluster i)
6. until no change in γ_i, ∀i
10 Euclidean distance refers to the distance between two known points.
The c cluster vectors γ_1, …, γ_c are initialised containing zero elements and randomised
cluster centres. Thereafter the Euclidean distance between each element ϑ of the data
set D and the current cluster vector γ_i is calculated. If the error is smaller than an
acceptable value ε, the element is added to the current cluster vector. In line 5 the
cluster mean is calculated over all samples already in the current cluster, against which
the next iteration is compared [Hul04, Eve11, Kog13].
ii. Mean squared error
Line 5 refers to the minimisation of the mean squared error (MSE) between the cluster
mean and the potential cluster elements, as given below:

MSE = Σ_{i=1..c} Σ_{ϑ∈D} M_i(ϑ) ‖ϑ − γ_i‖²

where M_i(ϑ) = 1 if i = argmin_k ‖ϑ − γ_k‖, and 0 otherwise.

This is referred to as the closest cluster prototype for the Euclidean distance and results
in quantisation regions – the data set D is partitioned into non-overlapping regions
[Eve11, Agg13].
Figure 5.6: Clustering example [Hul04]
Figure 5.6 visually depicts the output of this clustering process. Data is grouped
according to how well it fits certain clusters’ characteristics (or profile). Cluster
dimensions are usually spherical around the centre or mathematically grouped around
the mean for that cluster [Hul04, Kog13].
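The K-Means procedure above can be sketched in Python. This is a minimal illustration of the standard variant, in which every sample is assigned to its nearest centre (rather than gated by the acceptance threshold ε of the listing); the sample points and random seed are invented.

```python
import random
from math import dist

def k_means(data, c, eps=1e-9, seed=0):
    """Minimal K-Means sketch: assign each sample to its nearest cluster
    mean, recompute the means, and repeat until the means stop moving."""
    rng = random.Random(seed)
    gammas = rng.sample(data, c)                 # initial cluster centres
    while True:
        clusters = [[] for _ in range(c)]
        for x in data:
            i = min(range(c), key=lambda i: dist(x, gammas[i]))
            clusters[i].append(x)                # nearest-centre assignment
        # recompute each mean; keep the old centre if a cluster is empty
        new = [tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else g
               for cl, g in zip(clusters, gammas)]
        if all(dist(a, b) < eps for a, b in zip(gammas, new)):
            return new, clusters
        gammas = new

points = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9)]
centres, clusters = k_means(points, c=2)
```

On these four points the algorithm converges to two clusters of two points each, with centres near (0.1, 0.05) and (5.1, 5.0).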
b. Density-based clustering
In density-based clustering no assumptions are made about the number or shape of the
clusters; instead, density estimates are calculated at various data points in the data
set D. The hill-climbing (gradient ascent) approach is a popular approach [Hul04].
i. Hill-climbing approach
1. do for each element ϑ ∈ D
2. search for a point of higher density within a specific
distance K from ϑ
3. move to that point
4. until no further progress is possible
The resulting point for each element ϑ is called the locus of a local density peak, and the
set of all these loci identifies the clusters. A data point ϑ which climbs to the same
density peak as one of these clusters is considered an element of that cluster [Hul04,
Kog13].
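A one-dimensional sketch of this hill-climbing idea follows. The Gaussian kernel density estimate, the bandwidth h, the neighbourhood distance k and the sample values are illustrative assumptions, and the climb moves between data points rather than along a continuous gradient.

```python
from math import exp

def density(x, data, h=0.5):
    """Simple Gaussian kernel density estimate at point x."""
    return sum(exp(-((x - d) / h) ** 2) for d in data)

def climb(x, data, k=1.0):
    """Hill-climb from x: repeatedly move to the neighbouring data point
    (within distance k) with the highest density estimate."""
    while True:
        best = max((d for d in data if abs(d - x) <= k),
                   key=lambda n: density(n, data))
        if density(best, data) <= density(x, data):
            return x                     # locus of the local density peak
        x = best

samples = [1.0, 1.2, 0.9, 5.0, 5.1, 4.9]
loci = {climb(s, samples) for s in samples}   # one locus per density peak
```

All three points of each group climb to the same locus, so the set of loci identifies the two clusters.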
Figure 5.7: Visual representation of hill-climbing clustering [Hul04]
The image on the left of Figure 5.7 shows how cluster loci were identified using the hill-
climbing approach; the image on the right is a top-down view of the clusters and the
cluster regions which govern the elements for this data set.
c. Divisive and agglomerative clustering
In divisive clustering the data set is progressively sub-divided until all identified
clusters have been established. Figure 5.8 depicts how, even inside the identified
cluster, the data is divided into smaller clusters. In agglomerative clustering, clusters
are progressively merged as shared characteristics reveal that they have aspects in
common with other clusters [Hul04, Gan07, Eve11, Agg13].
Figure 5.8: Divisive clustering [Hul04]
Section 9.2 provides an overview of how both numeric and non-numeric data is
transformed for use in the Euclidean calculation of distances for clustering purposes.
5.4.3 Classification
Classification refers to the arranging of data into specific classes. The data within a class
is similar or shares some relation. The goal is to find a classifier that will map data to a
class, labelling it as part of that class. The label of data in a specific class distinguishes it
from data in other classes. The difference between classification and clustering is that in
classification there are pre-determined classes of data (known as labels), and one
searches for elements that fit into a class [Wan05, Ana06, Han11, Giu12, Led13].
a. K-Nearest Neighbour (KNN) classification
Provided with a dataset D and a set of labels L = {(γ_i, label_i)} defined for this dataset,
the algorithm to label data elements from that set is given as [Hul04, Giu12, Led13]:
1. for each unlabelled element ϑ ∈ D
2. calculate the K closest labelled elements of D
3. label ϑ with label_i according to the majority of
neighbours with corresponding labels
Figure 5.9: Visual depiction of K-Nearest Neighbour classification
Figure 5.9 depicts how a data element (or sample) of the dataset which does not yet
have a label is given one. In this example, seven of the ten nearest elements searched for
share the same class. The new element X will therefore be given the same label as those
seven.
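The majority-vote labelling can be sketched as follows; the labelled points, the choice of k = 3 (rather than the figure's 10) and the use of Euclidean distance are illustrative assumptions.

```python
from math import dist
from collections import Counter

def knn_label(sample, labelled, k=3):
    """Label a sample by majority vote among its k nearest labelled points."""
    nearest = sorted(labelled, key=lambda p: dist(sample, p[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

labelled = [((0.0, 0.0), "A"), ((0.1, 0.2), "A"), ((0.2, 0.1), "A"),
            ((3.0, 3.0), "B"), ((3.1, 2.9), "B")]
new_label = knn_label((0.05, 0.1), labelled)   # majority of the 3 nearest
```

Sorting the whole labelled set is O(n log n) per query; real implementations would use a spatial index, but the voting logic is the same.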
5.4.4 Dimensionality Reduction
Dimensionality of data refers to the number of components that describe the data and in
a relational database it is the number of columns in a table [Cun08, Car12b]. From a
mathematical perspective we define dimensionality reduction as:
For the n-dimensional variable ψ = (ψ_1, …, ψ_n) we seek a lower-dimensional
representation ζ = (ζ_1, …, ζ_z) where z < n. The content of the original data is
represented by ψ. The n components of multivariate vectors such as ψ are often called
features, attributes or variables [Imo02, Wan12].
By way of an example, we may look at nature and envision the birds in an estuary. We
know that we are examining birds and hence we may decide to exclude certain features
that distinguish them from other animals. We may decide to ignore common features
like “has 2 legs”, “has one beak” and “almost all species can fly” in favour of features that
we may use to distinguish birds from each other. We might end up with a set such as:
“length of beak”, “colour of feathers”, “wingspan” in order to classify certain birds. We
could use the “length of beak” in combination with “wingspan” to infer if a specific bird
eats nectar from flowers or if it is a bird of prey.
a. High-dimensional multivariate data
High-dimensional multivariate data is subject to the curse of dimensionality, a term
coined by Bellman in 1957. It states that as the number of input variables increases, it
becomes exponentially more difficult to fit data models or to optimise an objective
function in the search space. In terms of data mining, the more features there are, the
more difficult it becomes to extract meaningful data [Tia09, Cun08, Kog13].
High-dimensional data often tends to be noisy, meaning that there are unnecessary
variables describing the data. These complicate the data model and translate to
increased resource requirements to process. The dimensionality of data refers back to
the classification and clustering problems posed at the beginning of this chapter. The
classifier needs high classification accuracy and in order to achieve that it needs to work
with a simple data model. Therefore, the execution of proper dimensionality reduction
is important before any of the other data mining techniques may be applied [Ana06,
Tia09, Kay10, Car12b, Kog13].
b. Dimensionality reduction with the goal of classification: the curse of dimensionality
In order to properly explain what happens when a classifier attempts to work with a
sub-optimal highly dimensional dataset, consider the following toy problem:
There are 3 classes of objects that need to be identified in a single dimensional search
space (i.e. there is one feature, x_1, describing them), as seen in Figure 5.10.
Figure 5.10: Class identification in one dimension [Pas07]
The simplest approach would be to make use of binning, splitting the search space into
three equal bins, and then to compute the ratio of example objects in each bin in order
to determine the dominant class in each bin. In the example above, the predominant
classes for each bin are: red, green and blue with two counts in each bin [Pas07, Zam08,
Wan12].
However, the user detects that the overlap of classes in the bins is high, thus the
introduction of a second feature by which to identify the objects might help.
Figure 5.11: Class identification in two dimensions [Pas07]
Figure 5.11 depicts the introduction of a second feature, x_2. The number of bins is now
nine. The image on the left shows a scenario with the same object density per bin as in
Figure 5.10, while the image on the right indicates a scenario that has the same number
of objects in the search space [Pas07, Zam08, Wan12].
The trade-off is to decide what ratio between density and example count to use. With
constant density the number of examples becomes 3³ = 27; a low example count, on the
other hand, produces a sparse matrix.
Figure 5.12: Class identification in three dimensions [Pas07]
Figure 5.12 shows the same class identification problem but with a third feature,
x_3, added to the domain. Visually this results in 27 bins and a constant-density example
count of 3⁴ = 81, or a very sparse 3D matrix if the same number of examples is used.
c. Problems pertaining to high-dimensional datasets:
The following are some problems that arise when working with high-dimensional data [Pas07, Wan12]:
The number of examples per variable increases exponentially as the number of
variables increases.
There is an immense increase in volume by adding a higher dimensional feature – consider the jump in the examples between the 2D (27 samples) and the 3D (81 samples) spaces.
The higher the dimension, the more difficult it becomes to distinguish samples.
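The exponential growth described above can be verified with a small calculation, assuming b = 3 bins per feature and a constant density of three examples per bin, consistent with the 27- and 81-sample counts of the toy problem:

```python
def bins(d, b=3):
    """Number of bins when each of d features is split into b intervals."""
    return b ** d

def examples_needed(d, b=3, per_bin=3):
    """Examples required to keep a constant density of per_bin per bin."""
    return bins(d, b) * per_bin

growth = [(d, bins(d), examples_needed(d)) for d in (1, 2, 3)]
# (dimension, bins, examples): the example count multiplies by b per feature
```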
5.5 Dimensionality reduction techniques
In dimensionality reduction, there are two approaches that can be followed, namely
linear and non-linear methods. Linear methods are often referred to as feature
extraction, whereas non-linear methods are referred to as feature selection.
5.5.1 Feature extraction (linear)
The most common linear dimensionality reduction technique is Principal
Component Analysis (PCA) [Chu05]. Referring to the datasets defined previously, one
has the following linear transformation:

ζ = ψA

where ψ is the original x-by-y data matrix and A the y-by-z transformation matrix,
which will project ψ onto the z-dimensional subspace ζ, with the implied constraint that
z < y [Chu05].
The construction of matrix A is thus the goal. A Principal Component (PC) is an
orthogonal linear combination of the original variables which satisfies some variance
constraint, and in the case of matrix A, a PC represents one column [Imo02, Boz03,
Ban05, Wan12].
The first PC, c_1, is the one with the largest variance and can be defined as:

c_1 = ψ_1^T ψ

The y-dimensional coefficient vector ψ_1 for c_1 is defined as:

ψ_1 = (ψ_{1,1}, …, ψ_{1,y})

where ψ_1 must also solve the following constraint:

ψ_1 = arg max_{‖ψ_1‖=1} var{ψ_1^T ψ}
The second PC will be the linear combination with the second largest variance that is
also orthogonal to c_1. Continuing this pattern, matrix A can be constructed from y PCs;
hence there are as many PCs as original variables, each successive PC having less
variance [Imo02, Boz03].
For very large datasets the first few PCs often define most of the variance in the set and
the others may be disregarded. Using this approach, a much leaner set of data can be
built, which explains most of the data from the original dataset.
The calculated variance, however, often depends on the magnitude of the variables (i.e.
how similar in measurement the data is to begin with). It is the norm to first standardise
the data to have a standard deviation of one and a mean of zero.
Thus the covariance matrix Σ is:

Σ_{y×y} = (1/n) ψ^T ψ

and can be used in conjunction with the spectral decomposition theorem to derive Σ as:

Σ = U Λ U^T

where Λ = diag(λ_1, …, λ_y) is a diagonal matrix of ordered eigenvalues, λ_1 ≥ … ≥ λ_y,
and U is the orthogonal matrix made up of the corresponding eigenvectors. The first z
columns of U, written U_z, represent the derived projection matrix A [Imo02, Boz03].
The principal components can be derived in the following way:

ζ = U_z^T ψ

where ζ represents the z-dimensional subspace onto which ψ is projected. ζ is now
regarded as a reduced dataset upon which data mining tasks can be carried out [Imo02,
Boz03, Ban05, Wan12].
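The PCA recipe above (centre the data, build the covariance matrix, eigendecompose, project) can be sketched for the two-variable case, where the leading eigenvector of a symmetric 2x2 matrix has a closed form; the data points are invented and only the first PC is kept (z = 1).

```python
from math import sqrt

def pca_first_component(data):
    """PCA sketch for two-variable data: centre, build the 2x2 covariance
    matrix, take its leading eigenvector and project onto it."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    centred = [(x - mx, y - my) for x, y in data]
    # covariance matrix [[a, b], [b, c]]
    a = sum(x * x for x, _ in centred) / n
    b = sum(x * y for x, y in centred) / n
    c = sum(y * y for _, y in centred) / n
    # largest eigenvalue of the symmetric 2x2 matrix
    lam = (a + c) / 2 + sqrt(((a - c) / 2) ** 2 + b ** 2)
    # its (unit) eigenvector is the first principal component
    if b != 0:
        vx, vy = b, lam - a
    else:
        vx, vy = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = sqrt(vx * vx + vy * vy)
    u = (vx / norm, vy / norm)
    # projecting each centred sample onto u yields the reduced 1-D dataset
    return [px * u[0] + py * u[1] for px, py in centred], u

data = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.9)]
reduced, first_pc = pca_first_component(data)
```

For higher-dimensional data the same steps apply, but a numerical eigendecomposition routine would replace the closed-form 2x2 solution.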
5.5.2 Feature selection (non-linear)
Feature extraction essentially applies a mapping to the original dataset to derive a new
smaller set. Feature selection aims to find a subset of the original dataset by evaluating
the relevance of each variable from the original set [Sik07, Khu08, Tan09, Wan12].
Hence heuristic methods form the essence of this technique [Hus02]. These methods
will be investigated in Chapter 8 of this research.
The following section describes the data mining algorithm’s rule extraction phase,
which uses the features as derived by the techniques introduced in Section 5.5 as an
input.
5.6 Rule Extraction
After the data pre-processing tasks have been completed, the data that must be mined is
in an appropriate format for the data mining algorithm to run effectively. A data mining
algorithm can also be classified as a search algorithm, as it executes a process known as
rule extraction. In essence it builds a Boolean expression tree comprised of IF-THEN-
ELSE statements, which represents the rules applicable to the data [Wan05, Wit05,
Ant09].
Rule extraction is a process that can be broken up into two distinct processes [Agg12,
Bra13]:
Searching for rules based on output classes in a dataset containing well-defined
labels.
Searching for rules in datasets that do not contain labels; this process is defined
as association rule mining.
The two processes listed above form part of the rule mining and extraction part of the
KDP and they will be discussed in the following section.
5.6.1 Association Rule mining
Association rules are the mathematical expressions of IF-THEN-ELSE statements and
are most commonly used to process transactional data. The expression 퐴 ⇒ 퐵 implies
that if a transaction 푇 ∈ 퐷 (where D is the transactional dataset) contains A then it will
also contain B [Gho04, Wan05, Vog06, Kuo09, Raj11, Bra13].
By way of an example, an association rule derived from online shopping transactions
could show that people who bought a laptop were also likely to buy a laptop bag or
peripherals such as a mouse. The derived rule could thus be used to improve the
structure of the online store, for instance to advertise these items in the laptop section
or to bundle them with laptops as part of deals.
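The rule A ⇒ B is usually quantified by its support and confidence; a small sketch over invented shopping baskets follows (the support and confidence definitions are standard; the transactions are hypothetical).

```python
transactions = [
    {"laptop", "laptop bag", "mouse"},
    {"laptop", "laptop bag"},
    {"laptop", "mouse"},
    {"phone", "charger"},
]

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(a, b):
    """Confidence of the rule A => B, i.e. P(B | A)."""
    return support(a | b) / support(a)

conf = confidence({"laptop"}, {"laptop bag"})   # 2 of the 3 laptop baskets
```

Algorithms such as Apriori enumerate candidate itemsets efficiently, but each candidate rule is still scored with exactly these two measures.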
5.6.2 Rule extraction algorithms
Many rule extraction algorithms have been proposed, including ones based on neural
networks, fuzzy logic systems and Genetic Algorithms. Regardless of which algorithm is
chosen for the extraction process on a specific dataset, the following pre-conditions
apply [Gho04, Wan05, Wit05, Ols08, Bra13]:
The dataset must be pre-processed as described earlier in this chapter.
The values of the dataset should be normalised, meaning that they should fall
within a defined range. For most algorithms, the range [0,1] is used.
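Rescaling values into the range [0,1] is commonly done with min-max normalisation; a small sketch follows (the input values are invented).

```python
def min_max_normalise(values, lo=0.0, hi=1.0):
    """Rescale values linearly so that they fall within [lo, hi]."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        return [lo for _ in values]     # constant column: nothing to spread
    return [lo + (v - v_min) * (hi - lo) / (v_max - v_min) for v in values]

ages = [18, 25, 40, 62]
scaled = min_max_normalise(ages)        # smallest maps to 0.0, largest to 1.0
```

Other schemes (e.g. z-score standardisation) exist, but min-max is the simplest way to satisfy a fixed-range precondition.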
5.7 Conclusion
Chapter 5 introduced the concept of data mining, why it is used and how it relates to the
other data warehouse activities such as OLAP analysis. Data mining is the process of
transforming raw information into knowledge. The output of data mining is knowledge
in the form of tangible rules, and these rules can be used to discover what trends,
relations and hidden facts are present in the data.
It showed that data from the data warehouse and other sources needs to be
denormalised and restructured for a data mining algorithm to be applied to it. The task
of data pre-processing is a three-part process:
i) classification which determines if the data is part of a specific class,
ii) clustering which aims to classify data that could not be mapped to a class, and lastly,
iii) dimensionality reduction.
Dimensionality reduction is the most important part of the process, as it reduces the
original dataset to a smaller subset and allows for a normalised dataset.
The data mining process is often an afterthought for many organisations, as they
attempt to collate data from various sources only when the need arises. Data from
relational databases and highly structured data from data warehouses are selected as
target data; this data needs to be transformed, and then the pre-processing tasks can be
run on it before any data mining algorithms can derive rules. Chapter 8 will introduce
the EPS4DM model, which addresses this notion of data mining as an afterthought: it
incorporates all three pre-processing tasks, as well as data extraction from a relational
database, in order to maintain pre-processed data. This allows the data mining
algorithms to have a repository of pre-processed data at their disposal.
Chapters 6 and 7 will investigate in detail the way in which Intelligent Agents and
Computational Intelligence (CI) based methods will be applied in the EPS4DM model’s
pre-processing tasks.
[Dissertation roadmap: 1. Introduction; 2. Data storage evolution; 3. Database models;
4. Data warehousing; 5. Data mining; 6. Intelligent Agents; 7. Computational
intelligence; 8. Data pre-processing system (EPS4DM Model); 9. Data extraction and
denormalisation; 10. Clustering; 11. Classification; 12. Implementation and Results;
13. Conclusion and future work]
Chapter 6
Intelligent Agents
6.1 Introduction
Intelligence is not an easily defined concept. The Oxford English Dictionary defines it as
“the ability to acquire and apply knowledge and skills”. Merriam-Webster defines it as
“the ability to learn or understand or to deal with new or trying situations” and as “the
ability to apply knowledge to manipulate one's environment” [Oxf13].
The author defines intelligence as “the ability to understand and to learn from
experience”. More specifically, it is the ability to learn or “catch on” without solely being
taught or told how to act by another intelligent agent or entity. Building upon the
aforementioned, intelligence allows for logical reasoning, the
ability to plan, and the ability to solve problems. In essence it is the premise for rational
thinking – the ability to make the correct decisions based on specific conditions. One
gains knowledge through experience and intelligence allows people to correctly apply it
to solve a problem.
6.2 Artificial Intelligence
Artificial Intelligence is the science of understanding the nature of thought and
intelligent behaviour, and to then derive a way to mimic this behaviour in a computer-
based system. Its main objective is to create intelligence in computer-based systems,
either with regard to human behaviour or without reference to it. In the former case, it
attempts to recreate the capabilities of the human mind, in both the ability to make
decisions and to learn. In the latter case, its goal is to create seemingly intelligent
systems which solve tasks for people. An example would be an intelligent search bot,
which sifts through information. Artificial intelligence not only deals with recreating
reasoning and problem solving, but also with other real-life aspects such as Computer
Vision and Natural Language Processing [Rus10].
6.3 Intelligent Agent
An intelligent agent is the artificial representation of an intelligent entity that can
interact with its environment, learn from it and use its ability to reason in a way to
achieve certain goals that are vital to its survival in that environment. An agent must be
capable of autonomous action within its environment; it should not rely on human
intervention to operate but rather make use of its own precepts to react to change
[Lon09, Woo09].
An intelligent agent perceives its environment and takes the appropriate actions to
achieve optimal success within that environment. The agents are required to exist for an
arbitrary time, to adapt to changes within their environment and to achieve certain
goals (such as navigating a maze). Agent learning is a core attribute, whereby an agent
will use past experiences and gained knowledge to achieve its goals in an easier and
quicker manner than before. An intelligent agent’s main goals are the following: display
deductive reasoning and make inferences with regard to an operating environment; and
make decisions to accomplish a goal, based on either past experience, current
information or, conversely, the lack thereof [Rus10].
Agents can learn in any number of ways, including: single-agent learning that allows an
agent to learn from its operating environment (such as learning to successfully navigate
a maze) or from a combination of the aforementioned and another entity within the
environment (such as learning an opponent’s strategy whilst playing a game). Agents
are expected to accomplish their task in the best possible way at all times, even if there
exists some uncertainty, in which case they should opt for the best expected outcome.
6.3.1 Components of an Agent
An agent is essentially a combination of reference architecture and a program
implementing it. A very simple agent may have:
Sensors – These are more accurately described as the agent’s percepts and allow
interaction with the environment.
Condition-action rules – They determine what actions need to be taken.
Actuators – They interact with the environment to produce the desired results.
Figure 6.1: Architecture for a simple reflex agent [Rus10]
Figure 6.1 depicts the architecture for a very basic agent. The following is the agent
program for the abovementioned agent:
function SIMPLE-REFLEX-AGENT(percept) returns action
static: rules – a set of condition-action rules
state ← INTERPRET-INPUT(percept)
rule ← RULE-MATCH(state, rules)
action ← RULE-ACTION[rule]
return action
As can be seen, depending on the type of input the agent receives from its sensors (or
percepts), a matching rule is selected from the set of known condition-action rules,
which determines what action the agent should take [Rus10].
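The SIMPLE-REFLEX-AGENT program can be sketched in Python; the vacuum-style percept, the state names and the two condition-action rules are invented for illustration.

```python
def interpret_input(percept):
    """INTERPRET-INPUT: map a raw percept to an internal state description."""
    return "dirty" if percept["dirt"] else "clean"

# condition-action rules: state -> action
rules = {"dirty": "vacuum", "clean": "move_on"}

def simple_reflex_agent(percept):
    state = interpret_input(percept)
    action = rules[state]      # RULE-MATCH and RULE-ACTION in one lookup
    return action

chosen = simple_reflex_agent({"dirt": True})
```

Because the agent consults only the current percept, it has no memory: the same percept always yields the same action.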
6.3.2 Properties of Agents
Agents can exhibit a number of properties; among these, as defined by Wooldridge [Woo09], are:
Autonomy – an agent must be able to operate without any external intervention
Sociability – agents must be able to communicate with each other specifically in
multi-agent systems
Adaptability – agents should be able to react and adapt to changes within their
environment
Pro-active behaviour – the converse of adaptability, whereby agents should be
able to take action to achieve their goal and not just react to changes
According to Franklin, there are four key attributes that differentiate agents from
simple programs, they are: reaction to the environment, autonomy, goal orientation and
persistence [Fra96]. Agents operate within environments and the next section will
discuss a few environments in which an agent can be applied.
6.3.3 Agent environments
All agents operate within an environment that they perceive. According to Russell
[Rus10] the following different environments can be identified:
Fully observable vs. partially observable – In a fully observable environment the
agent’s sensors will be able to perceive the entire space. In a partially observable
environment only a subset of the environment will be perceived.
Deterministic vs. stochastic – A deterministic environment is one in which the
next state will be determined by the agent’s actions (such as in a game of chess)
whereas a stochastic environment produces more random changes between
states.
Static vs. dynamic – In a static environment there are no external changes
between agent actions. In a dynamic environment there could be changes that
occur while agents take actions.
Discrete vs. continuous – A discrete environment will have a finite set of possible
states, while a continuous environment could have an infinite number of states.
Single agent vs. multi-agent – A single agent environment is one in which there is
one agent trying to achieve its goals. In a multi-agent environment there are
many agents, each working together to achieve a common goal.
The following section provides an overview of the different agent types.
6.3.4 Agent types
A list of the different types of agents that can be employed in an intelligent system will
now be provided:
Reflex agent: An agent whose actions rely solely on its current percept, not
taking into account any previous percepts. Figure 6.1 (see page 107) depicts a
reflex agent.
Model-based reflex agent: This agent keeps internal state information that
relates to the current environment in which it operates (essentially it has a
“model” of its environment which it continuously updates). A model-based reflex
agent is defined as an agent that makes use of such a model.
Goal-based agent: This agent defines knowledge about which goals it has to
achieve in its environment. It chooses its actions to achieve these goals.
Utility-based agent: This agent relies on a utility function that will map its
current state to a real number in order to decide whether the agent will be
“happy” if its current goal is achieved. This allows for the creation of trade-offs
when conflicting goals arise.
Learning agent: This is an agent that uses its past experiences and learns from
them in order to improve its performance [Rus10].
6.4 Multi-Agent Systems
In a Multi-Agent System (MAS), the focus is on distributed computing. A specific agent
will be created for each type of problem that has to be solved. This makes agent
implementation easier as one agent only needs to focus on solving one problem in the
system. A MAS consists of a network of intelligent agents, where each individual agent
has to focus on its own task, but the agent will need to interact with other agents to
solve the greater problems at hand [Wei00, Sho08, Lon09].
The introduced problem is that agents now need to communicate with each other to
complete all the goals required for the system to produce its desired results. To achieve
this, an interaction protocol needs to be designed that closely relates to the
environment in which the intelligent system finds itself and to its goals. The concept of a
MAS can be related to the way in which a team of individuals works together to achieve
its set objectives.
The common agent characteristics are mentioned in the next section [Kep10].
6.4.1 The characteristics of the agents in a MAS
The characteristics of the agents that form part of a MAS are [Kep10]:
Agent autonomy – each individual agent is required to operate on its own.
Localisation of percept information – individual agents do not have a view of the
entire global system.
Decentralisation – the system cannot be controlled from a single point, each
agent works together with others to achieve their set goals.
The following section describes some advantages of making use of a MAS.
6.4.2 The advantages of using a MAS
The following are advantages of making use of a MAS as opposed to single agents
[Uhr09, Car12a]:
A system making use of single agents will be subject to performance bottlenecks,
single point failures, as well as resource limitations on the hardware
infrastructure. A distributed MAS is decentralised and it can be tailored to avoid
these problems.
The data and information required by the agents is often spatially distributed
and thus a MAS allows for the effective retrieval and global coordination of this
data for use by each agent in the MAS.
A MAS more accurately mimics real-world task allocation. In a team situation,
each member is assigned a task to work on and contributes to the global success
of the team’s goals. Similarly a MAS allows for the use of multiple interacting
agents working together to solve a problem.
In situations where legacy systems cannot be rewritten, an agent wrapper can be
created for the system and this agent can then be incorporated into a MAS.
A MAS allows for increased performance by increasing the computational
efficiency of the tasks at hand, and adds characteristics to the overall system such
as reliability, flexibility, load-balancing and scalability.
The next section describes how agents interact and cooperate within a MAS.
6.4.3 Interaction and cooperation in a MAS
Agents within a MAS are required to interact with each other and the most common
form of communication is via signals (or messages) that have defined interpretations.
Multi-agent planning is a way of ensuring that agents interact according to some set of
rules. Social laws can also be used whereby each agent is aware of what they can or
cannot do [Tab09, Rus10].
An example of this would be a vehicle agent that is aware that it needs to drive on the
right-hand side of the road in its environment. Agents do not always collaborate and
may sometimes be in competition with each other. Competing agents are aware that
there are other agents in the environment and that their actions influence one another,
such that one agent may succeed where another does not [Tab09, Rus10].
In a cognitive MAS agents could interact by making use of a principle from speech act
theory. This concept relies on two rules pertaining to the messages passed, namely the
content of the actual message and most importantly the intention of the message. The
Knowledge Query and Manipulation Language (KQML) is a common message passing
protocol used in a MAS. The Knowledge Interchange Format (KIF) is used within the
protocol and describes first-order logic rules pertaining to the message [Tab09].
In order for agents to cooperate within the MAS, they need to have an aligned set of
goals. The two main approaches are that of cooperative and self-motivated agents.
Cooperative agents each have their own set of capabilities to solve the task they have
been assigned. When multiple cooperative agents are combined, each agent solves its
assigned problem and thereby contributes to the global solution; in effect they
are working as a team towards a common goal. Self-motivated agents have a set of goals
that they need to achieve and often compete against each other to achieve them. In
order to achieve proper agent coordination each agent must be aware of the existence
of other agents and thus coordinate their actions with each other accordingly [Tab09].
6.5 Conclusion
Intelligent agents allow intelligent computing systems to be created. These are systems
that can solve complex problems. At the centre of any intelligent system is the agent, a
self-contained unit that can interact with the problem environment or other agents and
produce the required actions to meet the system’s goals. Multi-agent systems allow for
the distribution of intelligent agents within an environment to solve complex problems.
Not only does this bring increased performance, but it can also create a system with
better scalability and increased robustness.
As alluded to in Chapter 5 the EPS4DM model aims to perform the data pre-processing
tasks by following three steps, namely data extraction, clustering and classification. To
this end the EPS4DM model makes use of cooperating agents to control each of these
actions. There are three main pre-processing actions, namely denormalisation,
classification and clustering, each of which is a goal that needs to be achieved within the
EPS4DM model.
Chapter 6 has provided an overview of intelligent agents, their properties and the
environments they can be deployed in.
Chapter 7 introduces Computational Intelligence, a branch of Artificial Intelligence that
studies nature’s approaches to solving problems. The techniques discussed in Chapter 7
will form the basis of an agent’s action program.
[Dissertation roadmap diagram: Chapters 1–13 arranged around the EPS4DM model]
Chapter 7
Computational Intelligence
7.1 Introduction
Chapter 6 provided an overview of intelligent agents, as they form the basis of the
operation of the EPS4DM system. Their functionality is however dependent on methods
found within the branch of heuristic algorithms known as Computational Intelligence
(CI). CI deals with adaptive environments and biologically inspired methods.
Conventional techniques often fall short and cannot solve complex problems as
accurately as CI-based techniques can. The most popular CI techniques are Artificial
Neural Networks, Computational Swarm Intelligence, Fuzzy Systems and Evolutionary
Computation.
Chapter 7 will provide an overview of Computational Swarm Intelligence and the
Genetic Algorithm (GA). The EPS4DM model makes use of a Particle Swarm
Optimisation (PSO) algorithm for its clustering operation and a GA for its classification
and dimensionality reduction operation.
7.2 Computational Swarm Intelligence
Computational Swarm Intelligence is based on the natural emergent behaviour that is
evident in self-organised systems, such as shoals of fish, flocks of birds or a colony of
ants [Kar09]. There are two main approaches in Swarm Intelligence (SI), namely
Particle Swarm Optimisation and Ant Algorithms [Eng07, Kir13].
7.2.1 Particle Swarm Optimisation
Particle Swarm Optimisation (PSO) is an optimisation technique that is inspired by the
collective intelligence found within swarms, such as a flock of birds or a shoal of fish.
The idea is that particles move around within an n-dimensional search space of the
problem domain, and each particle searches for an increasingly improved solution. Each
particle remembers its own best solution and compares it with the best solution found
by the swarm (or a subset thereof), i.e. it uses both its own information and its
neighbourhood’s information. These particles each have a velocity 푣, with which they
move from their best solution to the next best solution as determined by a fitness
function. As a result, the swarm will converge to the best global solution of the problem
[Bla02, Chi05, Mur06].
The components of a PSO are discussed in the remainder of this section:
i. Particle Velocity
Each particle has a velocity at which it travels within the search domain. Depending on
the type of PSO algorithm employed, the velocity vector is updated as follows [Eng07,
Ols12, Kir13]:

v_ij(t + 1) = w·v_ij(t) + c_ij(t) + s_ij(t)

Equation 7.1 – Particle velocity

Equation 7.1 presents the base global-best velocity update equation as used in the gbest
PSO algorithm. Its constituents are [Eng07, Ols12, Kir13]:

Local component (the particle’s current velocity) – v_ij(t).
Inertia weight w – controlling the trade-off between exploration and exploitation.
Cognitive component – c_ij(t) = c_1·r_1j(t)·[y_ij(t) − x_ij(t)], taking into account
the particle’s personal best solution found so far, y_i(t).
Social component – s_ij(t) = c_2·r_2j(t)·[z_j(t) − x_ij(t)], taking into account the
swarm’s global best solution found so far, z(t).

The velocity update occurs for particle i in dimension j of the domain at time interval t.
The other components are [Eng07, Ols12, Kir13]:

The particle’s current position in the domain – x_i(t).
r_1j, r_2j are stochastic elements, commonly drawn from a uniform distribution on
[0, 1].
c_1, c_2 are acceleration constants governing the stochastic elements.
7.2.2 Particle Position
As mentioned in Section 7.2.1, the particle’s position, x_i(t), forms part of the velocity
update calculation, as its distance from more optimal search regions is taken into
account. The particle position represents one possible solution in the problem domain
and is used in fitness evaluations. In its most basic form the particle position update is
a simple summation [Eng07, Ols12, Kir13]:

x_i(t + 1) = x_i(t) + v_i(t + 1)

Equation 7.2 – Particle position

where v_i(t + 1) is the velocity as calculated in Equation 7.1.
7.2.3 Exploration vs. exploitation
The velocity equation plays a key part in the PSO algorithm; after all, this is what drives
particles to move. However, particle velocities can grow rapidly, resulting in large
step sizes being taken in each iteration. Exploration is required in
the initial stages of optimisation to allow the swarm to search many areas of the
domain. Exploitation is needed in the later stages, allowing the refinement of solutions
within a certain region of the domain. Section 4.2 investigates this trade-off [Eng07,
Sim13].
7.2.4 Fitness evaluation
In terms of mathematical minimisation, the fitness function is the problem that needs to
be optimised. In the case of data mining, Chapter 8 will investigate the details of how
PSO can be applied to the data mining pre-processing tasks.
7.2.5 Basic Algorithm
In its simplest form, the basic PSO algorithm will consist of two procedures [Eng07,
Ols12, Kir13]:
Step 1: Calculate the particles’ best positions
Making use of the fitness function defined for the problem domain, the best positions
are calculated for:
each particle at iteration t – y_i(t) in Equation 7.1;
the entire swarm at iteration t – z(t) in Equation 7.1.
Step 2: Move particles
Calculate the new velocity of each particle at iteration t using Equation 7.1.
Calculate the new position of each particle at iteration t using Equation 7.2.
This process introduces two concepts, namely synchronous and asynchronous updates.
The base algorithm uses synchronous updates, whereby the best positions are
calculated per iteration and then particles are moved. Asynchronous updates allow for
the best positions to be calculated immediately after a position update has occurred,
thus providing immediate feedback to the algorithm [Eng07].
Steps 1 and 2 will be repeated until the swarm has converged on a best solution, or
more accurately, when the swarm stopping conditions have been met. Figure 7.1 depicts
the way in which a few particles will move at a velocity 푣 for each iteration.
Figure 7.1: Particles moving towards the new global best [Eng07]
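For illustration, the synchronous gbest loop of Steps 1 and 2 can be sketched in Python. This is a minimal sketch only: the spherical fitness function, the parameter values and all names below are illustrative assumptions, not part of the model.

```python
import random

# Minimal synchronous gbest PSO sketch, minimising a spherical fitness
# function. Parameter values (w, c1, c2) are common illustrative choices.
def fitness(x):
    return sum(v * v for v in x)

def pso(dim=2, swarm_size=10, iterations=50, w=0.7, c1=1.4, c2=1.4):
    # Initialisation: stochastic positions, zero velocities (Section 7.2.6).
    pos = [[random.uniform(-5, 5) for _ in range(dim)] for _ in range(swarm_size)]
    vel = [[0.0] * dim for _ in range(swarm_size)]
    pbest = [p[:] for p in pos]           # personal best positions, y_i(t)
    gbest = min(pbest, key=fitness)[:]    # swarm's global best, z(t)

    for _ in range(iterations):
        # Step 1: update personal and global best positions.
        for i in range(swarm_size):
            if fitness(pos[i]) < fitness(pbest[i]):
                pbest[i] = pos[i][:]
        gbest = min(pbest + [gbest], key=fitness)[:]

        # Step 2: move particles using Equations 7.1 and 7.2.
        for i in range(swarm_size):
            for j in range(dim):
                r1, r2 = random.random(), random.random()
                cognitive = c1 * r1 * (pbest[i][j] - pos[i][j])
                social = c2 * r2 * (gbest[j] - pos[i][j])
                vel[i][j] = w * vel[i][j] + cognitive + social
                pos[i][j] += vel[i][j]
    return gbest
```

Run on the spherical function, the swarm converges towards the origin, the global minimum.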
7.2.6 Initialisation
To get the algorithm going, the initial position of each particle is chosen using a
stochastic method and its velocity is set to 0. This initial position is also set as the
particle’s current best position. This gives the velocity update equation (Equation 7.1)
an initial position to start from at the first iteration of the algorithm.
Initialisation is more important than it seems, since a bad initialisation of particles can
reduce the swarm’s chances of finding the global minimum.
7.2.7 Stopping conditions
Specific stopping conditions for PSO include capping it at a maximum number of
iterations, stopping if there is no significant improvement in the solution for a number
of iterations or when the swarm radius is small enough to assume that the swarm has
converged [Sim13].
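These three conditions can be combined into a single check, sketched below; the threshold values and function names are illustrative assumptions rather than prescribed values.

```python
# Illustrative PSO stopping-condition check (thresholds are my own choices).
def should_stop(iteration, best_history, positions,
                max_iter=200, patience=20, tol=1e-6, radius_eps=1e-4):
    # 1. Maximum number of iterations reached.
    if iteration >= max_iter:
        return True
    # 2. No significant improvement over the last `patience` iterations.
    if len(best_history) > patience:
        if best_history[-patience - 1] - best_history[-1] < tol:
            return True
    # 3. Swarm radius small enough to assume the swarm has converged.
    centre = [sum(col) / len(positions) for col in zip(*positions)]
    radius = max(
        max(abs(p[j] - centre[j]) for j in range(len(centre)))
        for p in positions
    )
    return radius < radius_eps
```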
7.2.8 Charged Particle Swarm Optimisation (CPSO)
CPSO is a popular PSO-based algorithm that offers improvements over the basic gbest
implementation. In this method, each particle within the swarm is given a charge in
addition to its velocity: the charge is either 0 (neutral), a value greater than 0
(positive) or a value smaller than 0 (negative). The aim of CPSO is that particles with
similar charges will repel each other, in the hope that they will find other (and possibly
improved) solutions within the domain.
CPSO aims to solve the problem of lost diversity. If the diversity factor drops below a
specified minimum, particles move away from each other rather than towards each
other, ensuring that more possible solutions are explored [Mur06].
The normal PSO procedure is followed. However, once the diversity factor (which
measures how much of the search domain the particles have explored) drops below a
specified minimum, the following happens, as shown in Figure 7.2:
Particles are assigned a value that represents a charge, as described above; this
charge relates to the individual fitness evaluation of each particle.
Instead of particles moving toward the determined global best solution, they are
repelled from each other.
This will ensure that there is enough diversity in the search domain in order to avoid
premature convergence at a locally determined optimum that is perceived as the global
optimum. Therefore, other possible solutions are explored [Mur06].
Figure 7.2: Charged particles being repelled [Eng07]
Ensuring that there is adequate diversity within the search domain gives the
PSO algorithm its best chance of finding the global optimum. The multi-swarm
technique, discussed next, is another way to achieve this.
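A minimal sketch of the diversity check and the repulsion step might look as follows; the diversity measure, the constants and all names are my own illustrative choices, and [Mur06] should be consulted for the full method.

```python
import math

# Illustrative CPSO-style diversity check and repulsion step.
def diversity(positions):
    # Mean distance of the particles from the swarm centre.
    dim = len(positions[0])
    centre = [sum(p[j] for p in positions) / len(positions) for j in range(dim)]
    return sum(math.dist(p, centre) for p in positions) / len(positions)

def repulsion_step(positions, charges, step=0.1, d_min=1e-3):
    # When diversity drops too low, like-charged particles repel each other.
    if diversity(positions) >= d_min:
        return positions  # normal PSO behaviour continues
    moved = []
    for i, p in enumerate(positions):
        push = [0.0] * len(p)
        for k, q in enumerate(positions):
            if k == i or charges[i] * charges[k] <= 0:
                continue  # only like (non-neutral) charges repel
            d = math.dist(p, q) or 1e-9
            for j in range(len(p)):
                # inverse-square push away from the other particle
                push[j] += (p[j] - q[j]) / (d * d)
        moved.append([p[j] + step * push[j] for j in range(len(p))])
    return moved
```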
7.2.9 Multi-Swarms
The problem of lost diversity can also be solved by employing multiple swarms within
the environment. This large colony of swarms can then interact with each other in the
same way individual particles do in order to find the optimum. However, much care
must be taken to ensure that swarms do not readily converge on the same solution
within a highly dynamic environment [Bla02, Fon09].
In order to solve the problem of outdated memory, authors suggested that particles
should do a global re-evaluation of the fitness function in order to ensure that their best
solution is still the same solution at each iteration of the search [Bla02].
As mentioned, a multi-swarm consists of a colony of swarms. In this adaptation, an
atomic analogy can be made of electrons moving around a nucleus: a swarm consists of
a normal PSO swarm and particles that swarm around it to find other good optima
within the same region. The idea is to place a swarm on each good local optimum within
the entire domain. The particles swarming around a peak can keep track of how and
when it changes, hence allowing an up-to-date recollection of the best optimum (a
solution to the outdated memory problem). These swarms can then work together to
converge at the global optimum [Bla02, Fon09].
7.3 Evolutionary Computation
Section 7.2 introduced the concept of swarms and particles by way of an optimisation
technique that performs a heuristic search of the problem domain. Evolutionary
Computation (EC) is a branch of Computational Intelligence that is based on the
Darwinian concept of evolution. More specifically, it relies on the theory of survival of
the fittest, which states that creatures keep the features that allow them to survive in
their environment and, conversely, shed the features holding them back [Kar08].
Systems developed on the concepts of EC will employ evolutionary processes including
natural selection, reproduction, mutation and the survival of the fittest. They often make
use of populations of individuals that are evolved in a search space [Eng07].
7.3.1 Genetic Algorithms
A Genetic Algorithm (GA) is a stochastic search technique originally described by
Holland in 1975 and is based on the Darwinian theory of natural selection. The
biological theory states that organisms adapt in order to survive and this evolution is
driven by the notion of allowing the fittest organisms to survive and the weaker ones to
die off [Sim13].
The following sections will describe the important aspects of the GA:
i. Search Space
The search space X must be represented in an appropriate way for the GA to be able to
traverse it. Hence each vector in X is represented by a string s of length l, composed of
elements from an alphabet A, which defines the mapping [Hau04, Sim13]:

c : A^l → X

This yields an appropriate search space:

S ⊆ A^l

The length l of the string depends on the dimensions of both X and A. In the context of a
GA this mapping is called the genotype-phenotype mapping; the elements of the
mapping are called genes and their actual values are called alleles.
Thus the GA employs a population of strings representing possible solutions in the
search domain; these strings are often referred to as chromosomes, and it is common
practice to use binary strings as representations [Sim13]. These chromosomes are
subject to certain operations that will be defined next.
ii. Crossover
The crossover of a pair of chromosomes produces child chromosomes that
inherit characteristics from both parents. This is a simple process of replacing a set of
genes in one parent with those of the other parent. Figure 7.3 depicts two binary string
chromosomes with a randomly selected crossover point. The sub-sets of genes after this
point are swapped to form the children [Hau04, Aff09, Sim13].
1010110010101 011100100010 1011001010011 010100101010
1010110010101 010100101010 1011001010011 011100100010
Figure 7.3: Crossover operation of two Chromosomes [Eng07]
Randomly chosen crossover point
1st parent’s genetic code 2nd parent’s genetic code
1st offspring’s genetic code 2nd offspring’s genetic code
Chromosomes after crossover
iii. Mutation
Mutation operates on a single chromosome and produces a child that is nearly
identical to its parent, as only a few genes are changed. In this operation a subset of
genes is chosen at random and the allele values of those selected are changed. For a
binary-mapped string this is as simple as taking the complement of the bits. Referring
to Figure 7.3, the 1st parent’s genetic code with genes 3 and 5 selected will become:
1000010010101011100100010 [Hau04, Eng07, Aff09, Sim13].
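The two operators can be illustrated on the binary strings of Figure 7.3; the function names below are my own, and the sketch assumes 1-based gene positions as used in the mutation example above.

```python
# One-point crossover and bit-flip mutation on binary-string chromosomes,
# as described in this section (a minimal illustrative sketch).
def crossover(parent1, parent2, point):
    # Swap the gene subsets after the crossover point.
    child1 = parent1[:point] + parent2[point:]
    child2 = parent2[:point] + parent1[point:]
    return child1, child2

def mutate(chromosome, points):
    # Flip (complement) the alleles at the selected gene positions (1-based).
    bits = list(chromosome)
    for p in points:
        bits[p - 1] = '0' if bits[p - 1] == '1' else '1'
    return ''.join(bits)
```

Applied to the parents of Figure 7.3 with the crossover point after gene 13, `crossover` reproduces the two offspring shown in the figure, and `mutate` with genes 3 and 5 reproduces the mutation example in the text.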
iv. Fitness function
As is the case with the PSO described in the previous section, the GA also requires a
fitness function to evaluate each chromosome’s fitness [Tra09]. The fitness function
and its results are used in the stopping conditions for the GA, which will be defined
next.
v. GA algorithm
The GA makes use of the operators defined above to manipulate a set of chromosomes
and can be expressed as the following algorithm [Aff09, Sim13]:

select initial population of chromosomes;
while termination conditions not met do
    repeat
        if crossover condition met then
            select parent chromosomes;
            choose crossover parameters;
            perform crossover;
        if mutation conditions met then
            choose mutation points;
            perform mutation;
        evaluate fitness of children;
    until sufficient children created;
    select new population;
endwhile
This pseudo-algorithm serves as a basis for any GA implementation and refers to user-
specified crossover and mutation conditions.
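A runnable Python counterpart of this pseudo-algorithm is sketched below. The truncation selection scheme, the rate values and the one-max fitness used to exercise it are my own illustrative choices, not prescribed by the pseudo-algorithm.

```python
import random

def flip(bit):
    return '0' if bit == '1' else '1'

def mutate_bits(chromosome, rate):
    # Flip each allele independently with a small probability.
    return ''.join(flip(b) if random.random() < rate else b for b in chromosome)

def run_ga(fitness, length=16, pop_size=20, generations=60,
           crossover_rate=0.9, mutation_rate=0.02):
    # Select initial population of chromosomes (random binary strings).
    population = [''.join(random.choice('01') for _ in range(length))
                  for _ in range(pop_size)]
    for _ in range(generations):  # termination: a fixed generation budget
        children = []
        while len(children) < 2 * pop_size:  # until sufficient children created
            p1, p2 = random.sample(population, 2)
            c1, c2 = p1, p2
            if random.random() < crossover_rate:  # crossover condition met
                point = random.randrange(1, length)
                c1, c2 = p1[:point] + p2[point:], p2[:point] + p1[point:]
            children.extend([mutate_bits(c1, mutation_rate),
                             mutate_bits(c2, mutation_rate)])
        # Select new population: truncation selection on the children.
        population = sorted(children, key=fitness, reverse=True)[:pop_size]
    return max(population, key=fitness)
```

For example, `run_ga(lambda s: s.count('1'))` evolves towards the all-ones string under the one-max fitness function.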
7.4 Conclusion
Chapter 7 has introduced two CI methods, namely the PSO and the GA. The PSO is
based on swarm intelligence and draws on the biologically inspired notion of flocks,
whereby particles move through a search space to find optimal solutions.
The GA is based on the notion of evolution, whereby candidate solutions to the problem
are defined and manipulated with biologically inspired operators in order to find
optimal solutions within the search space.
As discussed in both Chapters 5 and 6, the CI-based techniques that were introduced in
this chapter form the basis of the classification and clustering processes that will be
used in the EPS4DM model.
Chapter 8 serves as an overview of the proposed EPS4DM model and presents the data
pre-processing operations based on the algorithms defined within this chapter.
[Dissertation roadmap diagram: Chapters 1–13 arranged around the EPS4DM model]
Chapter 8
The EPS4DM model
8.1 Introduction
The data mining pre-processing steps discussed in Chapter 5 that will be examined in
this chapter can be grouped into clustering, and classification as a dimensionality
reduction technique. Chapters 6 and 7 introduced the tools required to construct the
pre-processing system that is defined in this chapter.
The purpose of this chapter is to provide an architectural overview of a data pre-
processing system to transform data in an intelligent manner to enhance its suitability
for data mining operations. The Extract, Pre-process, Save for Data Mining (EPS4DM)
model, which will perform the pre-processing tasks required on a chosen dataset, is
introduced. The model also transforms the dataset into formats that can be accessed by
data mining algorithms from a data mart. Both the data clustering and classification
techniques require the relational data to be denormalised into feature vectors.
This chapter also elaborates on how multi-agent technology is used to orchestrate the
processes of extracting the relational data into the formats required by the classification
and clustering processes and how the results of these sub-processes are used to
maintain the data mining mart.
The next section will provide a comparison between the traditional ETL processes that
are used and the one proposed by the EPS4DM model.
8.2 Traditional ETL compared to the EPS4DM model
As discussed in Chapter 5, ETL processes are required to extract data from relational
tables, transform them into the formats required (a star schema) and then do a bulk
load of the current live data snapshot into the warehouse. These tasks are usually run
once a day. When data mining tasks must run, a selection of the source data and where
it needs to be aggregated from must be determined before the pre-processing tasks can
be run. This is depicted on the left-hand flowchart in Figure 8.1.
Figure 8.1: Traditional ETL compared to the EPS4DM model
Figure 8.1 depicts the traditional ETL process on the left-hand flowchart where it can be
seen that the data selection and pre-processing tasks occur late in the process as it is
driven by the need to perform data mining on the existing data. The right-hand
flowchart depicts the proposed EPS4DM model’s approach whereby the selection of
data and the pre-processing thereof are done beforehand. The EPS4DM model proposes
that the data mining pre-processing tasks be set up to run directly from the relational
database. By storing the processed data, the rule extraction algorithms have a
repository that they can run from. This process is depicted in the right-hand flowchart
of Figure 8.1.
Sections 8.3 and 8.4 provide an overview of how agent technology is used to orchestrate
this process.
8.3 Clustering
As defined in Chapter 5, clustering is the process of detecting subsets of “similar” data.
The exact nature (or structure) of the data is not known; in other words the
programmer does not know how to classify the data as part of a specific class, and hence
the aim is to group it into clusters that have similar characteristics. This section will
provide an overview of a semi-supervised PSO-based data clustering technique to
extract data from multiple source tables in a relational database and to effectively
aggregate this data into the data store.
As discussed in Chapters 4 and 5, the relational database is based on relations. The
target format for a data mining algorithm is fundamentally different as it requires data
to be stored in an attribute-value format. Effectively this constitutes a denormalisation
of the original relational structure and this is the first step in the clustering process as
defined in Figure 8.2.
The clustering process can be broken down into three constituent parts, namely:
Dataset selection
Data extraction and denormalisation
Cluster centre calculations and dataset clustering
8.3.1 Data extraction and denormalisation
In the first step, the user selects the set of target tables from the relational database
which contain the data that will be clustered. A key concept that will be discussed in
Chapter 9 is that of the associativity between target records. This is shown in Figure 8.3.
The target tables will inevitably contain links to data in non-target tables and users wish
to break down this high degree of associativity for pre-processing purposes.
The relational structure as depicted in Figure 8.3 needs to be transformed into a set of
feature vectors that will become the search space for the clustering process.

Figure 8.2: The clustering process as used in the EPS4DM system (relational database,
data extraction, data transformation, feature vectors, cluster centre calculations,
outlier detection (cluster purity), clustered data)
Figure 8.4 shows the structure of the data extraction and denormalisation process.
There is an active RDBMS interface which is set up to monitor the selected tables in
the database. When its trigger conditions are met (for example, each time a certain
number of records has been added, or when a specific time of day is reached) it will
signal to the data extraction agent that the data in this dataset structure needs to be
extracted.
Figure 8.3: Associativity between target records [Alf07, Ray10]
Figure 8.4: Deriving feature vectors from relational data
The data extraction agent will build an in-memory model of the required relational
structure for this specified dataset. This model is then passed to the data
transformation agent, which will produce the feature vectors for this dataset and
persist them for use by the clustering agent. The data change signal is a core element,
as it will trigger the
data extraction agent which will then orchestrate a new set of feature vectors to be
derived for the dataset.
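The extraction and transformation step can be sketched as follows; the tables, column names and aggregate features below are invented purely for illustration of how linked relational rows collapse into one attribute-value vector per target record.

```python
# A minimal sketch of denormalising related records into feature vectors.
# The tables and column names are invented for illustration.
customers = {1: {"age": 34, "region": "ZA-GP"},
             2: {"age": 51, "region": "ZA-WC"}}
orders = [
    {"customer_id": 1, "amount": 120.0},
    {"customer_id": 1, "amount": 80.0},
    {"customer_id": 2, "amount": 300.0},
]

def build_feature_vectors():
    # Follow the foreign key, aggregate the linked rows and emit one
    # attribute-value vector per target record.
    vectors = []
    for cid, cust in customers.items():
        linked = [o["amount"] for o in orders if o["customer_id"] == cid]
        vectors.append({
            "customer_id": cid,
            "age": cust["age"],
            "order_count": len(linked),
            "total_amount": sum(linked),
        })
    return vectors
```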
8.3.2 Cluster centre calculations and dataset clustering
As mentioned in Chapter 5, the first part of producing a clustered dataset is to find
cluster centres. These centres form the base reference points around which the data
points are grouped into clusters. A PSO-based cluster centre computation is proposed
and the details of this will be discussed in Chapter 10.
The PSO algorithm will effectively produce a set of cluster centres for the set of feature
vectors which was derived in the extraction process overviewed in Section 8.3.1. The
Euclidean distance between each data point and its cluster centre is used to determine
which cluster the data point belongs to.
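The nearest-centre assignment can be sketched as follows; in practice the centres would come from the PSO of Chapter 10, and the function name here is my own.

```python
import math

# Assign each feature vector to the cluster whose centre is nearest by
# Euclidean distance (an illustrative sketch).
def assign_clusters(vectors, centres):
    clusters = {i: [] for i in range(len(centres))}
    for v in vectors:
        nearest = min(range(len(centres)),
                      key=lambda i: math.dist(v, centres[i]))
        clusters[nearest].append(v)
    return clusters
```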
Figure 8.4 gives an overview of the way in which the relational data is first
denormalised by building the feature vectors, which are persisted for the given dataset
so that the clustering agent can use them as its source.
Figure 8.5: The data clustering process
Figure 8.5 depicts the data clustering process that will orchestrate the clustering
process:
A cluster count K is read from the configuration data.
The clustering controller fetches the required vectors. A call is then made to
the PSO-based clustering agent, which will derive the K cluster centres and
place the feature vectors in the K clusters.
The clustering controller runs the cluster quality checks to determine how
well the data has been clustered.
The controller then decides either to adapt the value of K, recalculating the cluster
centres and re-clustering the dataset, or to persist the results to the clustered data
store.
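The controller's decision loop can be sketched as follows; the K-adaptation policy (simply incrementing K), the quality threshold and all function names are my own illustrative assumptions, since the model leaves these choices open.

```python
# Sketch of the clustering controller's decision loop.
def control_clustering(vectors, cluster_fn, quality_fn,
                       k=2, k_max=8, threshold=0.8):
    while k <= k_max:
        centres, clusters = cluster_fn(vectors, k)  # PSO-based agent call
        quality = quality_fn(clusters)              # cluster quality check
        if quality >= threshold:
            return centres, clusters                # persist the results
        k += 1                                      # adapt K and re-cluster
    return centres, clusters
```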
8.4 Classification
As introduced in Chapter 5, classification is the process of mapping a specific data
element to a pre-determined class. In order to achieve this, the data needs to be
compared to a known set of classes.
When the number of variables in each dataset is very high this becomes a challenge as
the high-dimensionality of data, as discussed in Section 5.4.4, exponentially increases
the processing time required. A linear technique to address this issue was introduced
in Section 5.5.1, namely Principal Component Analysis (PCA), whereby the most
prominent components of the dataset are selected.
This section will introduce the proposed model for a Genetic Algorithm-based data
dimensionality reduction solution for the classification process that will be defined in
Chapter 11. This process is defined in two parts, namely: Feature vector creation and
calculating the reduced set.
8.4.1 Building the feature vector
As discussed in Chapter 7 the first part of setting up a GA is to determine the
chromosome mapping and in the case of relational data this is done by determining a
feature vector.
Figure 8.6 indicates the process of feature normalisation which will normalise the
values of a feature vector and save the class associated with the feature (in the case of a
training data set).
Figure 8.6: Feature Vector normalisation
The feature normalisation process will make use of the feature vectors derived from
the process described in Section 8.3, but for the classification process a class label
needs to be assigned to each feature vector. This labelled set will be used as training
data.
Figure 8.7: Building the normalised feature vector
Figure 8.7 shows the creation of the normalised feature vector. The feature
normalisation agent will select the features from the feature vector store and normalise
the data. It will then persist the normalised vectors, in combination with their class
labels (when available), for the given dataset in the feature vector store, which is
exposed via an interface.
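A feature normalisation step as in Figure 8.6 can be sketched with min-max scaling; the choice of the [0, 1] range and the function name are my own illustrative assumptions.

```python
# Min-max normalisation of feature vectors to [0, 1], keeping the class
# label (when available) alongside each vector (an illustrative sketch).
def normalise(vectors, labels=None):
    dims = len(vectors[0])
    lo = [min(v[j] for v in vectors) for j in range(dims)]
    hi = [max(v[j] for v in vectors) for j in range(dims)]
    normed = [
        [(v[j] - lo[j]) / (hi[j] - lo[j]) if hi[j] > lo[j] else 0.0
         for j in range(dims)]
        for v in vectors
    ]
    if labels is None:
        return normed
    return list(zip(normed, labels))  # pair each vector with its class label
```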
8.4.2 The classification process
As discussed in Section 5.5, dimensionality reduction is the process of either selecting
or extracting an optimal set of features that still accurately describes the dataset.
Formally, it can be defined as: given a set of A-dimensional input patterns, produce a
transformed set of B-dimensional patterns, where B < A.
This is achieved by a transformation matrix that maps the A dimensions to the B
dimensions; thus the goal of the dimensionality reduction algorithm is to discover this
mapping matrix.
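The mapping can be sketched as a plain matrix-vector product. The matrix H used in the example below, which simply drops one feature, is invented for illustration; in the model, the GA of Chapter 11 would evolve the entries of H.

```python
# Mapping an A-dimensional pattern to a B-dimensional pattern via an
# A-by-B transformation matrix H (a plain-Python illustrative sketch).
def transform(pattern, H):
    # pattern: length-A list; H: A x B matrix; result: length-B list.
    A, B = len(H), len(H[0])
    assert len(pattern) == A
    return [sum(pattern[i] * H[i][j] for i in range(A)) for j in range(B)]
```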
The goal of the GA is to produce a transformation matrix that will ensure that the
original set of features gets mapped to the reduced set. The GA must maintain a
population of competing feature transformation vectors which will map the original
features to a reduced set.
Figure 8.8 depicts the process through which the feature vector serves as an input
pattern A to the GA. The GA will build and evolve the transformation vectors H_1 .. H_z,
where z is the population count of the GA. The new set of features, B, is calculated from
the chosen transformation vector H_i and fed to the classifier.
Figure 8.8: Building the transformation vectors [Far07]
The classification process that is driven by the classification controller is shown in
Figure 8.9:
The value of K, the number of neighbours the classifier examines, is read from the
configuration data.
The classification agent retrieves the training data set, with class labels, as
normalised feature vectors from the feature vector store. It also retrieves the set
of vectors that has to be classified.
The classification agent maintains a set of weighted transformation vectors that
it uses to transform the dataset, and sends the resultant vectors to the classifier,
which will classify the set against a known test set for the dataset.
The controller then performs the classification purity checks and determines
whether the GA needs to run again to build a more optimal reduced vector set.
Figure 8.9: The classification process
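The classification step can be sketched with a simple K-nearest-neighbour majority vote over the (transformed) labelled training vectors; the data and function names below are invented for illustration, and the actual classifier used by the model is defined in Chapter 11.

```python
import math
from collections import Counter

# Classify a vector by majority vote among its K nearest labelled
# training vectors (an illustrative K-nearest-neighbour sketch).
def knn_classify(vector, training, k=3):
    # training: list of (feature_vector, class_label) pairs.
    nearest = sorted(training, key=lambda t: math.dist(vector, t[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```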
8.5 Overview of the data mining mart
As discussed in Sections 8.3 and 8.4 there are two types of pre-processed data that are
maintained in the data mining mart, namely clustered and classified data. Once these
processes have been set up for the required tables, the triggers will allow the new data
to be calculated and persisted in the store.
Figure 8.10 depicts the data mining mart with the pre-processed data derived from each
process. The cluster centres and feature vectors are transient data required by the sub-
processes and act as triggers for new data changes in the relational database. Once the
trigger conditions are met the transient data is calculated and stored, allowing the
classification and clustering processes to be run to produce new up-to-date pre-
processed data.
Figure 8.10: The data mining mart
8.6 Conclusion
Chapter 8 has introduced the EPS4DM model which incorporates the data pre-
processing tasks of clustering and classification as means of dimensionality reduction.
The EPS4DM model aims to extract the data required from a relational data subset
chosen by the user and produces the required feature vectors that describe the search
space. The clustering and classification tasks are then run on the chosen dataset and the
resultant clustered data as well as the possibly reduced classified set are stored in a
data mining mart. When significant data changes occur on the target data subset, the
agents will trigger the update of the feature vectors and the pre-processing tasks that
are run on the new set to update the data mining mart. This effectively allows the data
mining algorithms to use the mart as their repository, meaning they can access
pre-processed data as and when it becomes available. Compared with the traditional
approach, in which the required data is aggregated only once the organisational decision
is made to pursue data mining, this allows for a more active approach to data mining.
Chapter 8 served as an introduction to the components defined in the pre-processing
system, namely data extraction, clustering and classification. The data mining mart was
also described in this chapter, showing how the transient data is recalculated when
changes made to the relational database trigger the pre-processed data to be
re-processed.
The following three chapters describe each of the three sub-tasks in more detail.
Chapter 9 describes the data extraction process in detail and depicts the construction of
the feature vectors. Chapter 10 provides a computational and mathematical model for
the PSO-based clustering process. Chapter 11 describes the feature vector creation
process and explains the computational and mathematical models for the GA-based
classification process.
[Chapter map: Chapters 1–13 of the dissertation, locating Chapter 9 (data extraction and denormalisation) within the EPS4DM model]
Chapter 9 The EPS4DM’s data extraction process
9.1 Introduction
As defined in the preceding chapters, the traditional method of populating a data
warehouse with production data is to follow the ETL process. In this process data is
categorically extracted, transformed into the appropriate format and loaded into the
target warehouse. This process specifically requires data from multiple tables to be
loaded into a single table via the transformation process.
The inherent problem with this method is that the source tables are subject to a high
degree of one-to-many associations. The traditional transformation processes result in a
loss of accuracy directly proportional to the number of tables and the degree of
associativity between them.
Chapter 8 provided an overview of the EPS4DM model that handles the data extraction
from a relational table. The concept of a feature vector was also introduced. Chapter 9
will provide more detail on this process as well as an overview of how data will be
extracted from multiple source tables in a relational database.
9.2 Mapping relational structures
As discussed in Chapter 2, dimensionality reduction is a phase of the data mining
process in which the size of the target data is reduced in order to allow a rule engine to
effectively run on a smaller dataset. As discussed in Chapters 4 and 5 the relational
database is built on relations. The target format for a data mining algorithm is
fundamentally different as it requires data to be stored in an attribute-value format.
Effectively this constitutes a denormalisation of the original relational structure.
The target table for any specific data mining process will inevitably contain records that
are associated with one or more records in other non-target tables.
A high degree of associativity results in a very large target dataset often subject to the
multiple-instance problem (where an instance refers to a row in a table) [Alf07, Ray10].
In the target table, a single record R_i can have a high degree of associativity with other
records stored in the non-target tables, as shown in Figure 9.1.
Figure 9.1: Associativity between target records combined as feature vectors [Alf07].
Let R denote a dataset from the target table and T denote the associated n records in the
non-target table with T = (t_1, t_2, t_3, …, t_n). Let T_i be a subset of T, T_i ⊆ T, which is
associated via a foreign key to a single record R_i in the target table. The non-target
table can thus be described as a table consisting of rows of objects, where a subset of the
rows can be linked to a single object in the target table. The vector space model allows
the records associated with a target record to be represented as a feature vector,
informally known as bags of patterns as illustrated on the right-hand side in Figure 9.1
[Alf07, Kuo09, Ray10].
There are two types of data that can be represented in a feature vector, namely numeric
and mixed data. Sections 9.2.1 and 9.2.2 provide an overview of each.
9.2.1 Numeric data
When the target data in question is numeric in nature the feature vector is constructed
directly from the data. The features of the target table in question are aggregated and
form the initial feature vectors X:

X = {x_1, x_2, …, x_m}

Each feature vector x_i is of size N, where N is the number of features in the dataset.
When numeric data is aggregated it is normalised into the range [1.0, 10.0], where w_i is
a weight associated with the feature x_i. The following normalisation formula is used to
perform the weighting of each feature:

x' = ((x − x_min) / (x_max − x_min)) × 9 + 1
When dealing with numeric data, only the weighted feature set must be used. The
reasoning is that weighting exposes the outliers in the data: when features are closely
bundled together, differentiating between classes is more difficult.
Figure 9.2 depicts data points of three different classes: the image to the left is mapped
as normalised values whereas the image on the right contains a weighted normalised
data set [Ray00].
Figure 9.2: The effect of feature weights to expose outliers [Ray00]
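The normalisation formula above can be sketched in Python. This is an illustrative sketch, not the dissertation's implementation; the treatment of the weight w_i as a simple multiplier is an assumption based on the surrounding text.

```python
def normalise_feature(values, weight=1.0):
    """Map raw numeric feature values into [1.0, 10.0] via
    x' = ((x - x_min) / (x_max - x_min)) * 9 + 1, then apply a feature weight."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # guard against a constant column
    return [((v - lo) / span * 9.0 + 1.0) * weight for v in values]

print(normalise_feature([20, 65, 110]))  # → [1.0, 5.5, 10.0]
```

The minimum value of a feature maps to 1.0 and the maximum to 10.0; a weight greater than 1.0 stretches the feature's range, which is what exposes outliers as in Figure 9.2.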
9.2.2 Mixed-type data
When the target table contains non-numeric values such as Boolean or type descriptors,
the normalisation process described in Section 9.2.1 does not apply. The feature vector
is constructed by representing each value as an integer that represents the index of each
feature from the target table. Figure 9.3 represents a feature vector that contains a mix
of numeric, string and Boolean data represented by the feature indexes [Far07, Pei98].
The problem of multiple instances arises when tables are denormalised: denormalisation
may produce more than one instance of the same unique record, which essentially results
in a larger dataset to search.
The following section provides a method to deal with multiple instances for a feature.
1 35 21 12 8
Figure 9.3: Example feature vector with indexes to complex types
9.2.3 Handling multiple instances
In a simple database, a target record R_i can be associated with a non-target record
that has a single attribute. In this case the single-associativity approach can be used, in
which the generated feature contains only one attribute mapping [Alf07, Atr, Ray10]. As
seen in Figure 9.4, the non-target Income_Stats table contains age, annual income, a
foreign key for the country and an income class.
Income_Stats (source table):
  Age | Annual Income | Country | Income Class
  25  | 75k           | 1       | 1
  27  | 85k           | 2       | 2
  25  | 75k           | 1       | 1

Mapped feature patterns:
  Age | Income | Country Mapping | Income Class
  25  | 75k    | 20              | 1
  27  | 85k    | 11              | 2
Figure 9.4: Example of a feature pattern being built
The cardinality of the non-target domain needs to be determined first; this is the
number of unique values that a specific attribute can take. In this case the cardinality
is two, as there are two options available in the Country column. The non-target
sequences need to be encoded as binary numbers in order to ascertain the appropriate
number of bits required.
Country lookup table: ID 1 = USA; ID 2 = UK
The following can be derived for n:

2^(n−1) < FeatureDomain ≤ 2^n

Calculating from the data presented in Figure 9.4: 2^0 < 2 ≤ 2^1, so n = 1. This results in
the binary mapping (0, 1) representing the country options USA and UK. The pattern for
each instance of the feature is maintained for the non-target table: a counter is kept for
each foreign-key entry, either incrementing the count of an existing pattern or adding
the pattern to the vector if it does not already exist. The number prefixed to each
pattern represents the frequency of its occurrence. Referring to Figure 9.4, it can be
seen that the two entries for the duplicate income class 1 record have been grouped into
one record [Alf07, Atr, Ray10].
Calculating the number of bits required to represent the feature instances is crucial, as
it determines which part of the representation holds the counter value. In the simple
example given, the duplicate entry for country USA has been allocated a country code of
20, where the 0 represents the country feature and the 2 the instance count. Since n = 1,
the last n digits of the mapping represent the feature and the remaining digits are used
as the counter. If the value were 1000, the first three digits would serve as the counter,
implying 100 instances of feature 0.
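The counting scheme above can be sketched in Python. This is an illustrative sketch: the digit-concatenation encoding (count prefix followed by an n-bit feature code) is inferred from the worked example in Figure 9.4, and the function and argument names are invented.

```python
from collections import Counter
from math import ceil, log2

def encode_patterns(fk_rows, domain_size):
    """Build count-prefixed feature patterns for one non-target attribute.

    fk_rows maps a target-record key to the list of attribute values associated
    with it via the foreign key. Each distinct value gets an n-bit binary code,
    and each pattern is prefixed with its occurrence count.
    """
    n = max(1, ceil(log2(domain_size)))  # bits needed to encode the feature domain
    codes = {}                           # attribute value -> integer code (0, 1, ...)
    patterns = {}
    for key, values in fk_rows.items():
        counts = Counter(values)         # collapse duplicate instances into counts
        encoded = []
        for value, count in counts.items():
            code = codes.setdefault(value, len(codes))
            encoded.append(int(f"{count}{code:0{n}b}"))  # count prefix + feature bits
        patterns[key] = encoded
    return patterns

# Figure 9.4: record 1 references USA twice, record 2 references UK once.
print(encode_patterns({1: ["USA", "USA"], 2: ["UK"]}, domain_size=2))
# → {1: [20], 2: [11]}
```

The duplicate USA rows collapse into the single pattern 20 (count 2, feature 0), matching the Country Mapping column in Figure 9.4.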
9.2.4 Multiple instances with more than one feature
A real-world dataset will almost certainly have a non-target table that contains multiple
attributes. A similar approach to the handling of multiple instances described above is
followed, but a feature mapping is built for each referenced feature.
9.3 Process overview
Figure 9.5 provides an overview of how the agents in the EPS4DM model, as described
in Chapter 8, handle the data extraction process, which is the first step of
pre-processing.
Figure 9.5: Data extraction process
9.4 Conclusion
Chapter 9 has provided an overview of the proposed data extraction and
denormalisation processes used by the EPS4DM model. This chapter has shown that the
data required by both the clustering and classification tasks can be derived from a
relational table by means of building a feature vector.
The nature of the data in the target table will determine the type of feature vector that
can be built. There are essentially two types of feature vectors, namely numeric and
non-numeric data. When numeric data is used as input to build the feature vectors, the
data will be normalised in order to expose outliers. In the case of non-numeric data two
options are presented: using the index of each feature as the value, or, when the dataset
could contain multiple instances, using the pattern approach to define the feature. The
pattern approach can also reduce the dataset at the outset, as multiple instances of the
same record are aggregated.
As discussed in Chapter 8 the feature vectors are saved in the data mining mart. The
pre-processing tasks of clustering and classification that both rely on the feature vector
will be described in Chapters 10 and 11.
Chapter 10
EPS4DM’s clustering process
10.1 Introduction
Chapter 9 provided an overview of the way in which the feature vectors are derived
from a relational database. These vectors are then persisted in the data mining mart.
Chapter 10 will provide an overview of K-Means clustering, show how the PSO-based
clustering algorithm based on [Sou04, Kom11] is derived, and give an overview of
cluster quality measures. K-Means is a local minimisation method, which means it relies
on its view of the K nearest neighbours in order to determine which cluster a feature
vector belongs to. Local minima are often found which results in feature vectors from
more than one class being grouped together. The PSO-based clustering algorithm
attempts to avoid these local minima by making use of the information provided by the
swarm as a whole as opposed to the K nearest neighbours.
This chapter provides an overview of the way in which cluster centres are determined,
the way in which distances between clusters and feature vectors are calculated and an
in-depth explanation of the PSO algorithm used to cluster the data.
10.2 Determining cluster centres
The basic concepts relating to clustering were described in Chapter 5. As discussed in
that chapter, the first step in producing a clustered dataset is to find cluster centres.
These form the base reference points from which other datum points are grouped into
clusters. Finding cluster centres is considered an NP-hard problem [Gho06]. This
section will give an overview of the K-Means algorithm that forms the basis of the PSO-
based technique.
10.2.1 Traditional K-Means algorithm to find cluster centres
K-Means makes use of the squared 2-norm distance (more commonly known as
Euclidean distance) between points in order to find the minimised cluster centres. An
important aspect of the algorithm is that the number of clusters and the initial cluster
centres need to be chosen before the algorithm commences [Deb06].
Let X = {x_1, x_2, …, x_m} be a subset of points in the dataset, with K being the number of
clusters we wish to produce as W = {ω*_1, ω*_2, …, ω*_K}; the goal is to minimise:

Q(W) = Σ_{j=1}^{K} Σ_{x_i ∈ ω_j} ‖x_i − c_j‖

where Q(W) is the summation of the norm of the distance between each datum point
and its cluster centre. Each cluster j has centre c_j, defined as the centroid of cluster j:

c_j = (1/n_j) Σ_{x_i ∈ ω_j} x_i

with n_j being the number of points in cluster j [Deb06].
The goal is to find K clusters, with each cluster defined as ω*_j, which minimises the
distance between each datum point and its cluster centre based on Euclidean distances
[Deb06].
The K-Means algorithm is summarised in the next section by providing its initial
conditions and computational steps.
i. Initial conditions
Step 1 – An arbitrary partitioning of the data W(0) into K clusters is made and the initial
cluster centres are defined. For each point the cluster centre with the closest Euclidean
distance is found. If a tie occurs between two clusters, it is resolved via a random
assignment to one of the cluster centres.
ii. Computational steps
Step 2 – A new set of cluster centres C(t) is calculated from the previous set by
computing the centroid of each cluster. The centroid is the point that minimises the
total distance of all points in that cluster to itself, and for cluster j it is computed as
[Deb06]:

C_j(t) = (1/n_j) Σ_{x_i ∈ ω_j(t)} x_i
Step 3 – If a stopping condition is met, then the final partitioning is represented by
W(t) and the final cluster centres by C(t). If not, t is incremented by 1 and steps 1 – 3
are repeated.

A common stopping condition is to calculate the difference between successive
solutions for Q and test whether it falls within an acceptable error tolerance range ε:

|Q(t) − Q(t − 1)| ≤ ε
K-Means is a local optimisation algorithm designed to optimise clustering of data based
on the squared Euclidean distance between a point and its closest cluster centre [Ye05,
Deb06].
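The three steps above can be sketched as a compact K-Means implementation. This is an illustrative Python sketch, not the dissertation's code; random initial centres are drawn from the data points themselves, and the stopping condition used here is simply that the centres no longer move.

```python
import random

def k_means(points, k, iters=100, seed=0):
    """Plain K-Means: random initial centres, then alternate assignment and centroid update."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)  # Step 1: arbitrary initial centres
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:  # assign each point to its closest centre (squared Euclidean)
            j = min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centres[c])))
            clusters[j].append(p)
        # Step 2: recompute each centre as the centroid of its cluster
        new = [tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centres[j]
               for j, cl in enumerate(clusters)]
        if new == centres:  # Step 3: stop when the centres no longer change
            break
        centres = new
    return centres

pts = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
print(sorted(k_means(pts, 2)))  # → [(1.25, 1.5), (8.5, 8.75)]
```

On this toy dataset the algorithm converges to one centre per visually obvious group; with less well-separated data it can instead settle in a local minimum, which is the weakness the PSO approach of Section 10.3 aims to address.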
Section 10.3 provides an overview of the PSO algorithm used in the EPS4DM model to
cluster data.
10.3 PSO clustering algorithm
As discussed in Chapter 7 Particle Swarm Optimisation (PSO) is a stochastic search
algorithm that is inspired by the collective intelligence found within swarms, such as a
flock of birds or a shoal of fish. Particles move around within an n-dimensional search
space of the problem domain, each searching for increasingly better solutions. The
application of the technique and adaptations required to search for an optimal feature
subset are described in this section.
10.3.1 Search space
Let S be the dataset with X = {x_1, x_2, …, x_m} a subset of points in S, with K being the
number of clusters to be produced as W = {ω*_1, ω*_2, …, ω*_K}. For X the ith element x_i is
defined as a vector in the n-dimensional search space, with n ≥ 0 representing the
dimensionality of the original dataset S [Che04, Sou04].
149
10.3.2 Euclidean distances and centroid calculations
As discussed in Section 8.3.1 the relative Euclidean distance and centroid formulae are
required to determine cluster centres and the distances between data points and their
nearest cluster centre.
Cluster centres are determined by using the following centroid formula [Sou04]:

z_j = (1/n_j) Σ_{∀x_r ∈ c_j} x_r     (Equation 10.1 – Calculating cluster centres)

z_j defines the centre vector of cluster j for the space containing n_j particles.
x_r represents the rth data vector of S.
c_j represents the subset of data points making up cluster j.

The relative Euclidean distance between particles and their nearest cluster centre is
defined in the following way [Che04, Sou04]:

D(x_r, z_j) = ‖x_r − z_j‖ = √( Σ_{k=1}^{d} (x_rk − z_jk)² )     (Equation 10.2 – Relative Euclidean distance)

x_r represents the rth data vector of S.
z_j is the centre vector for cluster j, calculated using Equation 10.1.
d represents the dimensionality of each centre vector z_j and is effectively the
number of features in the vector.
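Equations 10.1 and 10.2 can be checked numerically with a small sketch (illustrative Python, not part of the dissertation):

```python
import numpy as np

def centroid(cluster: np.ndarray) -> np.ndarray:
    """Equation 10.1: z_j is the mean of the vectors in cluster j."""
    return cluster.mean(axis=0)

def euclidean(x: np.ndarray, z: np.ndarray) -> float:
    """Equation 10.2: Euclidean distance between data vector x and centre z."""
    return float(np.sqrt(((x - z) ** 2).sum()))

cluster = np.array([[0.0, 0.0], [4.0, 0.0], [2.0, 6.0]])
z = centroid(cluster)
print(z, euclidean(np.array([2.0, 6.0]), z))  # → [2. 2.] 4.0
```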
10.3.3 Initial population and particle encoding
A single particle in the search space represents a vector as a possible solution for K
cluster centres with dimensionality n in the space S. Thus particle p_i = (z_1, z_2, …, z_K),
where each z_j is a possible cluster centre [Sou04, Kom11]. It is recommended that
3 · K · n particles are chosen for the swarm population [Sou04].
Figure 10.1 represents a sample particle configuration with 푛 = 2 as the dataset’s
dimensionality and 퐾 = 4 as the anticipated number of cluster centres. Each particle
maintains 4 cluster centres as well as the 2-dimensional vectors v(2) of the search space
S which are part of that cluster [Qu10]. The initial particle positions and velocities
representing the cluster centres are chosen at random. The details of this process are
described in the next section.
Figure 10.1: PSO particle representing a solution in 2 dimensions with 4 clusters
10.3.4 Particle velocity and position update
As stated in Section 7.2.1 (see page 115), each particle's velocity is governed by a
cognitive and a social component, and particles are moved in the n-dimensional
hyperspace by adding the current velocity vector to the previous position vector.
In chaotic PSO, the r1 and r2 components are replaced by a chaotic random component
to ensure that particles can recover from local minima [Che04, Yan11].
Cr(t + 1) = 4.0 × Cr(t) × (1 − Cr(t))     (Equation 10.3 – Chaotic sequence variable)

The chaotic sequence variable is generated for each iteration t of the algorithm,
bounded by the following conditions [Yan11]:

Cr(0) = 0.001
Cr(t) ∉ {0, 0.25, 0.5, 0.75, 1.0}
Cr(t) ∈ (0, 1)
As discussed in Section 7.2.1 (see page 115), the velocity of each particle is updated per
social and cognitive component:

v_ij(t + 1) = w · v_ij(t) + c_ij(t) + s_ij(t)     (Equation 10.4 – Particle velocity update)

The velocity update occurs for particle i in dimension j of the domain at time interval t,
with the stochastic coefficients inside the cognitive component c_ij and the social
component s_ij drawn from the range [0, 1].
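The chaotic sequence and velocity update can be sketched as below. This is an illustrative sketch: the dissertation only states that r1 and r2 are replaced by the chaotic component, so the particular way the chaotic value cr and its complement (1 − cr) are used as the two draws is an assumption, as are the inertia and acceleration values w = 0.72 and c1 = c2 = 1.49.

```python
def chaotic_sequence(cr):
    """Equation 10.3: logistic map Cr(t+1) = 4.0 * Cr(t) * (1 - Cr(t))."""
    return 4.0 * cr * (1.0 - cr)

def update_velocity(v, x, pbest, gbest, cr, w=0.72, c1=1.49, c2=1.49):
    """Equation 10.4: inertia plus cognitive pull (towards the particle's own best)
    and social pull (towards the swarm's global best), scaled by the chaotic draw."""
    cognitive = [c1 * cr * (p - xi) for p, xi in zip(pbest, x)]
    social = [c2 * (1.0 - cr) * (g - xi) for g, xi in zip(gbest, x)]
    return [w * vi + ci + si for vi, ci, si in zip(v, cognitive, social)]

cr = 0.001  # Cr(0); the initial value must avoid the fixed points {0, 0.25, 0.5, 0.75, 1.0}
for _ in range(3):
    cr = chaotic_sequence(cr)
v = update_velocity([0.1, -0.2], [1.0, 1.0], [2.0, 0.5], [3.0, 2.0], cr)
print(round(cr, 6), len(v))
```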
10.3.5 Fitness evaluation for particles
An optimal fitness function is critical to ensure both the convergence and solution
quality of the swarm. The relative Euclidean distance can be used to find the aggregate
distances of all data points in the cluster to their closest cluster centre. An acceptable
error tolerance can be decided upon to allow the swarm to converge. The fitness of each
particle that must be compared to this tolerance value is calculated using:

fitness = Σ_{i=1}^{K} Σ_{x_j ∈ c_i} D(x_j, z_i)     (Equation 10.5 – Fitness function)

The fitness of a particle is thus the summation of the norms of the distances between
each datum point and its cluster centre, with i = 1, …, K and j = 1, …, n, where K is the
number of clusters, n is the number of data vectors, z_i is a cluster centre and x_j is the
data point under consideration [Sou04, Kom11].
10.3.6 PSO algorithm to calculate cluster centres
The PSO algorithm as adapted from [Kom11] needs to initialise the K cluster centres.
This is done by assigning each vector in the search space S as defined in Section 10.3.1
to a random cluster centre. A random local best fitness is also assigned. The following
steps are then repeated for each particle for a set number of iterations [Kom11]:
begin
Randomly initialize particle swarm
Randomly generate Cr(0)
repeat
Compute the fitness value of particle as per Equation 10.5 for each
particle in C
Compare the fitness of the particle to its local best solution and set
the value of local best to the current value if fitness is better
Compare the fitness of the particle to the current global best found
and set the value of global best to the current value if fitness is
better
Compute the new particle velocity and position
Compute the Euclidean distance between each data vector and its
nearest cluster centre as per Equation 10.2
Cluster the data vectors by assigning them to their closest cluster
centres
Recalculate cluster centres as per Equation 10.1
until stopping criterion is met
end
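The listing above can be rendered as a runnable sketch. This is illustrative Python, not the dissertation's implementation: plain uniform r1/r2 draws are used instead of the chaotic variant, the parameter values (w = 0.72, c1 = c2 = 1.49, swarm size, iteration count) are assumed defaults, and initial centres are seeded from randomly sampled data points.

```python
import random

def pso_cluster(data, k, n_particles=12, iters=60, w=0.72, c1=1.49, c2=1.49, seed=1):
    """Each particle is a flat vector of K candidate centres; fitness is the sum of
    distances from every data vector to its nearest centre (Equation 10.5)."""
    rng = random.Random(seed)
    dim = len(data[0])

    def dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5

    def fitness(pos_i):
        return sum(min(dist(x, pos_i[j * dim:(j + 1) * dim]) for j in range(k))
                   for x in data)

    # Random initialisation: each particle's K centres are distinct sampled data points.
    pos = [sum((list(p) for p in rng.sample(data, k)), []) for _ in range(n_particles)]
    vel = [[0.0] * (k * dim) for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pfit = [fitness(p) for p in pos]
    g = min(range(n_particles), key=pfit.__getitem__)
    gbest, gfit = pbest[g][:], pfit[g]

    for _ in range(iters):
        for i in range(n_particles):
            for d in range(k * dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            f = fitness(pos[i])
            if f < pfit[i]:          # update local best
                pbest[i], pfit[i] = pos[i][:], f
            if f < gfit:             # update global best
                gbest, gfit = pos[i][:], f
    return [gbest[j * dim:(j + 1) * dim] for j in range(k)], gfit

data = [(1.0, 1.0), (1.2, 0.8), (8.0, 8.0), (8.3, 7.9)]
centres, fit = pso_cluster(data, k=2)
print(len(centres), round(fit, 3))
```

Because the global best never worsens, the final fitness is at most that of the best initial particle; the swarm-wide information sharing is what lets candidate solutions escape the local minima that trap K-Means.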
10.4 Measuring cluster quality
The measure of cluster quality is dependent on two factors: 1) cluster density (i.e. the
closeness of each data point to its cluster centre) determined by using the final value of
the fitness evolution for the global best particle; and 2) the number of points that were
misplaced in the clustering process.
In order to calculate the number of misplaced points, a relative error rate can be
calculated by counting the occurrence of the post-clustered points being in the same
position as the pre-clustered points. An example of the latter is the counting of the
number of data points that were not moved at all (these refer to outliers which could
not be placed in a cluster – as mentioned in Section 5.3).
error = Σ_i e_i,  where e_i = 1 if vector i has not moved and 0 otherwise

e_i corresponds to the ith data vector of set S and is treated as a possible error vector.
As clustering is an unsupervised process, the resulting dataset is not guaranteed to
contain data from the same class of objects. A measure of supervision needs to be
introduced to the algorithm, and this is referred to as cluster impurity11. On
a high level, a set of clusters will be produced and then a new goal is introduced, namely
to minimise the linear combination of cluster dispersion12 and cluster impurity for the
specific dataset.
Select K ≥ 2 cluster centres from the computed set of clusters as z_k (k = 1, …, K) to
minimise an objective function with weighted impurity and dispersion metrics:

min_{z_1, …, z_K}  β · Cluster_Dispersion + ∂ · Cluster_Impurity

where β > 0 and ∂ > 0 are positive real parameters.
The algorithm can be adjusted to be a pure unsupervised learning algorithm by setting
∂ = 0, resulting in only dispersion being minimised.

11 Cluster impurity is the sum of data points within the cluster which actually belong to another cluster centre.
12 Cluster dispersion is the sum of the differences between the vectors and the cluster centre.
As discussed in Section 5.4.3 with the K-Means algorithm, points are assumed to belong
to their closest cluster centres based on the Euclidean distance.
Non-empty clusters are assigned labels based on the majority class of points that belong
to that cluster [Blo03, Dou08].
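The weighted objective above can be sketched numerically. This is an illustrative sketch under the footnote definitions: dispersion as the summed point-to-centre distances and impurity as the count of points not belonging to their cluster's majority class; the data and names are invented.

```python
def cluster_dispersion(clusters, centres):
    """Footnote 12: sum of distances between each vector and its cluster centre."""
    return sum(
        sum((x - z) ** 2 for x, z in zip(v, c)) ** 0.5
        for cl, c in zip(clusters, centres) for v in cl
    )

def cluster_impurity(cluster_labels):
    """Footnote 11: points per cluster that do not belong to the majority class."""
    return sum(len(lbls) - max(lbls.count(l) for l in set(lbls))
               for lbls in cluster_labels if lbls)

def objective(clusters, centres, cluster_labels, beta=1.0, delta=1.0):
    """beta * dispersion + delta * impurity; delta = 0 gives the purely
    unsupervised form."""
    return (beta * cluster_dispersion(clusters, centres)
            + delta * cluster_impurity(cluster_labels))

clusters = [[(1.0, 1.0), (1.0, 2.0)], [(8.0, 8.0)]]
centres = [(1.0, 1.5), (8.0, 8.0)]
labels = [["a", "b"], ["b"]]
print(objective(clusters, centres, labels))  # → 2.0 (dispersion 1.0 + impurity 1)
```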
10.5 Process overview
The PSO algorithm will effectively produce a set of cluster centres for the data that was
denormalised in the extraction process overviewed in Section 9.2. The Euclidean
distance between each data point and its cluster centre is used to determine which
cluster the data point belongs to. The PSO clustering algorithm will then firstly use this
transformed data to calculate K cluster centres for the data and then to make use of the
centres to cluster the data. The resultant clustered data sets are then persisted in the
data mining mart.
10.6 Conclusion
Chapter 10 has discussed the PSO-based approach as derived from [Sou04, Kom11] to
cluster feature vectors. As discussed in Chapter 8 the clustering controller allows the
clustering agent to retrieve the set of feature vectors from the data mining mart and to
run the clustering process on the retrieved set.
The PSO-based clustering agent maintains a population of possible cluster centres and
the feature vectors assigned to each cluster. Each particle represents a full solution to
the clustering problem. Initially feature vectors are randomly assigned to clusters and
the cluster centres are calculated based on that assignment. The swarm then makes use
of the fitness function in order to minimise the summation of the norm of the distance
between each cluster and its assigned feature vectors. At each iteration new cluster
centres are determined based on this fitness value, feature vectors are re-assigned to
cluster centres and the fitness value is again calculated to determine if better clusters
have been formed.
Making use of a PSO algorithm as opposed to the K-Means algorithm means that
possible local minima could be avoided as the swarm guides each candidate solution.
The clustering controller can then make use of the error rate to determine whether
sufficiently good clusters have been formed. Based on its decision it can re-run the
process or store the resulting clustered data in the data mining mart.
In terms of the EPS4DM model the clustering process described in this chapter provides
the system with a mechanism to perform the clustering tasks on a set of feature vectors.
Chapter 11 will provide the details of the GA-based classification sub-system.
Chapter 11
EPS4DM’s classification process
11.1 Introduction
Chapter 10 provided an overview of the way in which the PSO-based clustering
sub-system makes use of the data in the data mining mart in order to cluster the
selected data. This data is then persisted in the data mining mart. Both the clustering
and the classification tasks require feature vectors as their input. However, as will be
discussed in Chapter 11, the classification task additionally requires a training set in
which each feature vector has an associated class label [Can06].
This chapter will provide the details pertaining to the GA-based classification system as
derived from [Pei98, Far07]. The classification process makes use of a scalable K-
Nearest Neighbour (KNN) classifier that requires as input a training data set with
associated class labels as well as a testing set without class labels. The KNN classifier
attempts to classify the feature vectors based on the training set [Can06]. The purpose
of using a GA is to produce a transformation vector that can reduce the size of the
search space by assigning weights to each feature and then running the transformed set
through the KNN classifier.
Chapter 11 will also give the details of the input data, the chromosome mappings, the
mutation and crossover operators, the implementation of the GA and its parameters as
well as a KNN classifier that is used to assess the quality.
11.2 The GA Feature extractor
In order to gain an understanding of the construction of the GA and its components the
GA feature extractor will be broken down into its constituents. This process involves a
training data set x (if available) and a KNN classifier that will run a KNN rule to evaluate
the fitness of the transformed data set.
The feature extraction’s end goal is to produce a reduced data set for the classifier.
Introducing classification decision feedback to guide the feature extractor is a good
approach and is the approach that will be followed in the EPS4DM model. Figure 11.1
provides an overview of this GA/KNN approach and each operation will be discussed
[Pei98].
11.2.1 The weight vector
As seen in Figure 11.1, the initial population comprises randomly initialised weight
vectors. The weight vector w* is defined as a vector of length N = |x| generated by the
GA; it can equivalently be written as an N×N diagonal matrix W.
This matrix is used to multiply each feature vector x to produce a new vector y
according to a mapping function M(x). This equates to the linear transformation seen in
Figure 11.1.
11.2.2 Linear transformation matrix
y = M(x) = Wx

W = ⎡ w_1  0    ⋯  0   ⎤
    ⎢ 0    w_2  ⋯  0   ⎥
    ⎢ ⋮    ⋮    ⋱  ⋮   ⎥
    ⎣ 0    0    ⋯  w_N ⎦

or w* = [w_1, w_2, …, w_N], x = [x_1, x_2, …, x_N]
The transformation matrix can be used on two types of data, namely numeric data and
complex data.
When complex data types that contain indexes to features are used, the weight vector
is a Boolean bit mapping. Thus w_i ∈ {0, 1}, 1 ≤ i ≤ N, interpreted such that if the ith
weight is a 1 then that feature is deemed part of the feature vector.
When normalised numerical data is used the algorithm will search for a relative weight
over the defined normalised values. Therefore w_i ∈ [0.0, 10.0], 1 ≤ i ≤ N. This implies
that features are being selected in a linearly transformed space and that those elements
that are converging towards 0 are seen as not important and will eventually be
removed.
The resulting weights indicate that a particular feature is useful and this is referred to
as its discriminatory power. The approach of linear transformations is only applicable
when the data patterns or classes are linearly separable [Pei98].
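Both forms of the transformation y = Wx can be sketched as follows (illustrative Python; the data and weight values are invented for the example):

```python
import numpy as np

def apply_weights(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """y = M(x) = W x, with W the diagonal matrix built from the weight vector w*."""
    return np.diag(w) @ x

# Boolean bit mapping for indexed (complex) data: features with weight 0 are dropped.
mask = np.array([0.0, 1.0, 0.0, 1.0])
print(apply_weights(np.array([3.0, 5.0, 7.0, 2.0]), mask))      # elementwise: 0, 5, 0, 2

# Real-valued weights for normalised numeric data: weights converging towards 0
# mark features that will eventually be removed.
weights = np.array([0.1, 9.5, 4.0, 0.0])
print(apply_weights(np.array([2.0, 1.0, 3.0, 4.0]), weights))   # elementwise: 0.2, 9.5, 12.0, 0.0
```

Because W is diagonal, the matrix product reduces to an element-wise scaling of x, which is why the GA can equivalently evolve the flat weight vector w* rather than the full matrix.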
Figure 11.1: GA/KNN feature extractor [Pei98]
11.3 Chromosome mapping
As discussed in Chapter 7, the GA requires an appropriate representation of the search
space by way of a chromosome mapping. Chapter 9 described how feature vectors are
constructed from the relational tables. These vectors contain the information required
for the chromosome.
The feature vectors form the basis of the chromosome mappings that will be used.
When numeric data is used, the chromosomes contain real-valued weights as depicted
in Figure 11.3. When indexed data is used, the chromosome makes use of bit masking
in order to flag the features, as shown in Figure 11.2.
0 1 1 1
Figure 11.2: Bit mask used for chromosome mapping [Ray00]
2.5 3.5 … 7.8 1.9
Figure 11.3: Feature weight used as chromosome mapping
To maintain maximum flexibility, additional bit masks will also be conceived which
allow complex data types to be flagged as normalised data or other data [Ray00].
11.4 Crossover operator
As discussed in Chapter 7, the GA makes use of a crossover operator to splice the
genes of the parents in order to create children to fill the population. The feature
vector's data determines the type of data in the chromosome, either real-valued weights
or bit values. Regardless of the data type, these values are used to create the features
of the children.
Figure 11.4 shows how a random crossover point is selected in each parent and new
feature weight vectors are created as children [Eng07, Far07].
Figure 11.4: Crossover operation of mapped feature vectors [Eng07]
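The splice described above can be sketched as follows. This is an illustrative sketch, not the EPS4DM implementation; the cut points are passed in explicitly here so the operation is deterministic, whereas the model selects them randomly.

```java
// Sketch of the crossover described above: a crossover point is chosen
// in each parent and the heads and tails are exchanged to form two
// children. Cut points are parameters here for determinism.
public class Crossover {

    public static double[][] onePoint(double[] p1, double[] p2, int cut1, int cut2) {
        double[] c1 = splice(p1, cut1, p2, cut2); // head of p1 + tail of p2
        double[] c2 = splice(p2, cut2, p1, cut1); // head of p2 + tail of p1
        return new double[][]{c1, c2};
    }

    private static double[] splice(double[] head, int cutH, double[] tail, int cutT) {
        double[] child = new double[cutH + (tail.length - cutT)];
        System.arraycopy(head, 0, child, 0, cutH);
        System.arraycopy(tail, cutT, child, cutH, tail.length - cutT);
        return child;
    }
}
```

With the example vectors of Figure 11.4, cut points of 2 and 3 reproduce the children shown there.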
11.5 Mutation operator
As discussed in Chapter 7, the mutation operator is employed to introduce diversity into
the population to allow new feature vectors which were not previously there to be
created as possible solutions. This diversity allows the final algorithm to converge to the
optima.
The mutation operator will need to know what type of allele it is dealing with, and thus
the bit masking discussed earlier is utilised. If indexes are used, then values are mutated
to valid index values; if normalised data is used, then the mutation operator
randomly selects a valid value from this data to replace into the chromosome.
The mutation operator cannot introduce an index or a numerically mapped value that
does not exist as this will not map to a feature. Figure 11.5 depicts a valid mutation
operation on feature indexes [Far07, Eng09].
[Figure 11.4 content: parents (3 22 | 35 7 19) and (2 11 21 | 3 41 12 1 27) are spliced at the marked crossover points to produce children (2 11 21 35 7 19) and (3 22 3 41 12 1 27).]
Figure 11.5: Mutation operation of a single feature vector [Eng07]
11.6 Selection strategy
The GA requires that population members are carried to the next iteration of the
algorithm. These are the members that will be subject to the mutation and crossover
operators in the next generation. The selection strategy that will be used for the GA is
fitness proportionate selection (also known as roulette wheel selection) combined with
the elitism strategy13 [Deh08].
11.6.1 Roulette wheel selection
A fitness value is assigned to each chromosome feature vector after the GA operators
have completed the crossover and mutation.
This fitness value is then used to create the probability p_i of each chromosome being
selected, where f_i is the fitness of chromosome i and N is the population count:

    p_i = f_i / Σ_{j=1}^{N} f_j
This probability effectively normalises the selection of the chromosomes to 1 and, in a
similar fashion to how a roulette wheel spins in a casino, a random selection is made on
this probability to see which candidates survive. Chromosomes with a higher fitness are
more likely to be selected, but there is also a chance that less fit members make it
through.
The less fit chromosomes that survive provide an advantage as their individual genes
could contain possible valuable solutions for the generations ahead [Eng07].
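Roulette wheel selection as described can be sketched as follows (illustrative names; not the prototype's code):

```java
import java.util.Random;

// Sketch of fitness-proportionate (roulette wheel) selection: each
// chromosome i is selected with probability p_i = f_i / sum(f).
public class RouletteWheel {

    public static int select(double[] fitness, Random rnd) {
        double total = 0.0;
        for (double f : fitness) total += f;
        double spin = rnd.nextDouble() * total; // spin the wheel
        double cumulative = 0.0;
        for (int i = 0; i < fitness.length; i++) {
            cumulative += fitness[i];
            if (spin <= cumulative) return i;
        }
        return fitness.length - 1; // guard against rounding error
    }
}
```

Note that less fit members retain a non-zero chance of surviving, which is the behaviour the text relies on.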
13 Elitism allows some of the best solutions from the current generation to be promoted to the next one.
[Figure 11.5 content: parent (3 22 35 7 19) mutates into child (3 22 6 7 12); the mutated alleles 6 and 12 are valid feature indexes.]
11.6.2 Elitism strategy
In order to maintain good solutions, chromosomes with a very high fitness level will be
promoted to the next generation unaltered in order to determine in the next round if
their gene constructs also produce a high fitness. Some chromosomes might survive
unaltered for a few generations only to be mutated later [Eng07].
11.7 KNN rule and training data
As discussed in Chapter 5, the purpose of classification is to label a particular set of data
as part of a class. The KNN rule requires the class labels in order to perform the nearest
neighbour comparison. Training data is required as input to the KNN classifier in order
to measure classification error rates. After appropriate chromosomes have been
generated their fitness needs to be assessed, and this is where the KNN classifier comes
in.
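The KNN rule above can be sketched as a generic majority-vote nearest-neighbour classifier; this is an illustrative sketch, not the prototype's KNN class, and the names are assumed.

```java
// Sketch of the KNN rule used to assess a chromosome's fitness: a test
// pattern is labelled by majority vote among its k nearest training
// patterns, using Euclidean distance.
public class KnnClassifier {

    public static int classify(double[][] train, int[] labels,
                               double[] pattern, int k, int numClasses) {
        boolean[] used = new boolean[train.length];
        int[] votes = new int[numClasses];
        for (int n = 0; n < k; n++) {            // pick the k nearest, one at a time
            int nearest = -1;
            double best = Double.MAX_VALUE;
            for (int i = 0; i < train.length; i++) {
                double d = distance(train[i], pattern);
                if (!used[i] && d < best) { best = d; nearest = i; }
            }
            used[nearest] = true;
            votes[labels[nearest]]++;           // vote with the neighbour's label
        }
        int bestClass = 0;
        for (int c = 1; c < numClasses; c++)
            if (votes[c] > votes[bestClass]) bestClass = c;
        return bestClass;
    }

    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(sum);
    }
}
```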
11.8 Fitness function
The fitness evaluation of a feature vector is done by making use of a classifier. This
classifier maps data from a so-called pattern space (initial features) to a feature space
(final selected features) and then to a classification space (which are the class labels).
Thus

    G: X × W → Y → K
where X is the pattern space, W the weight vector (selected features), Y the feature
space (reduced pattern space) and K the set of class labels.
The classifier needs to be trainable according to a function

    J(w*): W → ℝ

where ℝ is the domain of real numbers. Therefore, the process that will be followed is
to find a weight vector w* ∈ W that minimises J(w*).
11.8.1 Scalable classifier
A scalable classifier is defined as one where, if the i-th feature is changed in the input
matrix X and a suitable vector w* is provided, the results are equal. Thus if

    x = [x_1, x_2, …, x_i, …, x_N],  x' = [x_1, x_2, …, x_i', …, x_N]

and the elements x_i ≠ x_i', then G(x, w) = G(x', w) and G is called a scalable
classifier [Pei98].
11.8.2 GA/KNN fitness function
As depicted in Figure 11.1, a GA is used to train a scalable KNN classifier to obtain the
optimal weight vector w*. The function J defined in the previous section is used as a
measure of the error rate of the classifier (the ability of the classifier to mark a pattern
as part of a class).
Each weight vector w* that is generated is used to create a new feature vector y, which
makes up a new, possibly reduced, feature space. The KNN is run in this space, computing
the weight J(w*) for each new vector y while the number of misclassifications is
stored. The w* vector that produces the fewest misclassifications will be the
optimal transformation matrix for the space.
The misclassification tolerance must be based on the error rate derived
from the training data set, and thus fitness is defined as

    Fitness = J(w*) = (number of misclassified patterns) / (total number of patterns)
This fitness function, however, merely examines the ability of the weight vector to
correctly classify patterns based on the performance on the test data set. It does not take
into account the inter-class separation of the labelled patterns, thus the introduction of
Neighbour Awareness14 (NA) forms part of the fitness function, defined as:

    Fitness = J(w*) + NA = γ (misclassified patterns / total patterns) + δ (nMin / K)

where nMin is defined as the number of nearest neighbours which were not labelled as
part of a class and K is the total number of nearest neighbours. γ and δ are dynamic
tuning parameters and their values will be discussed in the GA parameter section
[Pei98].

14 Neighbour awareness refers to the GA being aware of the vectors outside of a specified class.
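The fitness function, as reconstructed here, can be sketched directly; the method and argument names are assumptions.

```java
// Sketch of the fitness function defined above: a weighted sum of the
// classifier's error rate J(w*) and the neighbour-awareness term NA.
// gamma and delta are the tuning parameters from Table 11.1.
public class FitnessFunction {

    public static double evaluate(int misclassified, int totalPatterns,
                                  int nMin, int k,
                                  double gamma, double delta) {
        double errorRate = (double) misclassified / totalPatterns; // J(w*)
        double neighbourAwareness = (double) nMin / k;             // NA
        return gamma * errorRate + delta * neighbourAwareness;
    }
}
```

A lower value is better, so the GA minimises this quantity when evolving weight vectors.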
11.9 GA parameters
The selection of the GA parameters is explained in Table 11.1 [Pei98, Far07].
Table 11.1: GA parameters
    GA Parameter       Value
    Crossover rate     0.8
    Mutation rate      0.1
    Population size    100
    Generations        50
    γ (Fitness)        2.0
    δ (Fitness)        0.2
The results of the GA/KNN classifier as presented in Chapter 12 were used to
empirically derive the parameters in Table 11.1.
11.10 Conclusion
Chapter 11 has explained the specifics of the GA/KNN classifier process that will be
used in the proposed EPS4DM model as derived from [Pei98, Far07]. The process makes
use of the feature vectors as derived in the process described in Chapter 9 but requires
two distinct sets of data, namely a training set with class labels and a test set without
class labels. The goal of the GA is to produce a transformation vector that can be applied
to the original set. This highlights certain features as less or more important than others
based on the assigned weights. The GA maintains a set of these transformation vectors
which it uses to derive a new set of feature vectors to send to the KNN classifier. The
fitness evaluation of the GA is in essence the KNN classifier.
By combining the GA process of finding the optimal features with an actual classifier, the
fitness function drives the generation of chromosomes that contain the fewest
discriminatory features needed to still correctly label items from the test set. As
discussed in Chapter 8, the classification controller will obtain a single instance of the
search space as a set of training and testing feature vectors. The classification agent
evolves transformation vectors that are used to produce a new feature set to send to the
classifier. The controller can then decide, based on the results of the fitness function, to
evolve better transformation vectors. Otherwise, if the best transformation vector for
the dataset is found, the controller will persist the vector, as well as the final
classified dataset, to the data mining mart.
Chapter 11 has provided the details of the mechanism that will be used in the EPS4DM
model to perform data classification as well as the potential reduction of the size of the
search space by highlighting which features are discriminatory in terms of their
weighting.
Chapter 12 will provide an overview of the implemented EPS4DM model and will
present the results of the classification and clustering processes. The chapter will be
divided into three sections, namely: an overview of the implementation, the results of
the PSO-based clustering algorithm from Chapter 10 and the GA-based data reduction /
classification algorithm presented in this chapter.
[Chapter map: 1. Introduction; 2. Data storage evolution; 3. Database models; 4. Data warehousing; 5. Data mining; 6. Intelligent Agents; 7. Computational intelligence; 8. Data pre-processing system; 9. Data extraction and denormalisation; 10. Clustering; 11. Classification; 12. Implementation and Results; 13. Conclusion and future work. EPS4DM Model]
Chapter 12
Implementation and Results
12.1 Introduction
Chapters 8 to 11 presented an overview of the data pre-processing system envisioned
that will orchestrate the pre-processing tasks. Chapter 9 explained in what way data
will be transformed into the correct format, Chapter 10 introduced the PSO-based
clustering process, and Chapter 11 gave an overview of the GA/KNN-based classification
process.
This chapter will be divided into three main sections, namely: the Java based
implementation, the results of the PSO-based clustering algorithm that was discussed
in Chapter 10, and the GA-based data reduction / classification algorithm presented in
Chapter 11.
12.2 Java based implementation
The EPS4DM model that was discussed in Chapters 8 to 11 was implemented in Java 1.6
running on the JVM 1.6.32 and making use of the Spring framework and Hibernate for
DB mappings. All PSO and GA code was written in pure Java without the aid of any
additional frameworks. This section will provide an overview of the data
denormalisation methods implemented, as well as the PSO clustering and GA/KNN
classification methods used.
12.2.1 Data denormalisation and helpers
The data retrieval and denormalisation processes are provided by means of helper
classes that connect to the required tables in a MySQL database and perform extract and
transformation tasks. The format of the data differs for clustering and classification
tasks. Figure 12.1 illustrates the CalcHelper as well as the DataReader classes that
perform the read tasks and provide methods to calculate the Euclidean distances.
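A minimal sketch of the two calculations such a helper provides follows; the class and method names are assumed and do not necessarily match the prototype's actual signatures.

```java
// Minimal sketch of the calculations a helper such as CalcHelper
// provides: Euclidean distance between feature vectors and the mean
// (centroid) of a set of vectors. Names and signatures are assumed.
public class CalcHelperSketch {

    public static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(sum);
    }

    public static double[] mean(double[][] vectors) {
        double[] centre = new double[vectors[0].length];
        for (double[] v : vectors)
            for (int i = 0; i < centre.length; i++) centre[i] += v[i];
        for (int i = 0; i < centre.length; i++) centre[i] /= vectors.length;
        return centre;
    }
}
```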
The Iris dataset was chosen as a UCI standard dataset to assess the feature vector
classification. The dataset is multivariate, contains real values with four features and
has three class labels (Addendum A contains the details of the dataset). It is an
performance.
An example input feature vector format for the Iris dataset in a clustering process is
provided in Table 12.1, listing three feature vectors with four features.
Table 12.1 Clustering data sample from Iris dataset
    Sepal length   Sepal width   Petal length   Petal width
    5.1            3.5           1.4            0.2
    4.9            3.0           1.4            0.2
    4.7            3.2           1.3            0.2
An example input feature vector format for the Iris dataset in a classification process is
given in Table 12.2, listing three feature vectors with four features and a class
denomination for each of the three classes in the set.
Table 12.2 Classification data sample from Iris dataset
    Sepal length   Sepal width   Petal length   Petal width   Class
    4.7            3.2           1.3            0.2           1 (Setosa)
    6.9            3.1           4.9            1.5           2 (Versicolor)
    7.1            3.0           5.9            2.1           3 (Virginica)
12.2.2 PSO-based Clustering
The PSO-based clustering technique as discussed in Chapter 10 was implemented by
way of utilising the standard components of the global-best PSO algorithm and adapting
it to optimise cluster centres and not just numerical values.
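The adaptation described above amounts to applying the standard gbest velocity and position updates to every coordinate of every cluster centre in a particle. A sketch, with parameter names following Section 12.4.1 (w, c1, c2); the class and method names are assumed.

```java
import java.util.Random;

// Sketch of the global-best PSO update adapted to cluster centres: each
// particle's position is a set of K centres, and the standard velocity
// and position updates are applied per centre coordinate.
public class PsoClusterUpdate {

    public static void step(double[][] position, double[][] velocity,
                            double[][] personalBest, double[][] globalBest,
                            double w, double c1, double c2, Random rnd) {
        for (int k = 0; k < position.length; k++) {          // each cluster centre
            for (int d = 0; d < position[k].length; d++) {   // each dimension
                double r1 = rnd.nextDouble(), r2 = rnd.nextDouble();
                velocity[k][d] = w * velocity[k][d]
                        + c1 * r1 * (personalBest[k][d] - position[k][d])
                        + c2 * r2 * (globalBest[k][d] - position[k][d]);
                position[k][d] += velocity[k][d];
            }
        }
    }
}
```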
Figure 12.2 provides an overview of the classes required to perform clustering. A
Particle is extended to a ClusterParticle which contains as data a set of Cluster
centres. Each Cluster contains all the elements as a DoubleArray of the feature vector
created in the format presented in Section 12.2.1.
Figure 12.1: Helper classes within the EPS4DM prototype
CalcHelper also has methods to calculate the cluster means to determine cluster
centres for the elements which have been assigned to them. The PSOClustering class is the
PSO algorithm implementation that contains a set of ClusterParticles as its swarm.
Each ClusterParticle represents a full solution of the search domain; thus it contains K
clusters that are assigned elements from the input set. The class also contains a method
to save the resultant global best solution to the data mining mart.
An example of the format in which the clusters are saved is provided below. This
illustrates three clusters with two feature vectors of dimension four assigned to them:
[5.1,3.5,1.4,0.2] [5.4,3.9,1.7,0.4]
[6.7,3.0,5.0,1.7] [7.3,2.9,6.3,1.8]
[4.9,2.4,3.3,1.0] [5.2,2.7,3.9,1.4]
Any algorithms or systems that require this data can retrieve it via an interface and
make use of the clustered data.
Figure 12.2: PSO clustering classes within the EPS4DM prototype
The following section provides an overview of the classes required to perform the
classification.
12.2.3 GA/KNN Classification
The GA/KNN classification process as discussed in Chapter 11 was implemented by
making use of a standard GA’s components as well as a KNN classifier. A KNN classifier
was created as per Figure 12.3 which takes as input a training data set in the format
described in Section 12.2.1 for classification. This data set has a class label. It also takes
a test data set that is a similar dataset without a class label. The KNN class also
contained a method to perform the normalisation techniques that were discussed in
Chapter 11 which maps the input values to a range [1.0,10.0]. Added to this it contains a
method for performing the nearest neighbour search function in order to assign class
labels to the test data set.
The GA requires an appropriate candidate on which to perform the mutation and cross-
over operations. The ClassificationCandidiate class as seen in Figure 12.3 provides
this functionality by maintaining the w* weight vector defined in Chapter 11 as its
genotype. It also contains a selection mask that is used when working with mixed-type
data. Furthermore, it includes a KNN classifier that is used by its fitness function. It
makes use of the weight vector or selection mask to create a new dataset to run through
the KNN classifier.
The GA algorithm was implemented in the class GAClassification as can be seen in
Figure 12.4. It contains a set of ClassificationCandidates as its population, each of
which represents a weight vector or selection mask that can be used to reduce the
original dataset. It also contains a method for saving the weight vector or feature mask
in combination with the classified dataset to the data mining mart.
The following section describes the hardware-based infrastructure used to perform the
test cases.
12.3 The test setup
The machine used for these tests is specified in the following way: AMD Athlon X2 555
(Dual Core) running at 3.2GHz (6MB cache) with 8GB of DDR3 667 MHz memory on
Windows 7 Ultimate 64-bit. The system also has a SATA II Solid State Drive (SSD) on
which the JVM as well as the RDBMS were running. However, the EPS4DM system can
scale to a platform which supports distributed computing. It should also be noted that
both the PSO and GA algorithms could be re-worked to operate on a General-Purpose
Graphics Processing Unit (GP-GPU) based system for increased performance.
The following two Sections 12.4 and 12.5 will discuss the results obtained from the PSO-
based clustering and GA/KNN classification processes.
12.4 PSO-based clustering results
As discussed in Chapter 10, the clustering process's end result is to store the clustered
data in the data store. To demonstrate the PSO algorithm's ability to perform the tasks
required by the EPS4DM's clustering process, the results are reported on in Sections
12.4 and 12.5.
Figure 12.3: KNN classifier and GA chromosome classes used within the EPS4DM
prototype
Figure 12.4: GA algorithm and result classes used within the EPS4DM prototype
12.4.1 PSO parameters
Sections 10.2.3.1 and 10.2.3.2 described the setup parameters for the PSO-based
clustering algorithm. For these experiments the following parameters were used:
c1 = c2 = 1.5, K = 5 and w = 0.1.
12.4.2 Cluster centres
In order to gauge the PSO algorithm's ability to find cluster centres, a dataset of five
spherical clusters was randomly generated with 50 points each, adding up to 250
data points, as can be seen in Figure 12.5. The K-Means algorithm was run on the set and
the resultant cluster centres were noted. Then the PSO algorithm was run to determine
whether it also found the centres. As can be seen in Figure 12.5 the PSO algorithm was
successful in calculating the cluster centres of the test dataset.
12.4.3 Cluster purity
As discussed in Section 10.2.3.6 the cluster purity (or error rate) is used for outlier
detection on the data. Another simple test was devised to ensure that the PSO algorithm
incorporates post-classified positions of elements. Figure 12.6 depicts a dataset with
two spherical clusters, the left one having 400 elements and the right one 100 elements.
Figure 12.5: K-Means and PSO cluster centre analysis
[Figure 12.5 panels: 5 generated clusters; K-Means cluster centres; PSO cluster centres]

As can be seen in Figure 12.6, the K-Means algorithm uses nearest neighbour search
only and thus clusters elements from the left cluster into the right cluster due to the
Euclidean distances. The PSO algorithm makes use of the effective clustering error rate
(number of miss-placed points) to implement outlier detection and thus clusters the
elements into two distinct spheres.
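One plausible reading of this misplaced-point count is a purity-style measure: a point counts as misplaced if it does not carry the majority label of its assigned cluster. The sketch below is an assumption about the exact definition, not the prototype's implementation.

```java
// Illustrative sketch of a clustering error rate: the proportion of
// points whose label does not match the majority label of their
// assigned cluster (a simple count of misplaced points).
public class ClusterErrorRate {

    public static double errorRate(int[] assignedCluster, int[] trueLabel,
                                   int numClusters, int numLabels) {
        int[][] counts = new int[numClusters][numLabels];
        for (int i = 0; i < assignedCluster.length; i++)
            counts[assignedCluster[i]][trueLabel[i]]++;
        int misplaced = 0;
        for (int c = 0; c < numClusters; c++) {
            int majority = 0, total = 0;
            for (int l = 0; l < numLabels; l++) {
                total += counts[c][l];
                if (counts[c][l] > majority) majority = counts[c][l];
            }
            misplaced += total - majority; // points outside the majority label
        }
        return (double) misplaced / assignedCluster.length;
    }
}
```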
12.4.4 Feature vector analysis
As discussed in Chapter 9, relational database tables require data transformation in
order to get the data into the required format.
Figure 12.6: K-Means and PSO cluster centre analysis with error rate feedback (panels: two generated clusters; K-Means cluster centres; PSO cluster centres)
The method discussed is called a feature vector approach, which constructs binary
mappings of table attributes and their occurrences. The best way to assess the accuracy
of the feature vector approach would be to cluster a known multivariate dataset using
the PSO algorithm, and then to construct relational tables to represent the dataset and
then run the process on that.
12.4.4.1 Iris dataset control set
Both K-Means and the PSO algorithm were run on the Iris dataset and the results of
various factors are listed in this section. The experiments were run 30 times on the
dataset using each algorithm and the results are the averages of these runs. A standard
deviation is noted as well.
Table 12.3: Intra-cluster distances on the Iris dataset
    Factor   K-Means   PSO
    Avg      106.90    97.05
    (Std)    (14.30)   (0.08)
    Best     96.89     96.98
Table 12.4: Error rates on the Iris dataset
    Factor   K-Means    PSO
    Avg      17.95%     11.50%
    (Std)    (11.02%)   (2.04%)
    Best     11.02%     10.00%
The goal of this test was to provide a control measurement of the accuracy obtained by
clustering the Iris dataset from the raw data.
12.4.4.2 Iris dataset as Relational table
Addendum A provides sample structures of two relational views of the Iris dataset. The
feature vector construction task was firstly run on this set in order to obtain the
required mappings, and then the PSO algorithm was run on it. As can be seen in Tables
12.5 and 12.6, the effective intra-cluster distances and error rates are very close, if not
identical, to those of the raw dataset. The differences can be ascribed to the stochastic nature of
the PSO algorithm.
Table 12.5: Intra-cluster distances on the Iris dataset for simple relational mapping
    Factor   PSO (Raw)   PSO (Simple table)
    Avg      97.05       97.10
    (Std)    (0.08)      (0.082)
    Best     96.98       97.01
Table 12.6: Error rates on the Iris dataset for simple relational mapping
    Factor   PSO (Raw)   PSO (Simple table)
    Avg      11.50%      11.51%
    (Std)    (2.04%)     (2.04%)
    Best     10.00%      10.00%
As can be seen in Tables 12.5 and 12.6, the effective intra-cluster distances and error
rates for the simple mapping are very close, if not identical, to those of the raw dataset.
The differences can be ascribed to the stochastic nature of the PSO algorithm and the more
complex mappings produced for the highly relational structure. Tables 12.7 and 12.8 list
the effective intra-cluster distances and error rates for the complex relational mapping.
Table 12.7: Intra-cluster distances on the Iris dataset for complex relational mapping
    Factor   PSO (Raw)   PSO (Simple table)   PSO (Complex table)
    Avg      97.05       97.10                97.20
    (Std)    (0.08)      (0.082)              (0.083)
    Best     96.98       97.01                96.99
Table 12.8: Error rates on the Iris dataset for complex relational mapping
    Factor   PSO (Raw)   PSO (Simple table)   PSO (Complex table)
    Avg      11.50%      11.51%               11.51%
    (Std)    (2.04%)     (2.04%)              (2.06%)
    Best     10.00%      10.00%               9.97%
12.4.4.3 Randomised multivariate structures
Table 12.9 summarises the results obtained from the randomised multivariate runs.
Table 12.9: Fitness values and cluster centres for 3 sets of random data

Dataset generated: 3 Centres, 2 Attributes (2D)
    PSO (Raw):        fitness 0.00076876; centres (68.5436, 11.0967), (8.5423, 10.4508), (37.1243, -5.997)
    PSO (Relational): fitness 0.00077830; centres (69.0002, 10.995), (8.7765, 10.012), (36.9996, -5.943)

Dataset generated: 3 Centres, 2 Attributes (2D)
    PSO (Raw):        fitness 279.4876; centres (1.0245, 3.7645), (-0.5412, 0.3554), (4.2769, 4.002)
    PSO (Relational): fitness 279.6523; centres (1.115, 3.8871), (-0.4999, 0.3667), (4.254, 4.1321)

Dataset generated: 2 Centres, 3 Attributes (3D)
    PSO (Raw):        fitness 5321.2456; centres (6.8443, 5.2212, 3.2213), (0.7754, 9.1235, 10.3456), (27.0032, 66.9856, 0.3224)
    PSO (Relational): fitness 5315.2409; centres (6.9432, 5.2140, 3.1476), (0.7105, 9.2106, 9.9986), (26.9965, 68.0045, 0.3665)

As can be seen, the fitness values obtained for the relational mappings are very close to
those obtained on the raw dataset. The fitness values are the intra-cluster distances and
are specific to each generated dataset.
The aim of the feature vector analysis was to prove that the process can cluster multi-relational
data very closely to the way that the raw data would be clustered. To this end,
multivariate datasets and the resulting relational structures were generated and tested to
document the actual cluster vectors generated. The generated raw data and structures
were stored to allow multiple tests to be repeated on the same set. The Raw and
Relational processes were run 30 times on each set.
12.4.4.4 Feature vector conclusion
By way of conclusion, it is clear that through a controlled test of known data and user-designed
relational tables the feature vector approach yielded results similar to the raw
data. The set of randomised raw and relational structures confirmed the control set’s
observations.
12.5 Clustering conclusion
As can be seen from the results presented there were two tests to consider. First to
consider was the PSO algorithm’s ability to cluster more accurately (by making use of
outlier detection) than the K-Means algorithm. For the Iris dataset it was noted that the
K-Means algorithm tended to cluster the data into two clusters while the PSO algorithm
managed to cluster the data into three clusters, which more accurately depicts the
search space. Secondly, the feature vector approach, which essentially denormalises the
relational structures into vectors of attributes (terms henceforth used interchangeably),
and the clustering performed on the resulting patterns were tested. It can be concluded that the
PSO algorithm performed well in terms of clustering accuracy and that the feature
vectors approach introduced some deviance between the raw data and relational data.
This could be ascribed to both the stochastic nature of the PSO algorithm itself and the
complex vector structures produced by high-dimensional tables. The resultant table
structures and cluster centres are stored in the data mart.
12.6 GA-based Classification results
As discussed in Chapter 11, the GA/KNN classification process has dimensionality
reduction at its disposal, allowing only the most relevant features of a dataset to be
included for analysis. The results of these experiments will be presented in Sections
12.7 to 12.10.
12.7 GA algorithm parameters
As discussed in Chapter 11, the GA-based algorithm requires a number of parameters.
Their values are listed in Table 12.10.
12.8 Randomised datasets
The first thing to analyse was the ability of the GA-based algorithm to identify
significant features and give them the appropriate weights. With reference to Chapter
11, the identification of the transformation matrix w* was the goal. The matrix w* was
then used to compute the reduced dataset by eliminating the features not required. To
this end an approach similar to that of [Pei98] was followed, where a feature matrix was
generated. This feature matrix incorporates known significant features based on other
features. Features 1-3 and 9-12 contain stochastic values in the range [0.0, 10.0] while
features 4-8 contain generated expressions which incorporate other fields, selected from
the other features, in their calculations. This mapping ensures that the observers know
which features are important and which are not. Table 12.11 represents an example of
such a generated dataset where Fi is a feature selected from 2-8. Effectively this means
that features 2-8 are significant to the dataset and that features 1 and 9-12 are not.
Table 12.10: GA-based algorithm parameters
    GA Parameter       Value
    Crossover rate     0.8
    Mutation rate      0.1
    Population size    100
    Generations        50
    γ (Fitness)        2.0
    δ (Fitness)        0.2
The generated feature matrix was stored so that multiple iterations could be run and tested.
The first goal was to generate the w* transformation matrix. Thus 45 permutations of
each class V-Z were generated for each of the 12 features based on the generated matrix,
resulting in 540 patterns of this randomised set. This set of 540 patterns was used as a
training set to calculate w*. The results of an example run of such a set are seen in Table
12.12.
The fittest w* transformation matrix was selected from the training set and was used to
generate reduced datasets for the 1000 permutations generated from the same feature
matrix used to calculate w*. The fitness function uses 8 bits to represent each weight
component, w_i, in the weight vector w*, based on the function defined in Section 11.8.2.
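The 8-bit encoding of a weight component can be sketched as a linear decoding into the normalised weight range [0.0, 10.0] from Section 11.2. The linear scheme itself is an assumption, as the dissertation does not spell out the decoding.

```java
// Sketch of decoding an 8-bit gene into a weight component. The target
// range [0.0, 10.0] follows the normalised weight range of Section
// 11.2; the linear mapping is an assumption.
public class WeightDecoder {

    public static double decode(int gene) {          // gene in 0..255
        return 10.0 * gene / 255.0;
    }

    public static double[] decodeAll(int[] genes) {
        double[] w = new double[genes.length];
        for (int i = 0; i < genes.length; i++) w[i] = decode(genes[i]);
        return w;
    }
}
```

Under this scheme a 96-feature chromosome occupies 96 * 8 = 768 bits, as noted for the high-dimensional dataset in Section 12.10.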
Table 12.11: Generated Feature matrix
    Class   Features 1-3   Features 4-8              Features 9-12
    V       {0,1}          {0,1} * [Fi]              {0,1}
    W       {0,1}          {0,1} * [Fi] + {0,1}      {0,1}
    X       {0,1}          {0,1} * [Fi] - {0,1}      {0,1}
    Y       {0,1}          {0,1} * [Fi] + {0,1}*10   {0,1}
    Z       {0,1}          {0,1} * [Fi]              {0,1}
Table 12.12: Transformation matrix w* values

    w* value   F1     F2      F3      F4      F5      F6     F7      F8      F9-12
    Selected   0      1       1       1       1       0      1       1       0
    Weight     0.000  3.2677  2.7643  5.6753  6.4343  0.000  2.5454  8.1143  0.000
The KNN classifier was compared to the GA/KNN classifier by running both 10 times on
each set of 540 patterns, generating the best w* and then calculating the transformed
dataset. It was concluded that the KNN classifier produced an average 4.5%
classification error rate, while the GA/KNN classifier had about a 3.1% error rate.
The goal of this test was to assess the GA/KNN algorithm's ability to identify significant
features from a feature matrix with known significant features. The average running
time to compute the w* matrix for this example is about 10 minutes. The calculation of
the reduced feature matrix is pure matrix multiplication.
12.9 UCI-standard datasets
In order to gauge the classification accuracy of real-world datasets the Iris and Glass
datasets were compared. In the previous section a test dataset was generated by
allowing possible values from the set range to be generated. In these examples one has
the given set of data with the Iris set containing 150 exemplars and the Glass set 214
exemplars. Suitable values could not be randomly generated, hence the idea was to use
a subset of the data as training data and the remainder as test data. A 50/50 holdout
split was used, so 75 of the Iris exemplars and 107 of the Glass exemplars were used as
training sets.
An example for the Iris dataset is given to illustrate the testing process followed for
real-world datasets. The class labels are setosa, versicolor and virginica. The
pre-processing task to consider was that the user knew which elements should be in
which class, and the goal was then to perform dimensionality reduction of the features
while still yielding the same results. Hence 25 random exemplars from each class
were used to train the GA/KNN classifier and yield the best w* transformation matrix for
the Iris dataset. This was run 30 times. Then the w* transformation matrix was used to
calculate the reduced set from another 25 random exemplars from each class. This was
done 1000 times and each resultant reduced feature set was tested for classification
accuracy. Table 12.13 summarises the average results of both the Iris and Glass
datasets.
As can be seen in Table 12.13, the GA/KNN approach, by producing the reduced dataset
from the w* transformation matrix, classifies the set with a lower error rate than the
KNN approach. The runtimes stated in the table are for a single iteration of each
algorithm's run.
These datasets have been extensively researched and the purpose of this test was to
determine if the GA/KNN classifier was able to produce a reduced dataset which also
more accurately classifies the datasets. As observed, the transformation vector produced
by the GA highlighted that features 3 and 4 are significantly more important to the Iris
dataset than features 1 and 2. Thus the conclusion can be drawn that features 1
and 2 could possibly be omitted from the dataset to reduce the search space size.
Table 12.13: Iris and Glass datasets
    Dataset   KNN error rate   KNN time (s, single run)   GA/KNN error rate   GA/KNN time (s, single run)
    Iris      7.1245%          0.547643                   5.24365%            0.42481
    Glass     3.6894%          0.600121                   2.48032%            0.45632
12.10 High-dimensional datasets
Most relational databases have structures with a high number of dimensions that will
result in very big feature vector mappings. The real test for the GA/KNN approach
would be to subject it to a known dataset of high dimensionality. To this end the 2, 4-D
dataset as described in Addendum A was used. This dataset has 96 features which
results in a 96-bit chromosome mapping. As each weight component of the w* matrix
requires 8 bits, the calculated weight values for all 96 features in the resultant
chromosome occupy 96 * 8 = 768 bits while the GA is running. As with the Iris
example the training set must be a subset of the original set, but due to the high number
of features and the low number of exemplars all of the original exemplars were selected and
run through the GA/KNN classifier. Due to the number of features this was only run once.
The resultant best w* transformation matrix was used to calculate the reduced set;
this was applied to a random selection of one tenth of the original set and run no more than
10 times. The accuracy results are summarised in Table 12.14, which depicts the error rates
of the results.
On inspection of the resultant dataset for this example it was found that 65 of the
original features were disregarded for being 0-weighted. This effectively provided a
67% reduction in the search space. Significantly, the results show that on the fuller
dataset the error rate was lower. This means that a more accurate classification can be
made based on the number of exemplars inspected, with the downside being
performance. The original feature vector and reduced set are saved in the data mart.
Table 12.14: Error rate of the 2, 4-D dataset

Total patterns | K-Means | GA/KNN
23             | 8.3452% | 1.9965%
232            | 6.6785% | 0.82567%
12.11 Classification conclusion
It can be concluded that the GA/KNN classification approach does outperform the KNN approach in terms of both accuracy and the reduction of the dataset. The major problem arises as the number of features increases, since the computation time required increases as well.
12.12 The data mining mart
As discussed in Chapter 8, the purpose of the overall process is to store the centroid vectors and reduced feature vectors for the analysed data structures. The reason is that, as long as the data structures do not change, the addition of a small percentage of new data to the set does not require the set to be reprocessed: the same reduced feature vectors or cluster centroids can be used by the data mining algorithms to produce rules.
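The reuse described above can be sketched as a cache keyed on the structure's schema. The class and key scheme here are hypothetical illustrations, not the EPS4DM implementation.

```python
class DataMiningMart:
    """Minimal sketch: cache cluster centroids and reduced feature
    vectors per data structure, keyed by a schema fingerprint, so that
    unchanged structures need not be reprocessed."""

    def __init__(self):
        self._store = {}

    def schema_key(self, table_name, columns):
        # The fingerprint changes whenever the structure changes.
        return (table_name, tuple(sorted(columns)))

    def save(self, key, centroids=None, reduced_vectors=None):
        self._store[key] = {"centroids": centroids,
                            "reduced_vectors": reduced_vectors}

    def lookup(self, key):
        # Cached results, or None if the structure is new or has changed.
        return self._store.get(key)

mart = DataMiningMart()
key = mart.schema_key("iris", ["sepal_l", "sepal_w", "petal_l", "petal_w"])
mart.save(key, centroids=[[1.5, 0.3], [4.9, 1.7]])
```

A structural change produces a new key, so the stale centroids are simply never found and the pre-processing tasks run again.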
12.13 Conclusion
This chapter has presented the results of the PSO-based clustering and GA-based classification algorithms. The results show that both methods are more effective than their classical counterparts in performing the pre-processing tasks.

The feature vector construction approach used to denormalise relational structures proved to yield near-identical clustering results to processing the raw datasets. It was therefore accepted as a method of denormalising relational structures for clustering.
The GA/KNN classifier constructs a single feature vector from all the required attributes. Thus the more complex and higher-dimensional the relational structure, the more processing time is required to produce the reduced dataset.

The major concern in this chapter was the computational performance on high-dimensional datasets, as well as the correlation between the number of examples in the training set and the accuracy with which the set can ultimately be reduced.
Chapter 13 provides an overview of the dissertation, discusses whether the research
questions were answered and raises possible future work.
[Figure: chapter roadmap of the dissertation, showing Chapters 1 to 13 of the EPS4DM model]
Chapter 13
Conclusion and future work
13.1 Introduction
In this dissertation a data pre-processing model named EPS4DM was presented. This
model collated the required relational data as decided upon by a user, performed the
pre-processing tasks and stored the processed data in a data mining mart. This enables
the data mining rule extraction algorithms to use this mart as an input source for their process.
The main idea behind the research was to give an overview of the current state of production OLTP databases and data warehousing structures as the inputs to the data mining process. A further goal was to investigate, and inform the reader of, the current data mining pre-processing tasks and the way in which computational intelligence (CI) based algorithms coupled with multi-agent technology can manage these tasks.

Another goal of the dissertation was to investigate the viability of implementing agents embedded with CI to orchestrate the pre-processing tasks in order to maintain a data mining mart. Chapter 12 presented concrete results showing that, once a dataset has been chosen, the system can run both the clustering and classification tasks to completion and store their results in the data mart without user intervention. The GA/KNN classifier's performance on complex data types is discussed while answering the research questions.
13.2 Evaluation
The denormalisation of the chosen database structures for the clustering process was of primary importance. The data transformation component, named feature vector construction, was therefore analysed by comparing the accuracy of clustering with both the raw data and the feature vectors as input. The results indicated that the approach is an acceptable manner in which to automate the denormalisation of the required relational structures.

The clustering performance of the PSO-based algorithm was assessed on both generated and known real-world datasets, using the patterns produced by the transformation process. It compared favourably with the K-Means algorithm. The automation of the clustering process is also sound, as the cluster centres for the input parameters were generated and stored in the data mining mart.
The classification performance of the GA/KNN algorithm was likewise assessed on generated and high-dimensional real-world datasets. It reduced the feature space well, but revealed a correlation between dimensionality and runtime performance, which is addressed in the following section. The answers to the research questions posed in Chapter 1 are discussed in Section 13.3.
13.3 Research-driven questions that had to be answered by the dissertation
Question 1: Why are OLTP system databases structured the way they are, how is
this data stored and in what way does data warehousing differ from
data mining?
Chapters 2 and 3 introduced data storage concepts from their first inception to the current state-of-the-art approaches. Early single programs used files to persist whatever data they required. As systems became more complex and multiple programs began to interact with each other using the same data, the need arose for a database: a single storage place serving multiple programs. When the need arose for increased access to these databases, the concept of database clustering was introduced to allow access to multiple instances of databases.

The most profound change was the introduction of the relational database, which forms the foundation of the modern OLTP system. Transaction processing is at the core of the modern organisation, and databases are structured to allow optimal insertion into, and access to, the data in their tables so that systems can interact with each other. The data in such a database is usually stored in third normal form and most commonly represents the latest up-to-date data.
Decision support systems were the first implementations of what are known as data warehouses today. Their purpose is to allow businesses to make decisions based on the data they hold. A data warehouse differs from an OLTP database implementation in that it is a subject-oriented, integrated, non-volatile and time-variant collection of data.
Data mining can be seen as a tool that is part of the Knowledge Discovery process and
can be defined as the process of discovering hidden information from existing data
repositories. It is a collection of processes specifically designed to derive highly valuable
information, which when viewed in the context of the domain it was mined in, provides
knowledge and insight to the user.
Question 2: What are the current data mining pre-processing tasks and how are
they performed?
The data mining pre-processing tasks are as follows:
Data clean-up – This part of the process is still very much a manual one; missing data, inconsistencies and duplicate data need to be resolved at a domain level.
Data denormalisation – In order to get relational data into a format usable by the algorithms it must be transformed, and this is still very much a user-driven process.
Clustering – This is the process of detecting subsets of similar data and grouping
them together. There is a multitude of algorithms available to cluster data and
the traditional mathematical approaches were introduced.
Classification – This is the process of arranging data into a pre-defined class.
The K-nearest neighbour approach was introduced.
Dimensionality reduction – This is the process of reducing the dimension of
the dataset in order to improve classification performance. The principal
component analysis and its mathematical fundamentals were introduced.
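The K-nearest-neighbour baseline named in the classification task above can be sketched in a few lines. This is the textbook algorithm, not the prototype's implementation, and the example data is invented.

```python
import math
from collections import Counter

def knn_classify(train, labels, query, k=3):
    """Classic K-nearest-neighbour: majority vote among the k closest
    exemplars by Euclidean distance."""
    dists = sorted((math.dist(x, query), y) for x, y in zip(train, labels))
    votes = Counter(y for _, y in dists[:k])
    return votes.most_common(1)[0][0]

train = [[1.0, 1.0], [1.2, 0.9], [8.0, 8.0], [7.5, 8.2]]
labels = ["a", "a", "b", "b"]
result = knn_classify(train, labels, [1.1, 1.0])   # nearest exemplars are "a"
```

Every query is compared against all exemplars, which is why reducing the feature space (and the exemplar set) matters so much for classification performance.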
Question 3: How can multi-agent technology embedded with CI orchestrate and
execute the pre-processing tasks?
The pre-processing system that was introduced caters for four of the five pre-
processing tasks defined. The data clean-up task remains one for the domain user to
undertake, as no computer system (to date) is capable of fixing domain-related
problems in datasets.
The other four tasks were orchestrated and automated by making use of multi-agent
technology and are discussed herewith:
Data denormalisation – Once the user has selected the target table structure, the agents can either perform the feature vector process that produces the pattern set required for clustering, or construct a single feature vector for the classification process to run. The results of the selection are stored in a data mining mart.
Clustering – Once the pattern set has been constructed, the agents can execute the clustering of the data stored in the mart and produce a set of cluster centres for the patterns. The centres are then also stored in the data mart.
Classification as a means of dimensionality reduction – When the feature vectors have been constructed, the agents can execute the classification process, which results in the storage of both a transformation matrix and a reduced dataset.
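The orchestration in the three points above can be sketched as a simple pipeline. The agent functions here are illustrative stand-ins (a mean centre instead of the PSO clusterer, zero-variance pruning instead of GA/KNN reduction), not the prototype's actual implementations.

```python
def denormalise_agent(tables):
    """Feature vector construction: join matching parent and child rows
    into one flat pattern per entity."""
    parent, child = tables
    return [p + c for p, c in zip(parent, child)]

def clustering_agent(patterns):
    """Stand-in for the PSO clusterer: return the mean pattern as a
    single cluster centre."""
    n = len(patterns)
    return [[sum(col) / n for col in zip(*patterns)]]

def classification_agent(patterns):
    """Stand-in for GA/KNN reduction: drop zero-variance features."""
    cols = list(zip(*patterns))
    keep = [i for i, c in enumerate(cols) if len(set(c)) > 1]
    return [[row[i] for i in keep] for row in patterns]

def run_preprocessing(tables, mart):
    """Orchestrate the automated tasks and persist each result to the mart."""
    patterns = mart["patterns"] = denormalise_agent(tables)
    mart["centroids"] = clustering_agent(patterns)
    mart["reduced"] = classification_agent(patterns)
    return mart

mart = run_preprocessing(([[1.0, 2.0], [3.0, 2.0]],   # parent rows
                          [[0.5], [0.7]]), {})        # child rows
```

Each stage reads only the previous stage's stored output, which is what allows the agents to run to completion without user intervention once the structure is chosen.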
Question 4: How viable is it to maintain a data mining mart with pre-processed
relational data?
The results of the PSO and GA based clustering and classification algorithms produced
the following conclusions.
The process of clustering is not as complex as classification, but the data transformations required to perform clustering are. The converse is true for classification: the process is much more complex, but the data transformations are not.
The clustering process yielded good results and relatively acceptable runtime performance, because the major part of the time used is for the transformation of the data; this is done in the first phase and stored for the selected structures. The classification process led to two distinct conclusions: firstly, that the curse of dimensionality plays a role and processing time increases dramatically with the number of features; and secondly, that the training set is of the utmost importance, since the more data there is, the more accurately the reduced set can be classified. The trade-off is that as the amount of data increases, so too does the processing time.
The viability of maintaining a data mart consisting of patterns, vectors, cluster centres and reduced datasets ultimately comes down to system performance. However, as was seen in the classification results, once the feature transformation matrix is built it reduces the number of features, which ultimately speeds up the actual classification process on the real data. The author is thus faced with a trade-off between accuracy and time: if the algorithm can produce an acceptable result in one hour but takes 12 hours to refine the answer, is the improvement worth 12 times the effort? There is, however, a solution to this dilemma, which is posed in the section on future work.
13.4 Future work
High-dimensional datasets contain many features which, combined with the sheer volume of data often found in such datasets, can lead to performance problems on single-CPU implementations. The aim of this research was to ascertain whether a data mining mart can be maintained from relational data sources, and the answer is yes: even on a single dual-core system, as used in the test cases, it is viable if performance is not a key concern.

The choice of multi-agent technology and CI-based algorithms helps to alleviate these performance limitations, as the components can be parallelised on distributed systems or even ported to GP-GPU implementations. Furthermore, enhancements to the PSO and GA to improve diversity and stability need to be investigated.
13.5 Conclusion
In this dissertation an overview of the history of data storage was provided in order to explain the reasons behind the structure of modern data storage constructs. Data warehousing systems were discussed and their role within the process of knowledge discovery became clear. An overview of the process of data mining and its pre-processing tasks was also given.
A data pre-processing system was presented which allows for the automated orchestration of almost all the major data mining pre-processing tasks. Once the user has selected the required structure, the system is able to denormalise the selected database tables, calculate cluster centres and produce reduced datasets.
The future work plays a major role in the success of implementing this system in a corporate environment, owing to the performance constraints posed by high-dimensional data. However, the advantages of having pre-processed data in the form of cluster centres and reduced datasets for rule extraction algorithms to run on are clear. Apart from data clean-up, the system is capable of performing these tasks without user intervention; implemented correctly, it could therefore aid data mining applications.
The trade-off between accuracy and time will always exist, but improvements in computer hardware and in the parallelisation of algorithms mean that a system like this, once set up for the chosen database tables, could run in the background and generate up-to-date pre-processed data without users needing to oversee the process.
194
References
[Ada10] Adamson, C (2010). Star Schema The Complete Reference ISBN 0071744320
[Agg13] Aggarwal, C.C., Reddy, K.C. (2013) Data Clustering: Algorithms and Applications (Chapman & Hall/CRC Data Mining and Knowledge Discovery Series) ISBN 1466558210
[Agg12] Aggarwal, C.C., Zhai, C. (2012) Mining Text Data ISBN 1461432227
[Ant09] Anthes, G. (2009). Deep Data Dives Discover Natural Laws. Communications of the ACM, Volume 52(11) pages 13-14. (Accessed 23/03/2010).
[And06] Anderson, B.J., Gross, D.S., Musicant, D.R., Ritz, A.M., Smith, T.G. & Steinberg, L.E. (2006). Adapting K-Medians to Generate Normalized Cluster Centers (Abstract). Proceedings of the sixth SIAM international conference on Data Mining held in Bethesda, Maryland, April 20-22, 2006.SIAM. (Accessed 10/05/2012).
[Aff09] Affenzeller, M. et al. (2009) Genetic Algorithms and Genetic Programming: Modern Concepts and Practical Applications (Numerical Insights) ISBN 1584886293
[Atr] Atramentov, A & Honavar, A. (n.d). Speeding Up Multi-Relational Data Mining. Available from: citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.5 (Accessed 26/10/2013).
[Alf07] Alfred, A & Kazkov, D. (2007). Aggregating Multiple Instances in Relational Database Using Semi-Supervised Genetic Algorithm-based Clustering Technique (Abstract). Ioannidis,Y., Novikov. & Rachev, B. (Eds.) (© 2007). Local Proceedings of ADBIS 2007, pages 136-147. Technical University of Varna Available from: www.adbis.org/docs/lp/13.pdf (Accessed 10/05/2012).
[Ana06] Analoui, M. & Amiri, M.F. (2006). Feature Reduction of Nearest Neighbour Classifiers using Genetic Algorithm, Proceedings of the World Academy of Science, Engineering and Technology. Available from:M Analoui, MF Amiri - World Academy of Science, Engineering and …, 2006 - Citeseer (Accessed 10/05/2012).
195
[Ban05] Bandyopadhyay, S., Maulik, U., Holder, L.B. & Cook, D.J. (eds). (2005). Advanced Methods for Knowledge Discovery from Complex Data, Springer, Advanced Information and Knowledge Processing, volume XVIII
[Bie03] Biere, M. (2003). Business Intelligence for the Enterprise, IBM Press, ISBN 0131413031
[Bla02] Blackwell, T.M. & Bentley, P.J. (2002) Dynamic Search with Charged Swarms, Proceedings of the Genetic and Evolutionary Computation Conference, pages 19–26. ISBN: 1-55860-878-8.
[Blo03] Blockeel, H. & Sebag, M. (2003). Scalability and efficiency in multi-relational data mining (Abstract). SIGKDD Explorations, 5:17-30. Available from: citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.9.5511 (Accessed 10/05/2012).
[Boz03]
Bozdogan, H. (2003) Statistical Data Mining and Knowledge Discovery ISBN 1584883448
[Bra13] Bramer, M. (2013) Principles of Data Mining (Undergraduate Topics in Computer Science) ISBN 1447148835
[Bul12] Bulusu, L. (2012) Open Source Data Warehousing and Business Intelligence ISBN 1439816409
[Bur10] Burd, S.D. (2010) Systems Architecture [Paperback] ISBN 0538475331
[Cun08] Cunningham, P. (2007). Dimensionality Reduction: A Comparative Review. University College Dublin, Technical Report UCD-CSI-2007-7 August 8th, 2007. Available from: www.bradblock.com/Dimension_Reduction.pdf (Accessed 10/05/2012).
[Can06] Cano, J.R., Herrera, F. & Lozano, M. (2006). On the combination of evolutionary algorithms and stratified strategies for training set selection in data mining. Applied Soft Computing, 6(3):323–332. Available from: http://www.sciencedirect.com/science/article/B6W86-4G24X8J-1/2/bde2d2186e16a70cdcdb38480f7f217e) (Accessed 15/03/2010).
[Car12a] Carnegie mellow University (2012) Multi-agent systems Available from: http://www.cs.cmu.edu/~softagents/multi.html
196
[Car12b] Carreira-Perpiñán, M Á. (n.d.). Dimensionality Reduction by Unsupervised regression, Lecture slide, EECS, UC Merced, California. Available from: faculty.ucmerced.edu/mcarreira-perpinan/.../cvpr08b-slides.pdf (Accessed 10/05/2012).
[Cel12] Celco, J. (2012) Joe Celko's Trees and Hierarchies in SQL for Smarties, Second Edition (The Morgan Kaufmann Series in Data Management Systems) ISBN 0123877334
[Che04] Chen, C. & Ye, F. (2004). Particle Swarm Optimization Algorithm and Its Application to Clustering Analysis. Proceedings of the 2004 IEEE International Conference on Networking, Sensing Control held in Taiwi, Taiwan, March 21-23, 2004. Available form: dsp.szu.edu.cn/.../Clustering/02....
[Chi05] Di Chio, C., Poli, R. & Langdon, W.B. (2005). Evolution of Force-Generating Equations for PSO using GP, Article available from the Department of Computer Science, University of Essex, UK. Available from: http://cswww.essex.ac.uk/staff/poli/papers/gsice2005.pdf (Accessed 20/07/2013).
[Chu05] Hugh A., Chipman, H.A. & Gu, H. (2005). Interpretable Dimension Reduction, 32(9):969-987. Available from: www.mscs.dal.ca/~hgu/CJAS116847.pdf (Accessed 12/11/2012).
[Cor12] Coronel C, et al H. (2012). Database Systems: Design, Implementation, and Management ISBN 1111969604
[Dar06] Darmont, J., & Boussaïd, O. (eds). (2006).Processing and Managing Complex Data for Decision Support, IGI Global, ISBN 1591406552
[Dat05] Data Mining Preprocessing. (2005). Lecture 3, MATH3346 -Data Mining, S2/2005.Available from: medecine.forumed.org/pdflike.php?search=Data%20Mining (Accessed 10/05/2012).
[Dat08] Date, C.J. (2008). The Relational Database Dictionary, Extended Edition, firstPress, ISBN 1430210419
[Dat12] Date, C.J. (2012). Database design and Relational theory: normal forms and all that Jazz (Theory in Practice) ISBN 1449328016
197
[Dav08] Davidson, L., Kline, K., Klein S. & Windisch, K. (2008). Pro SQL Server 2008. Relational Database Design and Implementation, Apress, ISBN 143020866X
[Deb06] Dehuri, S., Ghosh, A. & Mall, R. (2006). Generic Algorithms for Multi-Criterion Classifications and Clustering in Data Mining. Journal of Computing and Information Sciences,4 (3):143-154. Available from: www.lania.mx/~ccoello/EMOO/dehuri06.pdf.gz (Accessed 10/05/2012).
[Deh08] Dehuri, S., Patnaik, S., Ghosh, A. & Mall, R. (January 2009). Application of elitist multi-objective genetic algorithm for classification rule generation. Applied Soft Computing, 8(1):477–487. Available from: http://www.sciencedirect.com/science/article/B6W86-4NB2WF0-2/2/b637ae128f8a5bb36ddf0f4e93a8ad44 (Accessed 15/03/2010).
[Doh03] Doherty, M. (2003).Database Cluster for e-Science, Proceedings of UK e-Science All Hands Meeting 2003. Available from: http://www.eminerals.org/papers/AH_2003_databasecluster.pdf (Accessed 01/06/2010).
[Dou08] Dou,W., Hu, J., Hirasawa, K. & Wu, G. (2008). Distributed Multi-Relational Data Mining Based on Genetic Algorithm (Abstract). 2008 IEEE Congress on Evolutionary Computation (CEC 2008). Available from: www.hflab.ips.waseda.ac.jp/~jinglu/Publics/cec08Dou.pdf
[Eng07] Engelbrecht, A. P. (2007). Computational Intelligence : An Introduction, 2nd edition, John Wiley &Sons. ISBN 1-843-34355-X
[Eve11] Everit, S.B. et al (2011) Cluster Analysis (Wiley Series in Probability and Statistics) ISBN 0470749911 2011
[Fon09] Fonesca, C.M. et al. Evolutionary Multi-Criterion Optimization: 5th International Conference, EMO 2009, Nantes, France, April 7-10, 2009, Proceedings (Lecture Notes in ... Computer Science and General Issues) ISBN 3642010199
[Far07] Faraoun, K.M. & Rabhi, A. (2007). Data dimensionality reduction based on genetic selection of feature subsets(Abstract). DOI: 10.1145/314403.314414. Available from: thescipub.com/pdf/10.3844/jcssp.2011.707.714 - United States
198
[Fel07] Feldman, R. & Sanger, J. (2007).The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data, Cambridge University Press, ISBN 0521836573
[Fer13] Fernandez, A et al. An Overview on the Structure and Applications for Business Intelligence and Data Mining in Cloud Computing 7th International Conference on Knowledge Management in Organizations: Service and Cloud Computing Advances in Intelligent Systems and Computing Volume 172, 2013, pp 559-570 DOI 10.1007/978-3-642-30867-3_50
[Fra96] Franklin, S.; Graesser, A. (1996). "Is it an Agent, or just a Program?: A Taxonomy for Autonomous \ Agents University of Memphis, Institute for Intelligent Systems
[Fre05] Freitas, A. A. (2005). Evolutionary Algorithms for Data Mining. Chapter 20 in Data Mining and Knowledge Discovery Handbook, pages 435– 467, US: Springer.
[Gan07] Gan, G. et al (2007) Data Clustering: Theory, Algorithms, and Applications (ASA-SIAM Series on Statistics and Applied Probability) ISBN 0898716233
[Gho04] Ghosh, A. & Nath, B. (June 2004). Multi-objective rule mining using genetic algorithms. Information Sciences, 163(1-3):123–133, Soft Computing and Data Mining. Available from: http://www.sciencedirect.com/science/article/B6V0C-4B3JHTS-1/2/03b327519650d278df6aeed05059393c (Accessed 15/03/2010).
[Gho06] Ghosh, J. Proceedings of the Sixth SIAM International Conference on Data Mining (2006) Page 166. (Accessed 25/10/2013).
[Giu12] Giusti, A. et al. (2012) Classification and Data Mining (Studies in Classification, Data Analysis, and Knowledge Organization) ISBN 3642288936
[Gra12] Graham, A. (2012) The Enterprise Data Model: A framework for enterprise data architecture, 2nd edition ISBN 0956582915
[Gup10] Gupta, R.V. (2010). Whitepaper: Introduction to Data Warehousing, Published on www.system-services.com System Services Corporation. Available from: http://www.system-services.com/dwintro.asp (Accessed: 08/04/2010).
199
[Gup09]
Gupta, S.B. & Mittal, A. (2009).Introduction to Database Management System, Laxmi Publications, ISBN 9381159319
[Had13] Apache Hadoop project. http://hadoop.apache.org/ (2013)
[Han11] [Han, J et al. Data Mining: Concepts and Techniques, Third Edition (The Morgan Kaufmann Series in Data Management Systems) ISBN 0123814790
[Han10] Han, J. (September 21, 2010). Data Preprocessing, Chapter 2 adapted from Han, J. & Kamber, M. (2006) Data Mining: Concepts and Techniques. Available from: www.liacs.nl/~erwin/dbdm2010/03_dbdm2010_Data_Preprocessing.pdf (Accessed 10/05/2012).
[Hib13] JBoss Hibernate ORM for Java http://hibernate.org/ (2013)
[Hus02] Hussein, A., Sarker, R. A. & Newton, C. S. (eds). (2002). Data Mining: A Heuristic Approach, IGI Global. ISBN 1930708254
[Hau04] Haupt, R. Haupt, S.E. (2004) Practical Genetic Algorithms ISBN 0471455652
[Hal08] Halpin, T. (©2008). Information Modeling and Relational Databases, Second Edition. Morgan Kaufmann ublishers, Books24x7. (Electronic Version). ISBN 0123735688
[Har08] Harizopoulos, S., Abadi, D.J., Madden, S. & Stonebraker, M. (2008). OLTP through the looking glass, and what we found there. Proceedings of the 2008 ACM SIGMOD international Conference on Management of Data held in Vancouver, Canada, SIGMOD '08. ISBN 978-1-60558-102-6
[Iba13]
MyBatis Project for Java http://blog.mybatis.org (2013)
[Inm05] Inmon, W. H. (2005). Building the Data Warehouse, Fourth Edition, John Wiley & Sons, ISBN 978-0-7645-9944-6
[IBM07] IBM. (2007) The IBM mainframe, Building on the past, defining the future, Published on www-07.ibm.com. Available from: http://www-07.ibm.com/systems/my/z/about/timeline/ (Accessed: 12/04/2010).
200
[Kai08]
Kain, F. (May 2008). How far is Pluto from Earth? , (Electronic Version). Published on www.universetoday.com Available from: http://www.universetoday.com/guide-to-space/pluto/how-far-is-pluto-from-earth/ (Accessed 08/04/2010)
[Kar08] Kari, L. & Rozenberg, G. (2008). The many facets of natural computing. Communications of the ACM, 52(10):72-83.
[Kay10] Kaya, M. (April 2010). Autonomous classifiers with understandable rule using multi-objective genetic algorithms, Expert Systems with Applications, 37(4):3489–3494. Available from:http://www.sciencedirect.com/science/article/B6V034XFY0F0- 10/2/6bd705a14d2061aaed9a70b5549219b3 (Accessed 15/03/2010).
[Kep10] Kepliz, B. Verbrugge, R. (2010) Teamwork in Multi-Agent Systems: A Formal Approach (Wiley Series in Agent Technology) ISBN 0470699884
[Khu08] Khushaba, R.N., Al-Ani, A. & Al-Jumaily, A.(2008). Differential Evolution based Feature Subset Selection. 19th International Conference on Pattern recognition, ICPR 2008, pages 1-4, IEEE. Available from: RN Khushaba, A Al-Ani… - Pattern Recognition, 2008. …, 2008 - ieeexplore.ieee.org (Accessed 10/05/2012).
[Kim08] Kimball, R. (2008).The Data Warehouse Lifecycle Toolkit, Second Edition, John Wiley & Sons. ISBN 978-0-470-14977-5
[Kim13] Kimball, R. (2013).The Data Warehouse Toolkit: The definitive guide to dimensional modelling, Third Edition, ISBN 1118530802 (Accessed 16/08/2013).
[Kir13] Kiranyaz, S et al. Multidimensional Particle Swarm Optimization for Machine Learning and Pattern Recognition (Adaptation, Learning, and Optimization) ISBN 3642378455
[Kog13] Kogan, J (2013) High Dimensional Introduction to Clustering Large and High-Dimensional Data ISBN 0521617936
[Kom11] Komarasamy, G & Amitabh, W Improving the cluster performance by combining PSO and K-Means algorithm. Department of Computer Science and Engineering Bannari Amman Institute of Technology (http://ictactjournals.in/ArticleDetails.aspx?id=405)
201
[Kuo09] Kuo, R.J., Chao, C.M. & Chiu, Y.T. (November 2009). Application of particle swarm optimization to association rule mining. Applied Soft Computing. Available from: http://www.sciencedirect.com/science/article/B6W86-4XSJVHK-7/2/5d3748225826351222be084cbd91b141 (Accessed 15/03/2010).
[Lab11] Laberge, R The Data Warehouse Mentor: Practical Data Warehouse and Business Intelligence Insights ISBN 0071745327
[Lan07] Langit, L. (2007). Foundations of SQL Server 2005 Business Intelligence, Apress. ISBN 1-59059-834-2
[Led13] Ledolter, J. (2013) Data Mining and Business Analytics with R ISBN 111844714X
[Lig07] Lightstone, S., Teorey, T. & Nadeau, T. (2007). Physical Database Design: The Database Professional's Guide to Exploiting Indexes, Views, Storage, and More, Morgan Kaufmann Publishers, ISBN 9780123693891
[Lob09] Lobel, L., Andrew J.B. & Forte, S. (2009). Programming Microsoft SQL Server 2008, ISBN 9780735626027
[Lon09] Longbing, C., Gorodetsky, V. & Mitkas, P.A. (May-June 2009). Agent Mining: The Synergy of Agents and Data Mining. Intelligent Systems, Journal IEEE Intelligent Systems, May/June 2009 (vol. 24 no. 3), May, 2009, 64-72.
[Mai08] Maimon, O. & Rokach, L. (2008). Soft Computing for Knowledge Discovery and DataMining. Introduction to Soft Computing for Knowledge Discovery and DataMining, pages 1–13. US: Springer.
[Man05] Manolopoulos, Y., Papadopoulos, A. N. & Vassilakopoulos. M Gr. (eds). (2005). Spatial Databases: Technologies, Techniques and Trends, IGI Global, ISBN 1591403871
[Mic02] Corporation, Microsoft. Microsoft Computer Dictionary, (2002). Fifth Edition, Microsoft Press, ISBN 735614954
202
[Mur06] [Nos13]
Murphy, C. (2006). Effects of Swarm Size on Attractive-Repulsive Particle Swarm Optimisation. Available from: http://ncra.ucd.ie/COMP40580/crc2006/murphy.pdf (Accessed 18 February 2012). NoSQL database. http://nosql-database.org/
[Ols08] Olson, D.L. & Delen, D. (2008) Genetic Algorithm support to Data Mining. Chapter 8 in Advanced Data Mining Techniques, pages 125–135. Berlin Heidelberg: Springer.
[Ora11] Oracle® Database Concepts 11g Release 1 (11.1) B28318-06 (2011) Available from:http://docs.oracle.com/cd/B28359_01/server.111/b28318/schema.htm (Accessed 20/10/2013)
[Ols12] Olsson A,E. Particle Swarm Optimization: Theory, Techniques and Applications (Engineering Tools, Techniques and Tables) (Editor) ISBN 1616685271
[Oxf13] The Oxford English Dictionary: Intelligence Available from: http://www.oxforddictionaries.com/definition/english/intelligence?q=intelligence (Accessed 20/10/2013)
[Pas07] Pascalau, E. (February 2007).Smart Agent Based on Rules for Web Knowledge Mining. Digital EcoSystems and Technologies Conference,held in Cairns,DEST '07, Inaugural IEEE-IES, pages 458–461. ISBN 1-4244-0470-3
[Pei98]
Pei, M, Goodman, E, Punch, W.F. (1998) Feature Extraction using Genetic Algorithms. Case centre for Computer-aided engineering and manufacturing, Michigan university.
[Pow06]
Powell, G. (2006). Beginning Database Design and Implementation, Wrox Press, ISBN 0764574906
[Pow03] Power, D.J. (May 31, 2003). A Brief History of Decision Support Systems. Published on DSSResources.COM, World Wide Web, version 2.8. Available from: http://DSSResources.COM/history/dsshistory.html (Accessed 08/04/2010).
203
[Qu10] QU, J., Shao, Z. &, Xiyu LIU, X. (June 2010). Mixed PSO Clustering Algorithm Using Point Symmetry Distance (Abstract). Journal of Computational Information Systems 6(6): 2027-2035. Available from: www.jofcis.com/publishedpapers/2010_6_6_2027_2035.pdf (Accessed 24/10/2013).
[Raj11] Rajaraman, A. Ullman, D,J. (2011) Rule Mining Mining of Massive Datasets ISBN 1107015359
[Ray10] Rayner, A. (2010). Summarizing Relational Data Using Semi-Supervised Genetic Algorithm-Based Clustering Techniques. Journal of Computer Science 6 (7): 775-784. Available from: thescipub.com/abstract/10.../jcssp.2010.775.784 - United States (Accessed 17/09/2013).
[Red12]
Redmond, E . Wilson, J.R Seven(2012) Databases in Seven Weeks: A Guide to Modern Databases and the NoSQL Movement ISBN – 13-978-1-93435-692-0
[Rob07] Rob, P. & Coronel, C. (2007). Database Systems – Design, Implementation
and Management, 7th Edition, Course Technology.
[Rob12] Rob, P. & Coronel, C. (2012). Database Systems – Design, Implementation and Management, 10th Edition, Course Technology. ISBN 1111969604
[Rui09] Xu, R. & Wunsch, D. (2009). Don. Clustering, IEEE Press,
[Rus10]
Russel, S. & Norvig, P. (2010). Artificial Intelligence: A modern approach, Pearson Education Inc.
[Sch10] Schrader, M. (2012). Oracle Essbase & Oracle OLAP: The Guide to Oracle's Multidimensional Solution, Oracle Press, ISBN 0071621822
[Sho08]
Shoham, Y. Leyton-Brown, K. (2008) Multiagent Systems: Algorithmic, Game-Theoretic, and Logical Foundations ISBN 0521899435
[Siv08] Sivanandam, S.N. & Deepa, S.N. (2008). Genetic Programming. Chapter 6 in Introduction to Genetic Algorithms, pages 131–163, Berlin Heidelberg: Springer. ISBN 978-3-540-73189-4
204
[Sik07] Sikora, R. & Piramuthu, S. (July 2007). Framework for efficient feature selection in genetic algorithm based data mining. European Journal of Operational Research, 180(2):723–737. Available from: http://www.sciencedirect.com/science/article/B6VCT-4K1X5KR-6/2/4140bd52594e345ef0a13168fc1e0991 (Accessed 15/03/2010).
[Sil08] Silvers, F. (2008). Building and Maintaining a Data Warehouse. Auerbach Publications, ISBN 1420064622
[Sim13] Simon, D. (2013). Evolutionary Optimization Algorithms, John Wiley & Sons. ISBN 0470937416
[Sou04] Sousa, T., Silva, A. & Neves, A. (May-June 2004). Particle Swarm based Data Mining Algorithms for classification tasks. Parallel Computing, 30(5-6): 767–783, Parallel and nature-inspired computational paradigms and applications. Available from: http://www.sciencedirect.com/science/article/B6V12-4CDJKX0-1/2/79dae1807d77bbaafc5d624822e17ddb (Accessed 15/03/2010).
[Sta07] Stackowiak, R., Rayman, J. & Greenwald, R. (2007). Oracle Data Warehousing and Business Intelligence Solutions, John Wiley & Sons. ISBN 0471919217
[Ste09] Stephens, R. (2009). Beginning Database Design Solutions, Wrox Press, ISBN 1861004966
[Tab09] Tabatabaei, N.M. & Mortezaeei, S.R. (2009). Review of Multi-Agent Systems (MAS), a New Tool for the Control and Management of Modern Power Systems. International Journal on Technical and Physical Problems of Engineering (IJTPE), Transaction on Power Engineering.
[Tan08] Taniar, D. (ed). (2008). Data Mining and Knowledge Discovery Technologies, IGI Global ISBN 1599049600
[Tan09] Tan, K.C., Teoh, E.J., Yu, Q. & Goh, K.C. (May 2009). A hybrid evolutionary algorithm for attribute selection in data mining. Expert Systems with Applications, 36(4):8616–8630. Available from: http://www.sciencedirect.com/science/article/B6V03-4TTMJP3-5/2/72c347f15102bbe142ac69aad1fab7ec (Accessed 15/03/2010).
[Tra09] Trautmann, H., Wagner, T., Naujoks, B., Preuss, M. & Mehnen, J. (Winter 2009). Statistical methods for convergence detection of multi-objective evolutionary algorithms. Evolutionary Computation, 17(4): 493–509.
[Uma13] Mishra, U. (August 8, 2013). Introduction to Management Information System. Available at SSRN: http://ssrn.com/abstract=2307474 or http://dx.doi.org/10.2139/ssrn.2307474
[Uhr09] Uhrmacher, A.M. & Weyns, D. (2009). Multi-Agent Systems: Simulation and Applications (Computational Analysis, Synthesis, and Design of Dynamic Systems), CRC Press. ISBN 1420070231
[Van12] Van der Lans, R. (2012). Data Virtualization for Business Intelligence Systems: Revolutionizing Data Integration for Data Warehouses (The Morgan Kaufmann Series on Business Intelligence), Morgan Kaufmann. ISBN 0123944252
[Vog06] Voges, K.E. & Pope, N.K.Ll. (eds). (2006). Business Applications and Computational Intelligence, IGI Global. ISBN 9781591407027
[Wal07] Walkenbach, J. (2007). "Excel 2007—Where It Came From". Chapter 1 in Excel 2007 Power Programming with VBA, John Wiley & Sons. ISBN 978-0-470-04401-8
[Wan05] Wang, L. & Xiuju, F. (2005). Data Mining with Computational Intelligence, Chapter 1: Introduction, pages 1-5, Springer-Verlag Berlin Heidelberg.
[Wei00] Weiss, G. (ed). (2000). Multiagent Systems: A Modern Approach to Distributed Artificial Intelligence, MIT Press. ISBN 0262731312
[Wit05] Witten, I. H. & Frank, E. (2005). Data Mining – Practical Machine Learning Tools and Techniques, Second Edition, Elsevier Inc.
[Won02] Wong, M. L. & Leung, K.S. (2002). Introduction. Chapter 1 in Data Mining using grammar based genetic programming and applications, Kluwer Academic Publishers.
[Woo09] Wooldridge, M. & Jennings, N.R. (2009). Intelligent Agents: Theory and Practice.
[Wra09] Wray, R. (May 2009). Internet data heads for 500bn gigabytes (Electronic Version). Published on www.guardian.co.uk. Available from: http://www.guardian.co.uk/business/2009/may/18/digital-content-expansion (Accessed: 27/03/2010).
[Wre07] Wrembel, R. & Koncilia, C. (2007). Data Warehouses and OLAP: Concepts, Architectures and Solutions, IGI Global. ISBN 1599043645
[Yan11] Yang, C., Hsiao, C. & Chuang, L. (2011). Accelerated Chaotic Particle Swarm Optimization for Data Clustering. 2009 International Conference on Machine Learning and Computing, IPCSIT vol. 3 (2011), IACSIT Press, Singapore. Available from: www.ipcsit.com/vol3/045L015.pdf (Accessed 10/05/2012).
[Ye05] Ye, F. & Chen, C. (2005). Alternative KPSO-Clustering Algorithm (Abstract). Tamkang Journal of Science and Engineering, 8(2): 165-174. Available from: www2.tku.edu.tw/~tkjse/8-2/8-2-9.pdf (Accessed 10/05/2012).
[Zam08] Zamalloa, M., Rodriguez-Fuentes, L.J., Penagarikano, M., Bordel, G. & Uribe, J.P. (2008). Feature Dimensionality Reduction through Genetic Algorithms for Faster Speaker Recognition (Abstract). 16th European Signal Processing Conference (EUSIPCO 2008), Lausanne, Switzerland, August 25-29, 2008. Available from: gtts.ehu.es/gtts/NT/fulltext/ZamalloaEUSIPCO08b.pdf (Accessed 10/05/2012).
[Zan99] Zaïane, O.R. (1999). Introduction to Data Mining, Chapter I in CMPUT690 Principles of Knowledge Discovery in Databases. Available from: webdocs.cs.ualberta.ca/~zaiane/courses/.../Chapter1/index.html (Accessed 10/05/2012).
[Zon05] Ma, Z. (ed). (2005). Advances in Fuzzy Object-Oriented Databases: Modeling and Applications, IGI Global, ISBN 1591403855
Addendum A
Raw Iris Dataset sample
The Iris dataset is a multivariate dataset consisting of data that quantify the
morphological variation of Iris flowers of three related species: I. setosa, I. versicolor
and I. virginica. The dataset consists of 150 examples, 50 of each species, and has four
attributes, measured in centimetres, as shown in Table A.1.
Table A.1: The Iris dataset
Sepal length Sepal width Petal length Petal width Species
5.1 3.5 1.4 0.2 I. setosa
4.9 3.0 1.4 0.2 I. setosa
4.7 3.2 1.3 0.2 I. setosa
4.6 3.1 1.5 0.2 I. setosa
5.0 3.6 1.4 0.2 I. setosa
5.4 3.9 1.7 0.4 I. setosa
4.6 3.4 1.4 0.3 I. setosa
5.0 3.4 1.5 0.2 I. setosa
4.4 2.9 1.4 0.2 I. setosa
4.9 3.1 1.5 0.1 I. setosa
5.4 3.7 1.5 0.2 I. setosa
4.8 3.4 1.6 0.2 I. setosa
4.8 3.0 1.4 0.1 I. setosa
4.3 3.0 1.1 0.1 I. setosa
5.8 4.0 1.2 0.2 I. setosa
5.7 4.4 1.5 0.4 I. setosa
5.4 3.9 1.3 0.4 I. setosa
5.1 3.5 1.4 0.3 I. setosa
5.7 3.8 1.7 0.3 I. setosa
5.1 3.8 1.5 0.3 I. setosa
5.4 3.4 1.7 0.2 I. setosa
5.1 3.7 1.5 0.4 I. setosa
4.6 3.6 1.0 0.2 I. setosa
5.1 3.3 1.7 0.5 I. setosa
4.8 3.4 1.9 0.2 I. setosa
5.0 3.0 1.6 0.2 I. setosa
5.1 3.8 1.6 0.2 I. setosa
4.6 3.2 1.4 0.2 I. setosa
5.3 3.7 1.5 0.2 I. setosa
5.0 3.3 1.4 0.2 I. setosa
7.0 3.2 4.7 1.4 I. versicolor
6.4 3.2 4.5 1.5 I. versicolor
6.9 3.1 4.9 1.5 I. versicolor
5.5 2.3 4.0 1.3 I. versicolor
6.5 2.8 4.6 1.5 I. versicolor
5.7 2.8 4.5 1.3 I. versicolor
6.3 3.3 4.7 1.6 I. versicolor
5.7 2.8 4.1 1.3 I. versicolor
6.3 3.3 6.0 2.5 I. virginica
5.8 2.7 5.1 1.9 I. virginica
7.1 3.0 5.9 2.1 I. virginica
6.3 2.9 5.6 1.8 I. virginica
6.5 3.0 5.8 2.2 I. virginica
7.6 3.0 6.6 2.1 I. virginica
4.9 2.5 4.5 1.7 I. virginica
7.3 2.9 6.3 1.8 I. virginica
6.7 2.5 5.8 1.8 I. virginica
6.5 3.0 5.2 2.0 I. virginica
6.2 3.4 5.4 2.3 I. virginica
5.9 3.0 5.1 1.8 I. virginica
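As an illustration of how examples such as those in Table A.1 can be grouped by class label, the following sketch (illustrative only, not part of the EPS4DM prototype) represents a few rows as tuples and computes the mean petal length per species:

```python
from collections import defaultdict

# A few rows from Table A.1, as (sepal_len, sepal_wid, petal_len, petal_wid, species).
rows = [
    (5.1, 3.5, 1.4, 0.2, "I. setosa"),
    (4.9, 3.0, 1.4, 0.2, "I. setosa"),
    (7.0, 3.2, 4.7, 1.4, "I. versicolor"),
    (6.4, 3.2, 4.5, 1.5, "I. versicolor"),
    (6.3, 3.3, 6.0, 2.5, "I. virginica"),
    (5.8, 2.7, 5.1, 1.9, "I. virginica"),
]

# Group petal lengths by the species label and compute the mean per class.
petal_lengths = defaultdict(list)
for sl, sw, pl, pw, species in rows:
    petal_lengths[species].append(pl)

means = {s: sum(v) / len(v) for s, v in petal_lengths.items()}
```

Even on this small sample, the per-class means separate I. setosa clearly from the other two species, which is why the Iris dataset is a standard benchmark for classification and clustering algorithms.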
210
Raw Glass Dataset sample
The glass dataset contains the data of six types of glass, defined in terms of their oxide
content (e.g. Na, Fe, K). The types are:
building_windows_float_processed,
building_windows_non_float_processed,
vehicle_windows_float_processed,
containers, tableware and headlamps.
The dataset consists of 214 examples with 10 attributes and the following is a sample:
1, 1.52101,13.64,4.49,1.10,71.78,0.06,8.75, 0.00,0.00,1
2, 1.51761,13.89,3.60,1.36,72.73,0.48,7.83, 0.00,0.00,1
3, 1.51618,13.53,3.55,1.54,72.99,0.39,7.78, 0.00,0.00,1
101,1.51655,12.75,2.85,1.44,73.27,0.57,8.79, 0.11,0.22,2
102,1.51730,12.35,2.72,1.63,72.87,0.70,9.23, 0.00,0.00,2
103,1.51820,12.62,2.76,0.83,73.81,0.35,9.42, 0.00,0.20,2
152,1.52127,14.32,3.90,0.83,71.50,0.00,9.49, 0.00,0.00,3
153,1.51779,13.64,3.65,0.65,73.00,0.06,8.93, 0.00,0.00,3
154,1.51610,13.42,3.40,1.22,72.69,0.59,8.32, 0.00,0.00,3
164,1.51514,14.01,2.68,3.50,69.89,1.68,5.87, 2.20,0.00,5
165,1.51915,12.73,1.85,1.86,72.69,0.60,10.09,0.00,0.00,5
166,1.52171,11.56,1.88,1.56,72.86,0.47,11.41,0.00,0.00,5
183,1.51916,14.15,0.00,2.09,72.74,0.00,10.88,0.00,0.00,6
184,1.51969,14.56,0.00,0.56,73.48,0.00,11.22,0.00,0.00,6
185,1.51115,17.38,0.00,0.34,75.41,0.00,6.65, 0.00,0.00,6
200,1.51609,15.01,0.00,2.51,73.05,0.05,8.83, 0.53,0.00,7
201,1.51508,15.15,0.00,2.25,73.50,0.00,8.34, 0.63,0.00,7
202,1.51653,11.95,0.00,1.19,75.18,2.70,8.93, 0.00,0.00,7
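Each comma-separated record above holds an example number, the refractive index, eight oxide weight percentages and an integer class label for the glass type. A minimal parsing sketch (the function name is illustrative, not part of the prototype) could look as follows:

```python
def parse_glass_record(line):
    """Split one glass-dataset record into its identifier,
    its nine numeric attributes (refractive index plus eight
    oxide percentages) and its glass-type class label."""
    fields = [f.strip() for f in line.split(",")]
    example_id = int(fields[0])
    attributes = [float(f) for f in fields[1:10]]
    glass_type = int(fields[10])
    return example_id, attributes, glass_type

# First record from the sample above.
record = "1, 1.52101,13.64,4.49,1.10,71.78,0.06,8.75, 0.00,0.00,1"
example_id, attributes, glass_type = parse_glass_record(record)
```

Stripping the example number before clustering is important: it is a surrogate key that carries no information about the oxide content and would otherwise dominate any distance measure.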
The 2,4-D Dataset
Soil samples were collected from a site that is contaminated with a pesticide known as
2,4-D (2,4-dichlorophenoxyacetic acid).
There are three classes, based on three genetically similar microbial isolates, which
show the ability to degrade 2,4-D. The dataset consists of 241 samples with 96
attributes.
Iris Dataset Relational structures
Figure A.1 Simple Iris relational structure
Figure A.1 represents the Iris dataset in a simple relational structure with the Flower
species as a foreign key.
Figure A.2 Complex Iris relational structure
Figure A.2 presents a more complex representation of the Iris dataset whereby each
attribute is a foreign key.