
1. INTRODUCTION

“Computers promised a fountain of wisdom, but delivered a flood of data”

- Statistician David Wishart

The rapid growth in technology and recent advances in storage capacity and processing speed have given us the ability to keep a virtually limitless amount of data. The data itself, however, is no longer considered an asset. Every day, billions of business transactions are recorded by banks, hotel chains, airlines, retailers, telecommunication companies and others. Scientific, business and database systems, ranging from simple relational databases to spatial, text and multimedia repositories, keep accumulating large quantities of data. Unless this accumulated data is analyzed properly, it remains useless. The main challenge before us is therefore the need for efficient techniques to analyze the existing information and to extract and uncover useful and valuable patterns from it. The available information has to be utilized to gain a better understanding of the past and to predict the future.

Traditional data analysis methods often involve manual work, which makes the interpretation of data slow, expensive and highly subjective [Fayyad et al., 1996], creating many opportunities for error. The dramatic increase in data volumes and the mixed quality of data make the traditional approaches inappropriate and impractical. Since human analysts can no longer make sense of enormous volumes of data without special tools, data mining is used to automate the process of finding relationships and patterns in raw data and to deliver results that can either be utilized in an automated decision support system or assessed by a human analyst. Data mining is a set of tools, techniques and methods that can be used to find new, hidden or unexpected patterns in a large volume of data, typically stored in a data warehouse. The results obtained from data mining support more effective individual and group decision-making, as it involves analyzing diverse data sources in order to identify relationships, trends, deviations and other relevant information.

In practice, the data mining process consists of several interdependent steps, since the blind application of data mining techniques can easily lead to the discovery of meaningless and invalid patterns. Data preprocessing involves data cleaning, integration, selection and transformation to make the data suitable for analysis; data mining proper applies a selected algorithm to find patterns or build models; post-processing involves the evaluation of patterns and the interpretation of the discovered knowledge to make the results suitable for human analysis. Knowledge Discovery in Databases (KDD) is an iterative process: once the discovered patterns are presented to the users, the evaluation criteria can be enhanced, the mining can be refined, and new data can be further integrated, selected or transformed to obtain more appropriate results.
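As a simple illustration of these steps, the following Python sketch chains preprocessing, mining and post-processing using scikit-learn; the dataset, the clustering algorithm and the evaluation measure are arbitrary choices made only for illustration.

# Illustrative KDD-style pipeline: preprocess -> mine -> evaluate.
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = load_iris(return_X_y=True)                 # data selection
X_scaled = StandardScaler().fit_transform(X)      # preprocessing / transformation
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)   # mining
print("silhouette score:", round(silhouette_score(X_scaled, labels), 3))          # evaluation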

The kind of patterns discovered depends upon the type of data mining task employed. Broadly, there are two types of data mining tasks: descriptive and predictive.

Descriptive data mining tasks describe the general properties of the existing data.

Predictive data mining tasks try to make predictions based on inference from the available data.


Regardless of the specific technique, data mining methods can be classified

by the function they perform or by their class of application as follows:

Characterization:

It is a summarization of the general features of objects in a target class, produced in the form of characteristic rules. The data relevant to a user-specified class can be retrieved by a database query and run through a summarization module to extract the essence of the data at different levels of abstraction.

Discrimination:

It is a comparison of the general features of objects between two classes, referred to as the target class and the contrasting class, expressed as discriminant rules. Similar techniques are used for data characterization and data discrimination, with the exception that data discrimination results include comparative measures.

Classification:

Classification analysis organizes data into given classes and uses given class labels for ordering the objects in the data collection. It is also known as supervised learning. Classification approaches normally use a training set in which all objects are already associated with known class labels. The classification algorithm learns from the training set and builds a model, and the model is then used to classify new objects.

Some of the classification models are decision trees, neural networks, Bayesian

belief networks, support vector machines and genetic algorithms.
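As an illustration of the supervised learning workflow described above, the following Python sketch trains a decision tree on a labelled training set and uses the resulting model to classify new objects; the dataset and parameter values are arbitrary choices.

# Illustrative classification: learn a model from a training set, then classify new objects.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
print("accuracy on unseen objects:", model.score(X_test, y_test))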


Prediction:

There are two major types of prediction: prediction of unavailable data values and prediction of a class label for some data. Once a classification model has been built from a training set, the class label of an object can be predicted based on the attribute values of the object and the attribute values of the classes. Prediction rests on the idea of using a large number of past values to predict probable future values.

Clustering:

Similar to classification, clustering is the organization of data into classes.

However, unlike classification, in clustering, class labels are unknown and it is up to

the clustering algorithm to discover acceptable classes. Clustering is also called

unsupervised classification, because the classification is not dictated by given class

labels.

Association analysis:

Association analysis is the discovery of association rules. It studies the frequency with which items occur together in transactional databases and, based on a threshold called support, identifies the frequent item sets. Another threshold, confidence, which is the conditional probability that an item appears in a transaction when another item appears, is used to pinpoint association rules. The goal is therefore to discover all association rules whose support and confidence are greater than or equal to the specified minimum support and confidence. Association analysis is commonly used for market basket analysis.
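The two thresholds can be illustrated with a small Python sketch over toy transactions; the items and transactions below are hypothetical, and no particular rule-mining algorithm is implied.

# Support and confidence over a toy transactional database.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset):
    # Fraction of transactions that contain every item in the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    # Conditional probability that the consequent appears when the antecedent appears.
    return support(antecedent | consequent) / support(antecedent)

print(support({"bread", "milk"}))        # 0.5
print(confidence({"bread"}, {"milk"}))   # about 0.67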

Outlier analysis:

Outliers are data elements that cannot be grouped into a given class or cluster. Also known as exceptions or surprises, they are often very important to identify. While outliers can be considered noise and discarded in some applications, they can reveal important knowledge in domains such as intrusion detection, and their analysis can therefore be very valuable.

Evolution and deviation analysis:

Evolution and deviation analysis pertain to the study of time-related data that changes over time. Evolution analysis models evolutionary trends in data, which allows the characterization, comparison, classification or clustering of time-related data. Deviation analysis, on the other hand, considers differences between measured values and expected values in order to find the cause of the deviations from the anticipated values. Both can be applied in fraud detection, customer retention, forecasting and so on.

Data mining has attracted a significant amount of research attention due to its usefulness in many applications, including decision support for business, science, engineering and health care data, damage detection in engineering structures, intrusion detection, diabetes diagnosis and detection, portfolio management, selective marketing, user profile analysis and brand loyalty analysis, to name a few.


1.1 Data Clustering

The clustering problem can be described as follows:

Let W be a set of n entities; the task is to find a partition of W into groups such that the entities within each group are similar to each other, while entities belonging to different groups are dissimilar. The entities are usually described by a set of measurements or attributes [Tao and Sheng, 2004].

Basic Concepts

Data clustering has always been an active and challenging research area in data mining. The clustering problem has been addressed in many contexts and by researchers in many disciplines, which reflects its broad appeal and usefulness as one of the steps in exploratory data analysis [Jain et al., 1999].

Generally, clustering activity involves the following basic steps:

1. Pattern representation, which includes feature extraction and/or selection.

2. Definition of a pattern proximity measure appropriate to the given dataset and the type of attributes (numerical, categorical etc.) in that dataset.

3. Clustering.

4. Data abstraction.

5. Assessment of output.


The process of grouping entities by finding similarities between them, according to their characteristics, is called 'clustering', and the groups are called 'clusters'. A cluster is a collection of entities such that the entities within a cluster are similar to one another and dissimilar to the entities in other clusters.

Similarity and dissimilarity / distance measures, also referred to as measures of proximity, are essential to most clustering procedures. The most commonly used distance measure is the Euclidean metric, which defines the distance between two d-dimensional entities x_i and x_j as

d(x_i, x_j) = \sqrt{\sum_{k=1}^{d} (x_{ik} - x_{jk})^2}

The Euclidean distance works well when a data set has "compact" or "isolated" clusters [Mao and Jain 1996].

The dissimilarity between two entities x_i and x_j described in terms of p nominal attributes can be computed using the formula

d(x_i, x_j) = \frac{p - m}{p}

where m is the number of matching attributes and p is the total number of attributes describing the entities.
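As a simple illustration, both measures can be implemented directly; the following Python sketch (function and variable names are illustrative) computes the Euclidean distance for numeric entities and the matching-based dissimilarity for nominal entities.

# Illustrative implementations of the two proximity measures defined above.
import math

def euclidean(x, y):
    # Euclidean distance between two d-dimensional numeric entities.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def nominal_dissimilarity(x, y):
    # (p - m) / p, where m is the number of matching nominal attributes.
    p = len(x)
    m = sum(a == b for a, b in zip(x, y))
    return (p - m) / p

print(euclidean([1.0, 2.0, 3.0], [4.0, 6.0, 3.0]))                # 5.0
print(nominal_dissimilarity(["red", "small"], ["red", "large"]))  # 0.5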


The above metrics are not directly applicable to determine the membership of

an incoming entity into a non-uniformly distributed / growing cluster.

1.2 Major Clustering Approaches

A large number of clustering algorithms exist in the literature [Mirkin 1996].

The choice of clustering algorithm depends not only on the type of available data but

also on the particular purpose and application. Major clustering algorithms can be

classified broadly into the following categories [Dunham 2003].

1.2.1 Hierarchical Clustering

Hierarchical algorithms create a hierarchical decomposition of the given set of data objects, building a cluster hierarchy or tree of clusters and finding successive clusters using previously established ones. The result of a hierarchical clustering algorithm can be graphically displayed as a tree, called a dendrogram. For data clustering, this dendrogram provides a taxonomy or hierarchical index. Every cluster node contains child clusters, and sibling clusters partition the points covered by their common parent, so the data can be explored at different levels of granularity. The method can be further classified into agglomerative (bottom-up) or divisive (top-down), based on how the hierarchical decomposition is formed.


This technique can handle any form of similarity or distance and can be applied to any attribute type, but it makes no provision for the reallocation of entities that may have been poorly classified in the early stages, as clusters are not revisited once they have been constructed.
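The following Python sketch illustrates agglomerative (bottom-up) hierarchical clustering with SciPy; the synthetic data, the linkage method and the cut level are arbitrary choices made for illustration.

# Agglomerative hierarchical clustering and its dendrogram (plotting requires matplotlib).
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (20, 2)), rng.normal(3.0, 0.3, (20, 2))])

Z = linkage(X, method="ward")                     # hierarchical decomposition
labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into two clusters
dendrogram(Z)                                     # the tree view for exploring granularity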

1.2.2 Partitional Clustering

Given a database of n objects or data tuples and k, the number of clusters to be formed, a partitioning method organizes the n objects into k partitions (k ≤ n), where each partition represents a cluster. A partitioning method first creates an initial partitioning and then uses an iterative relocation technique that attempts to improve the partitioning by moving objects from one partition to another, so that the objects within the same cluster are similar or related to each other, whereas objects of different clusters are dissimilar in terms of the database attributes.

These algorithms are advantageous for applications involving large datasets, for which constructing a dendrogram is difficult, but they are disadvantageous for applications where the desired number of partitions cannot be specified beforehand.

The most well-known partitional clustering algorithms are k-means, where each cluster is represented by the mean value of the objects in the cluster, and k-medoids, where each cluster is represented by one of the objects located near the center of the cluster.
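The iterative relocation idea behind k-means can be sketched in a few lines of Python; the initialization, the fixed number of iterations and the synthetic data below are illustrative simplifications.

# Minimal k-means sketch: assign each object to the nearest mean, then recompute the means.
import numpy as np

def kmeans(X, k, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]        # initial partitioning
    for _ in range(n_iter):
        # relocation step: assign each object to its closest cluster center
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        # update step: each cluster is represented by the mean of its objects
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centers

X = np.vstack([np.random.default_rng(1).normal(m, 0.2, (30, 2)) for m in (0.0, 2.0, 4.0)])
labels, centers = kmeans(X, k=3)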


1.2.3 Density-based clustering

Density-based clustering algorithms have been developed to discover clusters with arbitrary shapes. They consider clusters as dense regions of objects in the data space that are separated by regions of low density. Examples of density-based algorithms include DBSCAN, which grows clusters according to a density threshold; OPTICS, which computes an augmented cluster ordering for automatic and interactive cluster analysis; and DENCLUE, which is based on a set of density distribution functions.

1.2.4 Grid-based clustering

Grid-based clustering algorithms quantize the object space into a finite number of cells that form a grid structure, on which all operations for clustering are performed. Typical examples include STING, which explores statistical information stored in the grid cells; WaveCluster, which clusters objects using a wavelet transform method; and CLIQUE, which represents a grid- and density-based approach for clustering in a high-dimensional data space.
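The quantization step that grid-based methods share can be illustrated in a few lines of Python; the cell size and the density threshold below are arbitrary choices.

# Map each object to a finite grid cell and count cell densities.
import numpy as np
from collections import Counter

X = np.random.default_rng(0).uniform(0.0, 10.0, size=(200, 2))
cell_size = 1.0

cells = [tuple(c) for c in np.floor(X / cell_size).astype(int)]   # grid cell index per object
density = Counter(cells)                                          # number of objects per cell
dense_cells = {c for c, n in density.items() if n >= 5}           # cells above a density threshold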

1.2.5 Model-based clustering

Model-based clustering algorithms try to optimize the fit between the given data and some mathematical model. They assume that the data are generated by a finite mixture of underlying probability distributions, such as multivariate normal distributions.
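A common instance of this approach is fitting a finite mixture of multivariate normal distributions; the following scikit-learn sketch, with arbitrary synthetic data and parameters, illustrates it.

# Model-based clustering with a Gaussian mixture.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, (50, 2)), rng.normal(4.0, 0.5, (50, 2))])

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)
labels = gmm.predict(X)        # hard cluster assignments
probs = gmm.predict_proba(X)   # soft (probabilistic) memberships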


1.3 Successful algorithms for data clustering

One of the major research issues in the field of data clustering is the ability of a clustering algorithm to deal with entities described in terms of heterogeneous attributes. Another research issue is the ability of a clustering algorithm to deal with dynamically growing databases. In this direction, a brief review of existing clustering algorithms that can deal with either purely numeric datasets or purely categorical datasets is presented below.

1.3.1 Clustering numerical attributes

BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) is designed to cluster large datasets of n-dimensional vectors using a limited amount of main memory. It uses a special hierarchical data structure called the CF tree, whose nodes hold cluster features that accommodate summary information about sub-clusters of points, and then clusters based on this data summary instead of the original dataset. Hence it can be regarded as an incremental clustering algorithm for the formation and maintenance of hierarchical clusters. A dense region of points is treated as a single cluster and points in sparse regions are treated as outliers. A single scan of the dataset yields a good clustering, and one or more additional passes can optionally be used to improve the quality further [Zhang et al. 1996].
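The incremental flavour of BIRCH can be illustrated with scikit-learn's Birch implementation, which maintains a CF tree and accepts data in chunks via partial_fit; the parameter values and synthetic batches below are arbitrary.

# BIRCH-style incremental clustering with scikit-learn.
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
batch1 = rng.normal(0.0, 0.3, (100, 2))
batch2 = rng.normal(3.0, 0.3, (100, 2))

birch = Birch(threshold=0.5, branching_factor=50, n_clusters=2)
birch.partial_fit(batch1)      # summarize the first chunk into the CF tree
birch.partial_fit(batch2)      # incrementally absorb more data
labels = birch.predict(np.vstack([batch1, batch2]))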


DBSCAN (Density-Based Spatial Clustering of Applications with Noise) uses a density-based notion of clusters to discover clusters of arbitrary shape. Density-based clustering methods typically regard clusters as dense regions of objects in the data space that are separated by regions of low density (representing noise). The key idea of DBSCAN is that, for each object of a cluster, the neighborhood of a given radius has to contain at least a minimum number of data objects, i.e. the density of the neighborhood must exceed a threshold. DBSCAN requires two input parameters, ε and MinPts [Ester et al. 1996].
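A minimal usage sketch with scikit-learn's DBSCAN follows; eps and min_samples correspond to the two required input parameters, and the values and synthetic data are arbitrary.

# Density-based clustering with DBSCAN; points labelled -1 are treated as noise.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.2, (80, 2)),
               rng.normal(3.0, 0.2, (80, 2)),
               rng.uniform(-2.0, 5.0, (10, 2))])   # two dense regions plus scattered noise

labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)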

CURE (Clustering Using REpresentatives) is more robust to outliers, and

identifies clusters having non-spherical shapes and wide variances in size. It

achieves this by representing each cluster by a certain fixed number of points that

are generated by selecting well scattered points from the cluster and then

shrinking them toward the center of the cluster by a specified fraction. Having

more than one representative point per cluster allows CURE to adjust well to the

geometry of non-spherical shapes and the shrinking helps to gradually reduce the

effects of outliers. To handle large databases, it employs a combination of

random sampling and partitioning. A random sample drawn from the data set is

first partitioned and each partition is partially clustered. The partial clusters are

then clustered while eliminating outliers in a second pass to yield the desired

clusters. This scales well for large databases without sacrificing clustering quality

[Guha et al.1998].


1.3.2 Clustering Categorical attributes

A categorical attribute/variable has a limited number of values. Each value

usually represents a distinct label or category. Categorical (or qualitative) variables

identify categorical properties. The values of categorical attributes differ in kind,

whereas quantitative/numerical variables differ in quantity or distance.

A categorical variable is called nominal if there is no natural ordering of the

categories. Examples are gender, race, religion, or sport. When the categories may be

ordered, the variable is called ordinal. Categorical variables that express size (small, medium, large) or attitude (strongly disagree, disagree, neutral, agree, strongly agree) are ordinal variables. A binary variable is considered a categorical variable with two values.

Some of the clustering algorithms which deal with categorical variables are:

COBWEB [Fisher 1987] produces a hierarchy of classes and organizes observations into a classification tree. Its incremental nature allows new data to be clustered without having to repeat the clustering already performed. It uses a heuristic measure called category utility to guide the search. COBWEB is discussed in detail in the literature survey chapter.

ROCK (RObust Clustering using linKs) is an adaptation of an agglomerative hierarchical clustering algorithm which heuristically optimizes a criterion function defined in terms of the number of "links" between tuples. Informally, the number of links between two tuples is the number of common neighbors they have in the dataset. Starting with each tuple in its own cluster, the two closest clusters are repeatedly merged until the required number (say, k) of clusters remains. Since the algorithm is cubic in the number of tuples in the dataset, ROCK clusters a sample randomly drawn from the dataset and then partitions the entire dataset based on the clusters obtained from the sample. Beyond the fact that the set of all clusters together may optimize a criterion function, the set of tuples in each individual cluster is not otherwise characterized. The principle of ROCK lies in maximizing a function that takes into account both the maximization of the sum of links for objects from the same cluster and the minimization of the sum of links for objects from different clusters [Guha et al. 1999].
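The notion of links can be made concrete with a short Python sketch; the Jaccard-based neighbourhood and the threshold theta follow the usual ROCK formulation, but the toy data here are hypothetical.

# Number of links between two tuples = number of common neighbours they share.
from itertools import combinations

def jaccard(a, b):
    return len(a & b) / len(a | b)

def links(tuples, theta=0.5):
    n = len(tuples)
    neighbours = [{j for j in range(n)
                   if j != i and jaccard(tuples[i], tuples[j]) >= theta}
                  for i in range(n)]
    return {(i, j): len(neighbours[i] & neighbours[j])
            for i, j in combinations(range(n), 2)}

data = [{"a", "b"}, {"a", "b", "c"}, {"b", "c"}, {"x", "y"}]
print(links(data, theta=0.4))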

CACTUS (Clustering Categorical Data Using Summaries) is based on the central idea that a summary of the entire dataset is sufficient to compute a set of "candidate" clusters, which can then be validated to determine the actual set of clusters. It relies on the co-occurrence of certain categories of different variables: if the number of co-occurrences of two categories of the k-th and l-th variables exceeds their expected frequency by more than a user-defined threshold, the categories are said to be strongly connected. CACTUS consists of three phases: summarization, clustering and validation. In the summarization phase, the summary information is computed from the dataset. In the clustering phase, this summary information is used to discover a set of candidate clusters. In the validation phase, the actual set of clusters is determined from the set of candidate clusters.

These approaches are suitable for clustering static databases only.

1.3.3 Limitations of Existing approaches

1) Existing approaches assume that all the data in the database can fit into the memory of a computer system so that it can be processed. This does not hold good for very large databases.

2) Most of the algorithms are applicable only to static datasets, which may lead to incorrect results when they are applied to dynamic data.

3) In a data warehouse environment, updates are done periodically and, consequently, the clustering solution derived from the warehouse has to be updated periodically. Since the existing algorithms can handle only static databases, they have to be re-run on the entire contents of the data warehouse.

4) It is inefficient and time-consuming to rescan the entire database each and every time an update occurs.

1.4 Incremental Clustering

Incremental clustering is the process of updating an existing set of clusters incrementally rather than mining them from scratch on every database update.


The conventional approach to incremental mining involves data abstraction/summarization to compress the database incrementally; the outcome of the summarization is then clustered using slightly modified versions of existing clustering algorithms [Jain et al. 1999].

A typical incremental clustering algorithm has the following steps (a minimal sketch of this loop is given after the list):

Step 1 - Assign the first data item to a cluster and record the cluster summary.

Step 2 - Assign each new data item to the closest cluster based on its proximity to the various cluster summaries and update that cluster summary accordingly.

Step 3 - Repeat Step 2 upon the arrival of new data items.
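In the sketch below, each cluster is summarized by a count and a sum of points (i.e. a centroid), and a distance threshold decides when to start a new cluster; both the summary and the threshold are illustrative assumptions rather than a specific published algorithm.

# Generic incremental clustering loop: summaries are (count, sum of points).
import numpy as np

class IncrementalClusterer:
    def __init__(self, threshold=1.0):
        self.threshold = threshold
        self.counts, self.sums = [], []

    def centroids(self):
        return [s / n for s, n in zip(self.sums, self.counts)]

    def add(self, x):
        x = np.asarray(x, dtype=float)
        if not self.counts:                                  # Step 1: first data item
            self.counts.append(1); self.sums.append(x.copy())
            return 0
        dists = [np.linalg.norm(x - c) for c in self.centroids()]
        j = int(np.argmin(dists))
        if dists[j] <= self.threshold:                       # Step 2: join the closest cluster
            self.counts[j] += 1; self.sums[j] += x
            return j
        self.counts.append(1); self.sums.append(x.copy())    # otherwise start a new cluster
        return len(self.counts) - 1

clusterer = IncrementalClusterer(threshold=1.0)
for point in [[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]]:   # Step 3: stream of items
    clusterer.add(point)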

It may be observed that incremental clustering is advantageous for clustering large static databases as well, since it does not require the storage of the entire data matrix in memory. As a result, incremental algorithms require very little space: usage of main memory is low and keeping the mutual distances between data objects in memory is not essential. In addition, incremental clustering algorithms are scalable in terms of the number of data objects and the number of attributes [Dan and Namita 2004]. The time requirement of these algorithms is also low, as they avoid redundant processing of the whole dataset by simply updating the cluster summaries upon the arrival of new data.


1.5 Motivation

The main aim of incremental clustering algorithms is to deal with dynamically growing datasets. An incremental clustering algorithm has to maintain the cluster prototype/summary as long as the same concept persists, and refresh or modify the cluster prototype/summary when there is a drift in the concept. Because of these capabilities to maintain as well as refresh the cluster prototype/summary as needed, incremental clustering is found to be well suited for modeling growth patterns in dynamically changing environments. In general, the growth pattern of a clustering solution is expected to be uniform over all the clusters, with each cluster containing uniformly distributed data entities.

For clusters with uniformly distributed entities, the usual distance measures such as the Euclidean distance hold good for deciding the membership of an incoming data point in an existing cluster. However, there exist applications in which clusters have non-uniformly distributed data entities. Specifically, when modeling the growth patterns of neighborhoods in an urban area, the shape of a cluster must be non-globular, as a neighborhood consists of a smaller, densely populated area (the downtown) and a larger, sparsely populated area (the posh localities). As the city grows, new colonies are expected to develop adjacent to the sparser side of the city, while vertical growth, or infiltration into the sparser area, is expected on the denser side to keep pace with the additional population in the city. As a consequence, the growth of the city appears to be bounded on the denser side and open towards the sparser side. None of the existing incremental clustering algorithms can simulate the growth pattern of such situations. Hence this research work is motivated by the need to deal with such situations.

1.6 Problem Statement

This research work aims to develop new algorithms and proximity metrics for

the formation and maintenance of incremental clusters to simulate the growth pattern

of clusters with non-uniformly distributed entities. Since the normal Euclidean distance metric will not discriminate the data points on the denser side from those on the sparser side, the author proposes a new proximity metric, namely the Inverse Proximity Estimate (IPE), which is capable of discriminating the data points as required for simulating the growth pattern of a non-uniformly distributed cluster.

According to Charikar, the incremental clustering problem can be defined as follows: "For an update sequence of n points in M, maintain a collection of k clusters such that as each input point is presented, either it is assigned to one of the current k clusters or it starts off a new cluster while two existing clusters are merged into one" [Charikar et al. 1997].

The algorithms CFICA and M-CFICA, proposed by the author to form partitional clusters from numeric datasets and heterogeneous datasets respectively, handle the three primary tasks mentioned above, namely assigning a data point to a cluster, starting a new cluster and merging existing clusters, in addition to refreshing the cluster solution upon concept drift, in the specific context of non-uniformly distributed clusters.


1.7 Original Contributions of the Research Work

The main contributions of this research are as follows:

Recognized that non-uniformly distributed clusters are also significant in applications such as modeling the growth pattern of neighborhoods in urban areas.

Investigated the suitability of conventional algorithms for the formation of clusters with non-uniformly distributed entities.

Developed a new proximity metric called the 'Inverse Proximity Estimate' (IPE) to determine the proximity of an entity to a cluster with non-uniformly distributed entities.

Proposed a methodology called CFICA to cluster dynamic datasets with entities described in terms of only numerical attributes.

Designed an information theoretic approach to estimate the Normalized Dissimilarity (DS) between two entities described in terms of categorical attributes, and combined it with the Normalized Euclidean Distance (NED) to devise an estimate referred to as the Mixed Distance (MD) that can deal with all types of attributes (an illustrative sketch of such a combination follows this list).

Developed a new approach to incremental clustering called M-CFICA to deal with entities described in terms of heterogeneous attributes.
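The sketch below shows only the general form of such a combined measure: it pairs a normalized Euclidean distance over the numeric attributes with a simple mismatch-based categorical term and equal weights. The mismatch term and the weighting are illustrative stand-ins and are not the information theoretic Normalized Dissimilarity developed in this thesis.

# Illustrative mixed distance: NOT the information theoretic DS proposed in this work.
import numpy as np

def mixed_distance(num_x, num_y, cat_x, cat_y, num_ranges):
    # normalized Euclidean distance over numeric attributes
    diffs = (np.asarray(num_x) - np.asarray(num_y)) / np.asarray(num_ranges)
    ned = np.sqrt(np.mean(diffs ** 2))
    # simple mismatch ratio over categorical attributes (stand-in for DS)
    ds = sum(a != b for a, b in zip(cat_x, cat_y)) / len(cat_x)
    return 0.5 * ned + 0.5 * ds

print(mixed_distance([1.0, 10.0], [2.0, 30.0],
                     ["red", "small"], ["red", "large"],
                     num_ranges=[5.0, 100.0]))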

1.8 Organization of the Thesis

The complete work presented in the subsequent chapters is as outlined below:

Chapter 2 provides the literature survey carried out to understand the research problem. Various incremental clustering algorithms for mining dynamic and large datasets, numerical data, categorical data, mixed data and other applications are presented.

Chapter 3 proposes a new methodology called CFICA to cluster dynamic datasets incrementally and motivates the need for a novel proximity metric called the Inverse Proximity Estimate (IPE). The proposed algorithm and metric are described in detail in this chapter.

Chapter 4 proposes another new methodology called M-CFICA to deal with mixed data. An information theoretic approach is designed to estimate the Normalized Dissimilarity (DS) between two entities. This is combined with the Normalized Euclidean Distance (NED) to devise an estimate referred to as the Mixed Distance (MD), which can deal with all types of attributes. The mixed distance measure estimates the distance between two data points in terms of their numerical and categorical attributes separately.

Chapter 5 presents the experimental analysis and results. In order to analyze the accuracy of the proposed approach, real-world datasets from the UCI machine learning repository as well as hypothetical datasets have been used.

Chapter 6 concludes the work with a brief summary of the proposed algorithms and possible extensions of the work.