1. INTRODUCTION
“Computers promised a fountain of wisdom, but delivered a flood of data”
- Statistician David Wishart
The rapid growth in technology and recent advances in storage capacity and
processing speeds have provided us with the ability to keep a virtually limitless
amount of data. Merely possessing this information, however, is no longer an asset
in itself. Every day, billions of business transactions are recorded by banks, hotel
chains, airlines, retailers, telecommunications companies and others. Scientific,
business and database technologies, ranging from simple relational systems to
spatial, text or media systems, keep accumulating large quantities of data. Unless
this accumulated data is analyzed properly, it is useless. The main challenge before
us is the need for efficient techniques to analyze the existing information and to
extract and uncover useful and valuable patterns from it. The available information
has to be utilized to gain a better understanding of the past and to predict the future.
Traditional data analysis methods often involve manual work, and the
interpretation of data becomes slow, expensive and highly subjective [Fayyad et al.,
1996], creating many opportunities for errors. The dramatic increase in data volumes
and the mixed quality of data make the traditional approaches inappropriate and
impractical. Since human analysts can no longer make sense of enormous volumes of
data without special tools, data mining is used to automate the process of finding
relationships and patterns in raw data and deliver results that can either be utilized in
an automated decision support system or assessed by a human analyst. Data mining
is a set of tools, techniques and methods that can be used to find new, hidden or
unexpected patterns from a large volume of data typically stored in a data
warehouse. The results obtained from data mining help in more effective individual
and group decision-making as it involves analyzing diverse data sources in order to
identify relationships, trends, deviations and other relevant information.
The data mining process consists of several interdependent steps, since the
blind application of data mining techniques can easily lead to the discovery of
meaningless and invalid patterns. Data preprocessing involves data cleaning,
integration, selection, and transformation to make the data suitable for analysis; data
mining algorithm selection finds the patterns/build models; post-processing involves
evaluation of patterns and interpretation of the discovered knowledge to make the
results suitable for human analysis. Knowledge Discovery in Databases (KDD) is an
iterative process where once the patterns discovered are presented to the users, the
evaluation criteria can be enhanced, mining can be refined, and new data can be
further integrated, selected or transformed to get more appropriate results.
The kind of patterns discovered depends upon the type of data mining tasks
employed. Broadly, there are two types of data mining tasks: descriptive and
predictive.
Descriptive data mining tasks describe the general properties of the existing data.
Predictive data mining tasks try to make predictions based on inference on
available data.
Regardless of the specific technique, data mining methods can be classified
by the function they perform or by their class of application as follows:
Characterization:
It is a summarization of general features of objects in a target class by producing
characteristic rules. The data which is relevant to a user-specified class can be
retrieved by a database query and run through a summarization module to extract the
essence of the data at different levels of abstraction.
Discrimination:
It is a comparison of the general features of objects between two classes referred
to as the target class and the contrasting class by producing discriminant rules.
Similar techniques are used for data characterization and data discrimination with the
exception that data discrimination results include comparative measures.
Classification:
Classification analysis organizes data into given classes, using given class
labels to assign the objects in the data collection. It is also known as supervised
learning. Classification approaches normally use a training set where all objects are
already associated with known class labels. The classification algorithm learns from
the training set and builds a model and the model is used to classify new objects.
Some of the classification models are decision trees, neural networks, Bayesian
belief networks, support vector machines and genetic algorithms.
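The idea of learning from a labeled training set and then classifying new objects can be illustrated with a minimal nearest-neighbour sketch; the data, labels and function names here are illustrative, not from the thesis:

```python
import math

def predict_1nn(train, new_point):
    """Return the class label of the training object nearest to new_point.

    train is a list of (attribute_vector, label) pairs.
    """
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    # The "model" here is just the training set itself; the nearest
    # labeled object decides the class of the new object.
    _, label = min(train, key=lambda pair: dist(pair[0], new_point))
    return label

training_set = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((5.0, 5.1), "B")]
print(predict_1nn(training_set, (4.8, 5.0)))  # "B": nearest neighbour has class B
```

Decision trees, neural networks and the other models listed above replace this lookup with a learned structure, but the train-then-classify workflow is the same.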
Prediction:
There are two major types of predictions: prediction of unavailable data values
and prediction of a class label for some data. Once a classification model is built
based on a training set, the class label of an object can be predicted from the
attribute values of the object and the attribute values of the classes. Prediction is
based on the idea of using a large number of past values to predict probable future
values.
Clustering:
Similar to classification, clustering is the organization of data in classes.
However, unlike classification, in clustering, class labels are unknown and it is up to
the clustering algorithm to discover acceptable classes. Clustering is also called
unsupervised classification, because the classification is not dictated by given class
labels.
Association analysis:
Association analysis is the discovery of association rules. It studies the
frequency of items occurring together in transactional databases, and based on a
threshold called support, identifies the frequent item sets. Another threshold,
confidence, which is the conditional probability that an item appears in a transaction
when another item appears, is used to pinpoint association rules. So the goal is to
discover all association rules with support and confidence greater than or equal to the
minimum support and confidence. Association analysis is commonly used for market
basket analysis.
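The support and confidence thresholds described above can be computed directly; the basket data below is purely illustrative:

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Conditional probability that consequent appears given antecedent:
    support(antecedent ∪ consequent) / support(antecedent)."""
    return (support(transactions, set(antecedent) | set(consequent))
            / support(transactions, antecedent))

baskets = [{"bread", "milk"}, {"bread", "butter"},
           {"bread", "milk", "butter"}, {"milk"}]
print(support(baskets, {"bread", "milk"}))       # 0.5
print(confidence(baskets, {"bread"}, {"milk"}))  # 0.666...
```

A rule such as bread → milk would be reported only if both values meet the user-specified minimum support and confidence.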
Outlier analysis:
Outliers are data elements that cannot be grouped in a given class or cluster. Also
known as exceptions or surprises, they are often very important to identify. While
outliers can be considered noise and discarded in some applications, they can reveal
important knowledge in domains like intrusion detection, making their analysis
very valuable.
Evolution and deviation analysis:
Evolution and deviation analysis pertain to the study of time-related data.
Evolution analysis models evolutionary trends in data, which allows
characterizing, comparing, classifying or clustering of time-related data.
Deviation analysis, on the other hand, considers differences between measured
values and expected values, to find the cause of the deviations from the anticipated
values. They can be applied in fraud detection, customer retention, forecasting etc.
Data mining has attracted a significant amount of research attention due to its
usefulness in many applications, including decision support for business, science,
engineering and health care data, damage detection in engineering structures,
intrusion detection, diabetes diagnosis and detection, portfolio management,
selective marketing, user profile analysis and brand loyalty analysis, to name a few.
1.1 Data Clustering
The clustering problem can be described as follows:
Let W be a set of n entities; the task is to find a partition of W into groups, such that the
entities within each group are similar to each other, while entities belonging to
different groups are dissimilar. The entities are usually described by a set of
measurements or attributes [Tao and Sheng, 2004].
Basic Concepts
Data clustering has always been an active and challenging research area in
data mining. The clustering problem has been addressed in many contexts and by
researchers in many disciplines which reflects its broad appeal and usefulness as one
of the steps in exploratory data analysis [Jain et. al., 1999].
Generally, clustering activity involves the following basic steps:
1. Pattern representation which includes feature extraction and / or selection.
2. Definition of a pattern proximity measure appropriate to the given dataset and the
type of attributes (numerical, categorical etc.) in that dataset.
3. Clustering
4. Data abstraction
5. Assessment of output
The process of grouping, by finding similarities between entities according to
the characteristics of entities is called ‘clustering’. The groups are called ‘clusters’. A
cluster is a collection of entities such that the entities within a cluster are similar to
one another and dissimilar to the entities in other clusters.
Similarity and dissimilarity / distance measures are also referred to as
measures of proximity and are essential to most clustering procedures. The most
commonly used distance measure is the Euclidean metric, which defines the
distance between two d-dimensional entities xi and xj as

d(xi, xj) = sqrt( Σ k=1..d (xik − xjk)² )

The Euclidean distance works well when a data set has “compact” or “isolated”
clusters [Mao and Jain, 1996].

The dissimilarity between two entities xi and xj described in terms of p nominal
attributes can be computed using the formula given below:

d(xi, xj) = (p − m) / p

where m is the number of matching attributes and p is the total number of attributes
describing the entities.
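Both proximity measures can be sketched in a few lines; the attribute values below are illustrative:

```python
import math

def euclidean(xi, xj):
    """Euclidean distance between two d-dimensional numeric entities."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xi, xj)))

def nominal_dissimilarity(xi, xj):
    """(p - m) / p, where m counts matching nominal attributes out of p."""
    p = len(xi)
    m = sum(a == b for a, b in zip(xi, xj))
    return (p - m) / p

print(euclidean((0.0, 0.0), (3.0, 4.0)))                         # 5.0
print(nominal_dissimilarity(("red", "round"), ("red", "flat")))  # 0.5
```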
The above metrics are not directly applicable to determine the membership of
an incoming entity into a non-uniformly distributed / growing cluster.
1.2 Major Clustering Approaches
A large number of clustering algorithms exist in the literature [Mirkin 1996].
The choice of clustering algorithm depends not only on the type of available data but
also on the particular purpose and application. Major clustering algorithms can be
classified broadly into the following categories [Dunham 2003].
1.2.1 Hierarchical Clustering
Hierarchical algorithms create a hierarchical decomposition of the given set
of data objects to build a cluster hierarchy or a tree of clusters and find successive
clusters using previously established clusters. The result of a hierarchical clustering
algorithm can be graphically displayed as a tree, called a dendrogram. For data
clustering, this dendrogram provides a taxonomy or hierarchical index. Every cluster
node contains child clusters and sibling clusters partition the points covered by their
common parent. So data can be explored on different levels of granularity. The
method can be further classified into agglomerative (bottom-up) or divisive (top-
down), based on how the hierarchical decomposition is formed.
This technique can handle any form of similarity or distance and can be
applied to any attribute type, but it makes no provision for the reallocation of
entities that may have been poorly classified in the early stages, as clusters are
not revisited once constructed.
1.2.2 Partitional Clustering
Given a database of n objects or data tuples and k, the number of clusters to
be formed, a partitioning method organizes the n objects into k partitions (k ≤ n).
Here each partition represents a cluster. A partitioning method first creates an initial
partitioning. It then uses an iterative relocation technique that attempts to improve
the partitioning by moving objects from one place to another so that the objects
within the same cluster are similar or related to each other, whereas objects of
different clusters are dissimilar in terms of database attributes.
These algorithms are advantageous for applications with large datasets, for
which it is difficult to construct a dendrogram, but disadvantageous for
applications where the number of desired partitions cannot be specified beforehand.
The most well-known partitioning clustering algorithms are: k-means, where
each cluster is represented by the mean value of the objects in the cluster and k-
medoids, where each cluster is represented by one of the objects located near the
center of the cluster.
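The iterative relocation performed by k-means can be sketched as follows; the 2-D points, the fixed iteration count and the seeded random initialization are illustrative choices, not details from the thesis:

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Minimal k-means: alternate assignment and centroid-update steps."""
    random.seed(seed)
    centroids = random.sample(points, k)  # initial partitioning
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[c])))
            clusters[j].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for j, members in enumerate(clusters):
            if members:
                centroids[j] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    return centroids, clusters

pts = [(1.0, 1.0), (1.5, 1.8), (8.0, 8.0), (8.2, 7.9)]
centroids, clusters = kmeans(pts, k=2)
```

k-medoids follows the same relocation loop but represents each cluster by an actual object near its center instead of the mean.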
1.2.3 Density-based clustering
Density-based clustering algorithms have been developed to discover
clusters with arbitrary shapes. They consider clusters as dense regions of objects in
the data space that are separated by regions of low density. Some examples of
density-based algorithms include DBSCAN, which grows clusters according to a
density threshold; OPTICS, which computes an augmented cluster ordering for
automatic and interactive cluster analysis; and DENCLUE, which is based on a set
of density distribution functions.
1.2.4 Grid-based clustering
Grid-based clustering algorithms quantize the object space into a finite
number of cells that form a grid structure, on which all operations for clustering are
performed. Typical examples include STING, which explores statistical
information stored in the grid cells; WaveCluster, which clusters objects using a
wavelet transform method; and CLIQUE, which represents a grid- and density-based
approach for clustering in a high-dimensional data space.
1.2.5 Model-based clustering
Model-based clustering algorithms try to optimize the fit between the given
data and some mathematical model. They assume that the data are generated by a
finite mixture of underlying probability distributions, such as multivariate normal
distributions.
1.3 Successful algorithms for data clustering
One of the major research issues in the field of data clustering involves the
ability of a clustering algorithm to deal with entities described in terms of
heterogeneous attributes. Another research issue is the ability of a clustering
algorithm to deal with dynamically growing databases. In this direction, a brief
review of existing clustering algorithms that deal with either purely numeric
datasets or purely categorical datasets is presented below.
1.3.1 Clustering numerical attributes
BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) is
designed to cluster large datasets of n-dimensional vectors using a limited
amount of main memory. It uses a special hierarchical data structure called CF
Tree with nodes representing cluster features to accommodate summary
information about sub-clusters of points and then clusters based on data summary
instead of the original dataset. Hence it is an incremental algorithm for the
formation and maintenance of hierarchical clusters. A dense region
of points is treated as a single cluster and points in sparse regions are treated as
outliers. A single scan of the dataset yields a good clustering and one or more
additional passes can optionally be used to improve the quality further
[Zhang et al., 1996].
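The summary information BIRCH keeps per sub-cluster is a cluster feature, the triple (N, LS, SS): the number of points, their linear sum and their squared sum. Because these quantities update additively, a point can be absorbed without ever being stored. A minimal sketch (class and variable names are illustrative):

```python
class ClusterFeature:
    """BIRCH-style summary of a sub-cluster: count, linear sum, square sum."""

    def __init__(self, dim):
        self.n = 0
        self.ls = [0.0] * dim   # linear sum of the points, per dimension
        self.ss = 0.0           # sum of squared norms of the points

    def add(self, point):
        # Absorbing a point only updates the summary, never stores the point.
        self.n += 1
        for i, x in enumerate(point):
            self.ls[i] += x
        self.ss += sum(x * x for x in point)

    def centroid(self):
        return [s / self.n for s in self.ls]

cf = ClusterFeature(dim=2)
for p in [(1.0, 2.0), (3.0, 2.0)]:
    cf.add(p)
print(cf.n, cf.centroid())  # 2 [2.0, 2.0]
```

The CF Tree arranges such summaries hierarchically, so clustering proceeds on summaries rather than on the original dataset.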
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) uses a
density-based notion of clusters to discover clusters of arbitrary shape. Density-
based clustering methods typically regard clusters as dense regions of objects in
the data space that are separated by regions of low density (representing noise).
The key idea of DBSCAN is that, for each object of a cluster, the neighborhood
of a given radius has to contain at least a minimum number of data objects, i.e.
the density of the neighborhood must exceed a threshold. DBSCAN requires two
input parameters, ε and MinPts [Ester et al., 1996].
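The core condition just described, that the ε-neighborhood of an object must hold at least MinPts objects, can be sketched as follows (function names and data are illustrative):

```python
import math

def region_query(points, p, eps):
    """Indices of all points within distance eps of points[p]."""
    return [q for q in range(len(points))
            if math.dist(points[p], points[q]) <= eps]

def is_core(points, p, eps, min_pts):
    """DBSCAN core condition: eps-neighborhood holds >= min_pts objects."""
    return len(region_query(points, p, eps)) >= min_pts

data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
print(is_core(data, 0, eps=0.5, min_pts=3))  # True: dense neighborhood
print(is_core(data, 3, eps=0.5, min_pts=3))  # False: isolated point
```

Full DBSCAN grows a cluster outward from each core point by repeatedly absorbing the neighborhoods of newly reached core points; points reachable from no core point are labeled noise.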
CURE (Clustering Using REpresentatives) is more robust to outliers, and
identifies clusters having non-spherical shapes and wide variances in size. It
achieves this by representing each cluster by a certain fixed number of points that
are generated by selecting well scattered points from the cluster and then
shrinking them toward the center of the cluster by a specified fraction. Having
more than one representative point per cluster allows CURE to adjust well to the
geometry of non-spherical shapes and the shrinking helps to gradually reduce the
effects of outliers. To handle large databases, it employs a combination of
random sampling and partitioning. A random sample drawn from the data set is
first partitioned and each partition is partially clustered. The partial clusters are
then clustered while eliminating outliers in a second pass to yield the desired
clusters. This scales well for large databases without sacrificing clustering quality
[Guha et al., 1998].
1.3.2 Clustering Categorical attributes
A categorical attribute/variable has a limited number of values. Each value
usually represents a distinct label or category. Categorical (or qualitative) variables
identify categorical properties. The values of categorical attributes differ in kind,
whereas quantitative/numerical variables differ in quantity or distance.
A categorical variable is called nominal if there is no natural ordering of the
categories. Examples are gender, race, religion, or sport. When the categories may be
ordered, these are called ordinal variables. Categorical variables that judge size
(small, medium, large), attitudes (strongly disagree, disagree, neutral, agree, strongly
agree) are ordinal variables. A binary variable is considered a categorical variable
with two values.
Some of the clustering algorithms which deal with categorical variables are:
COBWEB [Fisher 1987] produces a hierarchy of classes and organizes
observations into a classification tree. Its incremental nature allows clustering of
new data to be made without having to repeat the clustering already made. It uses
a heuristic measure called category utility to guide search. COBWEB has been
discussed in detail in the literature survey chapter.
ROCK (RObust Clustering using linKs) is an adaptation of an agglomerative
hierarchical clustering algorithm, which heuristically optimizes a criterion
function defined in terms of the number of “links” between tuples [Guha et al., 1999].
Informally, the number of links between two tuples is the number of common
neighbors they have in the dataset. Starting with each tuple in its own cluster, the
algorithm repeatedly merges the two closest clusters until the required number
(say, K) of clusters remains. Since the algorithm is cubic in the number of tuples
in the dataset, a randomly drawn sample is clustered first, and the entire dataset is
then partitioned based on the clusters obtained from the sample. Beyond the fact
that the set of all “clusters” together may optimize a criterion function, the set of
tuples in each individual cluster is not characterized. The principle of ROCK lies in
maximization of the function which takes into account both maximization of
sums of links for the objects from the same cluster, and minimization of sums of
links for the objects from different clusters [Guha et al.1999].
CACTUS (Clustering Categorical Data using Summaries): The central idea
behind CACTUS is that a summary of the entire dataset is sufficient to compute
a set of “candidate” clusters which can then be validated to determine the actual
set of clusters. It is based on the idea of the common occurrences of certain
categories of different variables. If the number of co-occurrences of the
categories vkt and vlu of the k-th and l-th variables exceeds their expected
frequency by more than a user-defined threshold, the categories are strongly
connected. CACTUS consists of three phases: summarization, clustering, and
validation. In the summarization phase, we compute the summary information
from the dataset. In the clustering phase, we use the summary information to
discover a set of candidate clusters. In the validation phase, we determine the
actual set of clusters from the set of candidate clusters.
These approaches are suitable for clustering static databases only.
1.3.3 Limitations of Existing approaches
1) Existing approaches assume that all the data in the database can be fit into the
memory of a computer system so that the data can be processed. This does not
hold good for very large databases.
2) Most of the algorithms are applicable to only static datasets which may lead to
incorrect results when applied on dynamic data.
3) In a data warehouse environment, updates are done periodically, and consequently
the clustering solution derived from the warehouse has to be updated
periodically. The existing algorithms, since they can handle only static databases,
have to be re-run on the entire contents of the data warehouse.
4) It is inefficient and time-consuming to rescan the entire database each and every
time an update occurs.
1.4 Incremental Clustering
Incremental clustering is the process of updating an existing set of clusters
incrementally rather than mining them from scratch on every database update.
The conventional approach to incremental mining involves data
abstraction/summarization to compress the database incrementally and the outcome
of summarization is clustered using slightly modified versions of existing clustering
algorithms [Jain et al. 1999].
Any typical incremental clustering algorithm has the following steps:
Step 1 – Assign the first data item to a cluster and represent the cluster summary.
Step 2 - Assign a new data item to a closest cluster based on its proximity to various
cluster summaries and update the cluster summary accordingly.
Step 3 - Repeat Step 2 upon the arrival of new data items.
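The three steps above can be sketched with centroid-based summaries; the distance threshold used to open a new cluster is an illustrative knob, not a parameter taken from the thesis:

```python
import math

def incremental_cluster(stream, threshold):
    """Assign each arriving item to the nearest cluster summary, updating
    it in place; open a new cluster when nothing is close enough.

    Each summary is [count, per-dimension sums]; only summaries are kept
    in memory, never the full data matrix.
    """
    summaries = []
    for item in stream:
        best, best_d = None, math.inf
        for s in summaries:
            centroid = [x / s[0] for x in s[1]]
            d = math.dist(centroid, item)
            if d < best_d:
                best, best_d = s, d
        if best is None or best_d > threshold:
            summaries.append([1, list(item)])   # Step 1: start a new cluster
        else:
            best[0] += 1                        # Step 2: update the summary
            best[1] = [a + b for a, b in zip(best[1], item)]
    return summaries

stream = [(0.0, 0.0), (0.2, 0.1), (9.0, 9.0), (9.1, 8.9)]
print(len(incremental_cluster(stream, threshold=1.0)))  # 2 clusters
```

Note that each item is examined once against the current summaries, which is what keeps both memory usage and processing time low.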
It may be observed that incremental clustering is advantageous for clustering
large static databases as it does not require the storage of the entire data matrix in the
memory. As a result, incremental algorithms require much less space. Incremental
clustering is also preferred because main-memory usage is low and mutual
distances between data objects need not be kept in memory. In
addition to this, the incremental clustering algorithms are scalable in terms of size of
the data objects and number of attributes [Dan and Namita 2004]. The time
requirement of these algorithms is less as they avoid redundant processing of the
whole dataset by simply updating the cluster summaries upon arrival of new data.
1.5 Motivation
The main aim of incremental clustering algorithms is to deal with
dynamically growing datasets. An incremental clustering algorithm has to maintain
the cluster prototype/summary as long as the same concept sustains and also refresh
or modify the cluster prototype/summary when there is a drift in the concept.
Because of these capabilities to maintain as well as refresh the cluster
prototype/summary according to the need, it is found that incremental clustering is
more suitable for modeling the growth patterns in dynamically changing
environments. In general, the growth pattern of a cluster solution is expected to be
uniform over all the clusters and each cluster contains uniformly distributed data
entities.
For clusters with uniformly distributed entities, the usual distance measures
like Euclidean distance hold good for deciding the membership of an incoming data
point into an existing cluster. But there exist applications where clusters have non-
uniformly distributed data entities. Specifically, while modeling the growth patterns
of neighborhoods in an urban area the shape of the cluster must be non-globular as a
neighborhood consists of a smaller densely populated area (down town) and a larger
sparsely populated area (posh localities). As the city grows new colonies are
expected to develop adjacent to the sparser side of the city while vertical growth or
infiltration into sparser area is expected on the denser side to keep pace with
additional population in the city. As a consequence of this, the growth of the city
appears to be bounded on the denser side and open towards the sparser side. None of
the existing incremental clustering algorithms can simulate the growth pattern of
such situations. Hence this research work is motivated to deal with such situations.
1.6 Problem Statement
This research work aims to develop new algorithms and proximity metrics for
the formation and maintenance of incremental clusters to simulate the growth pattern
of clusters with non-uniformly distributed entities. Since normal Euclidean Distance
metric will not discriminate the data points on the denser side from those on the
sparser side, the author has proposed a new proximity metric, namely the Inverse Proximity
Estimate (IPE) which is capable of discriminating the data points as per the
requirement of simulating the growth pattern of a non-uniformly distributed cluster.
According to Charikar, the incremental clustering problem can be defined as
follows: “For an update sequence of n points in M, maintain a collection of k
clusters such that as each input point is presented, either it is assigned to one of the
current k clusters or it starts off a new cluster while two existing clusters are merged
into one” [Charikar et al.1997].
The algorithms CFICA and M-CFICA, proposed by the author to deal with
numeric datasets and heterogeneous datasets respectively for the formation of
partitional clusters, handle the three primary tasks mentioned above, namely
assigning a data point to a cluster, starting a new cluster and merging existing
clusters, in addition to refreshing the cluster solution upon concept drift in the
specific context of non-uniformly distributed clusters.
1.7 Original Contributions of the Research Work
The main contributions of this research are as follows:
Recognized that non-uniformly distributed clusters are also significant in
applications like modeling the growth pattern of neighborhoods in urban areas.
Investigated the suitability of conventional algorithms for formation of clusters
with non-uniformly distributed entities.
Developed a new proximity metric called ‘Inverse Proximity Estimate’ (IPE) to
determine the proximity of an entity to a cluster with non-uniformly distributed
entities.
Proposed a methodology called CFICA to cluster dynamic datasets with entities
described in terms of only numerical attributes.
Designed an information theoretic approach to estimate the Normalized
Dissimilarity (DS) between two entities described in terms of categorical
attributes, and combined it with the Normalized Euclidean Distance (NED) to
devise an estimate referred to as Mixed Distance (MD) to deal with all types of
attributes.
Developed a new approach to incremental clustering called M-CFICA to deal
with entities described in terms of heterogeneous attributes.
1.8 Organization of the Thesis
The complete work presented in the subsequent chapters is as outlined below:
Chapter 2 provides the literature survey done for understanding the research
problem. Various incremental clustering algorithms for mining dynamic and large
datasets, numerical data, categorical data, mixed data and for other applications have
been presented.
Chapter 3 proposes a new methodology called CFICA to cluster dynamic datasets
incrementally and motivates the need for a novel proximity metric called the
Inverse Proximity Estimate (IPE). The details of the proposed algorithm and
metric are described in this chapter.
Chapter 4 proposes another new methodology called M-CFICA to deal with mixed
data. An information theoretic approach is designed to estimate the Normalized
Dissimilarity (DS) between two entities. This is combined with the Normalized
Euclidean Distance (NED) to devise an estimate referred to as Mixed Distance (MD)
to deal with all types of attributes. The mixed distance measure estimates the
distance between two data points in terms of their numerical and categorical
attributes separately.
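Since the thesis's exact MD formula is not reproduced in this chapter, the following is only a hypothetical sketch of such a combination: a normalized Euclidean term over the numeric attributes added to a simple matching-based dissimilarity over the categorical ones. All names and the range-based normalization are assumptions, not the author's definitions:

```python
import math

def mixed_distance(a, b, num_idx, cat_idx, ranges):
    """Hypothetical mixed distance: normalized Euclidean over numeric
    attributes plus matching-based dissimilarity over categorical ones.
    `ranges` maps each numeric index to its (min, max) for normalization."""
    ned = math.sqrt(sum(
        ((a[i] - b[i]) / (ranges[i][1] - ranges[i][0])) ** 2
        for i in num_idx))
    mismatches = sum(a[i] != b[i] for i in cat_idx)
    ds = mismatches / len(cat_idx)
    return ned + ds

x = (170.0, "urban", "yes")
y = (150.0, "rural", "yes")
d = mixed_distance(x, y, num_idx=[0], cat_idx=[1, 2],
                   ranges={0: (100.0, 200.0)})
print(round(d, 2))  # 0.7: 0.2 from the numeric part + 0.5 categorical
```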
Chapter 5 presents the experimental analysis and results. In order to analyze the
accuracy of our proposed approach, real-world datasets from the UCI machine learning
repository as well as hypothetical datasets have been used.
Chapter 6 concludes the work with a brief summary of the proposed algorithm and
the possible extensions of the work.