7/30/2019 A Comparative Study of Single-step and Multi-step Data Mining Tools
difficult to extract knowledge from the given dataset. On the other hand, in a multi-step tool the data mining tasks
of clustering, classification and visualization are unified, so the tool appears to the user as a single step and
provides the knowledge as output.
The rest of the paper is organized as follows: section 2 deals with the data mining tools, section 3 compares the
tools, section 4 discusses the results, and section 5 draws the conclusion.
2. Data Mining Tools
In this section we discuss the single-step data mining tools, namely ODM and MS SQL Server, and a multi-step data
mining tool called UDMTool.
2.1 Oracle Data Mining (ODM)
The architecture of ODM is based on the Cross Industry Standard Process for Data Mining (CRISP-DM) model,
which was founded in 1997 and funded by the European Commission. The main idea was to define an industry
standard for data mining [9]. The CRISP-DM process is shown below:
Business Understanding → Data Understanding → Data Preparation → Modeling → Evaluation → Deployment
There are six steps in the CRISP-DM process model. ODM implements and supports the last three steps of the
CRISP-DM model. The main components of ODM are shown below:
Data Source → Modeling → Evaluation and Deployment
Data mining is an iterative process that continues after a solution is deployed. The lessons learned
during the process can trigger new business questions, and any change in the data can require new models.
Subsequent data mining processes benefit from the experience of previous ones. The remaining steps are supported
by a combination of ODM and the Oracle database, especially in the context of an Oracle data warehouse. The
facilities of the Oracle database can be very useful during data understanding and data preparation. ODM
integrates data mining with the Oracle database and exposes it through several interfaces, namely the Java
interface, the PL/SQL interface, automated data mining, the data mining SQL functions and the graphical
interfaces. ODM supports the export and import of data mining models in native format between Oracle databases or
schemas, which provides a way to move models [9][10][13]. The workflow of ODM is illustrated in figure 1.
Figure 1. The Workflow of the ODM
Figure 1 depicts the workflow of ODM. The data source is the dataset, explore data is a view of the dataset,
and the model selection step chooses a data mining model such as clustering, classification, association or feature
extraction. These are the components required for mining in ODM. The next phase applies the model to
the dataset, and the final phase stores the results in a separate table for further processing. The user can build a
model with only two components, the data source and the model; the rest of the components are there to assist the user.
2.2 MS SQL Server
MS SQL Server also uses the Cross Industry Standard Process for Data Mining (CRISP-DM) model:
Business Understanding → Data Understanding → Data Preparation → Modeling → Evaluation → Deployment
Data mining is a process that involves the interaction of multiple components. In MS SQL Server one can access
sources of data in a SQL Server database, or in any other data source, to use for training, testing or prediction;
define the data mining structures and models by using Business Intelligence Development Studio or Visual Studio
2008; and manage the data mining objects and create predictions and queries by using SQL Server
Management Studio. Once the solution is complete, it is deployed to an instance of Analysis Services. The main
components of MS SQL Server are shown below:
JOURNAL OF COMPUTING, VOLUME 4, ISSUE 10, OCTOBER 2012, ISSN (Online) 2151-9617
https://sites.google.com/site/journalofcomputing
WWW.JOURNALOFCOMPUTING.ORG 27
2012 Journal of Computing Press, NY, USA, ISSN 2151-9617
Data Source → Data Mining Structure → Data Mining Models → Deployment
In MS SQL Server, data mining can be done quickly and easily on relational data tables, or on any other data source
that has been defined as an Analysis Services data source view. MS SQL Server 2008 Analysis Services also
provides the ability to separate the data into training and testing datasets. A data mining structure is a logical data
structure that defines the data domain from which mining models are built. A single mining structure can support
multiple mining models that share the same domain. The data mining structure can also be partitioned into a training
and a test dataset, and this partitioning can be done automatically when the structure is defined. A data
mining model represents a combination of data, a data mining algorithm, and a collection of parameter and filter
settings that affect the data used and how the data is processed. The ultimate goal of data mining development is to
create a model that can be used by end users [12][14].
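The training/testing partition described above can be illustrated outside Analysis Services. The following is a minimal Python sketch and not the SQL Server implementation; the function name `holdout_split`, the 30% test fraction and the fixed seed are assumptions chosen for illustration.

```python
import random

def holdout_split(rows, test_fraction=0.3, seed=42):
    """Shuffle a copy of rows and split it into (training, testing) lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)      # seeded, so the split is repeatable
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical usage with 233 record identifiers.
records = list(range(233))
train, test = holdout_split(records)
```

Every record ends up in exactly one of the two lists, so a model built on `train` is never evaluated on rows it has already seen.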
2.3 The Unified Data Mining Tool (UDMTool)
The Unified Data Mining Tool (UDMTool) is a next-generation solution based on the Unified Data Mining Theory
(UDMT), a unified way of architecting and building software solutions by integrating different data mining tasks. The
foundation of the UDMTool is that knowledge can only be obtained if the data mining processes of
clustering, classification and visualization are unified, i.e. knowledge can be extracted from a given dataset only
after it has passed through all of these processes. This is illustrated in equation (1).
$$\text{Knowledge} = \text{Clustering} \cup \text{Classification} \cup \text{Visualization} \qquad (1)$$
It can be written as in equation (2).
$$K = A \cup B \cup C \qquad (2)$$
where A is the clustering, B is the classification, C is the visualization and K is the knowledge.
The architecture of the UDMTool is based on the unified data mining process (UDMP) as illustrated in figure 2.
Figure 2. The Unified Data Mining Process
The first three processes in figure 2 are data gathering, data cleansing and dataset preparation. The next
process unifies the clustering, classification and visualization processes of data mining into the unified data mining
process (UDMP), whose output is the knowledge. The user evaluates and interprets the
knowledge according to his or her business rules. The dataset is the only required input, and the knowledge is produced
as the final output of the UDMP. In contrast to ad hoc data mining models, the UDMP selects the appropriate data mining
algorithms automatically, depending on the nature and the value of the given dataset.
Figure 3 depicts the architecture of the UDMTool.
Figure 3. The Architecture of the UDMTool
The UDMTool is a multiagent system (MAS). The dataset is the required input; there are many types of datasets,
such as numeric, categorical, multimedia and text. The first agent takes the dataset and computes the value of the
Akaike Information Criterion (AIC), a model selection criterion; the second agent creates the appropriate vertical
partitions of the dataset; and the third agent computes the logarithm of the complexities O of the data mining
algorithms deployed in the UDMTool. The fourth agent passes the vertical partitions of the dataset to the
UDMP, which is itself a MAS with one agent for clustering, a second agent for classification and a third
agent for visualization. These agents are cascaded, i.e. the output of the first agent is the input of the second agent
and the output of the second agent is the input of the third agent. The appropriate data mining algorithms for clustering,
classification and visualization are selected through the AIC value of the given dataset; the process is completed
by an agent which maps the value of AIC to the logarithmic value of the complexities O of the data mining
algorithms. The function of the UDMTool is demonstrated in figure 4.
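The cascading of agents described above, where each agent consumes the output of the previous one, can be sketched as a simple function pipeline. This is an illustrative Python sketch only: the three toy agents and the helper `cascade` are hypothetical stand-ins and do not reproduce the Java agents of the UDMTool.

```python
from functools import reduce

def cascade(agents, dataset):
    """Feed the dataset to the first agent and each output to the next agent."""
    return reduce(lambda data, agent: agent(data), agents, dataset)

# Toy stand-ins for the three agents (hypothetical, for illustration only).
def clustering_agent(values):
    # Split the values into two groups around their mean.
    mean = sum(values) / len(values)
    return {"low": [v for v in values if v <= mean],
            "high": [v for v in values if v > mean]}

def classification_agent(clusters):
    # Describe each non-empty cluster by a simple range rule.
    return {name: (min(vs), max(vs)) for name, vs in clusters.items() if vs}

def visualization_agent(rules):
    # Render the rules as text lines instead of a 2D graph.
    return [f"{name}: {lo} <= x <= {hi}" for name, (lo, hi) in sorted(rules.items())]

knowledge = cascade(
    [clustering_agent, classification_agent, visualization_agent],
    [1, 2, 2, 9, 10],
)
```

The order of the list passed to `cascade` fixes the order of the processes, which mirrors the clustering, classification, visualization sequence of the UDMP.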
Figure 4. The Function of the UDMTool
A well-prepared dataset is the input of this framework. First, an intelligent agent computes the value of the model
selection criterion AIC, which is used to select the appropriate data mining algorithm. A MAS called the UDMP is based
on the UDMT. Finally, the knowledge is extracted, which is either accepted or rejected. The relationship between the
dataset and the selection criterion is one-to-one, i.e. one dataset has one value for model selection, and the
relationship between the dataset and the vertical partitions is one-to-many, i.e. more than one partition is created
for one dataset. The relationship between the selection criterion and the UDMP is one-to-one, i.e. one value of the
selection criterion gives one data mining algorithm, and finally the relationship between the vertical partitions and
the UDMP is many-to-many, i.e. many partitioned datasets are inputs for the UDMP and only one result is produced as
knowledge.
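For reference, the standard definition of the criterion is AIC = 2k - 2 ln(L), where k is the number of estimated parameters and L is the maximised likelihood. The text does not state which likelihood model the agent uses, so the Python sketch below assumes a one-dimensional Gaussian purely for illustration; the function names are hypothetical.

```python
import math

def aic(log_likelihood, n_params):
    """Akaike Information Criterion: AIC = 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood

def gaussian_log_likelihood(values):
    """Maximised log-likelihood of a one-dimensional Gaussian fitted to values."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n  # maximum-likelihood estimate
    if variance == 0:
        raise ValueError("variance is zero; the Gaussian likelihood is degenerate")
    return -0.5 * n * (math.log(2 * math.pi * variance) + 1)

sample = [1.0, 2.0, 3.0, 4.0, 5.0]  # hypothetical attribute values
score = aic(gaussian_log_likelihood(sample), n_params=2)  # k = 2: mean and variance
```

Because one dataset yields one AIC value, the one-to-one relationship between dataset and selection criterion described above holds by construction in this sketch.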
3. A Comparison of ODM, MS SQL Server and UDMTool
A comparison is drawn between ODM, MS SQL Server and UDMTool in table 1.
Table 1. A Comparison of ODM, MS SQL Server and UDMTool
| ODM | MS SQL Server | UDMTool |
|---|---|---|
| It is not a magic wand. The user has to manually select an appropriate data mining algorithm from the available pool, and if the required results are not obtained from the selected algorithm, another one has to be chosen. In this suite one algorithm serves one data mining task, e.g. k-means for clustering, but the produced clusters only present groups of the data; they are not knowledge and serve no purpose to the user. In order to extract a feature or pattern from the given dataset, one has to combine or unify different algorithms manually, one by one, and only then are the desired results obtained. | It is not a magic wand. The user has to combine the different data mining algorithms provided by MS on an ad hoc basis in order to find the solution to the problem. MS SQL Server does not provide any facility which shows that a given combination of algorithms will produce better results for the problem. It provides a facility to view the cluster profiles, which helps the user to select the cluster for further processing. | It is a magic wand. The tool is based on the Unified Data Mining Theory (UDMT). There is no need to select any data mining algorithm; the tool automatically selects suitable algorithms according to the nature of the data and produces the knowledge in the form of 2D graphs. The processes for the extraction of knowledge from the given datasets are unified, which makes it easier for the user to produce the required results. |
| There is no need to prepare a dataset for mining. It supports already created databases. It also provides a training facility for a dataset. | There is no need to prepare a dataset for mining. It supports already created databases. It also provides a training facility for a dataset. | The user has to prepare the dataset in the form of a text or data file. The tool does not support any databases. |
| The Java Implementation Interface supports only numeric datasets, and the DBMS_DATA_MINING interface supports categorical and numeric data. | The suite of MS algorithms supports numeric and categorical datasets. | The tool supports only numeric datasets because all the programs are implemented in Java. |
| The user has to set parameters for each algorithm in order to produce useful patterns from the dataset. If no parameter is set, the algorithm automatically takes the default values, i.e. the algorithms are not optimized according to the requirements of the given dataset. | The user has to set parameters for each algorithm in order to produce useful patterns from the dataset. If no parameter is set, the algorithm automatically takes the default values, i.e. the algorithms are not optimized according to the requirements of the given dataset. The number of algorithm parameters in MS SQL Server is greater than in ODM. | The algorithms are optimized in this tool. Therefore, there is no need to set default parameters. |
| Supports only a limited number of algorithms for each of the data mining tasks such as clustering and classification. ODM does not provide visualization of the data; for this purpose the user has to import/export the results to other visualization tools such as MS Excel. | Supports only a limited number of algorithms for each of the data mining tasks such as clustering and classification. The results of MS SQL Server can be opened in MS Excel using add-ins, which can be regarded as a separate data visualization facility. | There is no such limit in the tool; the user can add further algorithms as required. The tool directly provides the visualization of the dataset, which helps the user to draw conclusions and extract knowledge. |
| It provides support for model evaluation using BIC, export and import, comparison and cross validation only in the Java Implementation Interface. Some of the mentioned facilities are not supported by the other implementations of ODM. | In MS SQL Server, the accuracy of mining models is tested through the Mining Accuracy Chart, which plots a Lift Chart showing the performance of different models under different algorithms. | It supports model evaluation and selection using AIC only. If the user wants to import or export any result, copy/paste can be used. |
| ODM implements data mining through Java objects in function settings and algorithm settings. | MS SQL Server uses Data Mining Extensions (DMX), which extend SQL commands. | UDMTool implements data mining algorithms through intelligent agents developed in Java. |
| A graphical user interface is provided by ODM. | An IDE is provided by MS SQL Server. Mining Model Wizards help the user to choose the data source and the different MS algorithms provided, and in this way the system becomes user friendly. | A graphical user interface is provided by UDMTool. |
| There is no such limit in ODM, but if the user is using the Java language there may be some constraints. | There is no such limit in MS SQL Server. | The UDMTool supports: number of parameters = 23; number of attributes = 211; sample size = 12000. |
It is clear from table 1 that in ODM and MS SQL Server the selection of algorithms is made on an ad hoc basis.
Although both data mining suites provide statistical information about the dataset, this information is not
sufficient to extract knowledge from the given dataset. The data mining processes of clustering, classification and
visualization are carried out individually in ODM and MS SQL Server, and there is no relation between these
processes; therefore, it is difficult to extract the knowledge. On the other hand, the proposed UDMTool
unifies all the required data mining processes to extract the knowledge, and the selection of the data mining
algorithm(s) in each data mining process is made through the value of the model selection criterion AIC and the
complexities O of the data mining algorithm(s).
4. Results and Discussion
MS SQL Server, ODM and the UDMTool are tested on a variety of datasets: Diabetes, a medical dataset;
Breast Cancer, a medical dataset; Iris, an agriculture dataset; Sales, an accounts dataset; and Cars, a vehicle
dataset. We present the results for Breast Cancer. The attributes of the Breast Cancer dataset are:
Clump Thickness (CT), Uniformity of Cell Size (UCS), Uniformity of Cell Shape (UCSh), Marginal Adhesion
(MAdh), Single Epithelial Cell Size (SECS), Bare Nuclei (BNu), Bland Chromatin (BCh), Normal Nucleoli (NNu),
Mitoses and Class (benign, malignant) [19].
Case 1: The Results of MS SQL Server
1. The Result of MS Clustering Algorithm
Figure 5. The Diagram of the Clusters of the Breastcancer dataset
We apply the MS clustering data mining algorithm, which is similar to the k-means clustering algorithm. Figure 5
shows the 10 clusters of the given dataset without the predictable variable. The solid lines show a strong
relation between clusters and the thin lines show a weak relation. As is clear from figure 5,
there is a strong relation among cluster 1, clusters 7 and 3, and the other clusters. On the other hand, there is a weak
relation between clusters 1 and 10, clusters 2 and 9, clusters 2 and 6, and clusters 5 and 6. From figure 5 one can only
visualize the structure of the clusters and their relations; it is still difficult to produce useful information. The
population, i.e. the number of records in each cluster, becomes visible when the cursor is placed on the cluster.
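Since the MS clustering algorithm is described above as similar to k-means, a minimal k-means sketch may help to show what the number of clusters controls. This is plain Lloyd's algorithm on a single numeric attribute, written in Python for illustration; it is not the MS implementation, and the function name, the deterministic initialisation and the sample values are assumptions.

```python
def kmeans_1d(values, k, iterations=20):
    """Lloyd's algorithm on one numeric attribute; returns (centroids, labels)."""
    if not 1 <= k <= len(values):
        raise ValueError("k must be between 1 and the number of values")
    # Deterministic start: spread the initial centroids over the sorted values.
    ordered = sorted(values)
    centroids = [ordered[(i * (len(ordered) - 1)) // max(k - 1, 1)] for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: attach each value to its nearest centroid.
        labels = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return centroids, labels

# Hypothetical attribute values; k is chosen by the caller rather than fixed at 10.
centroids, labels = kmeans_1d([1, 1, 2, 2, 8, 9, 10, 10], k=2)
```

Here the number of clusters is an explicit argument, so the effect of a default such as 10 versus a value suited to the data can be compared directly.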
The MS clustering algorithm produces 10 clusters by default. If the user wants a different number, this can only
be done by programming the MS clustering algorithm; the wizards offer no option for selecting the
number of clusters. Why the algorithm produces 10 clusters for every dataset remains an open issue with the MS
clustering algorithm. The algorithm uses either the horizontal partition or the vertical partition. All clustering data
mining algorithms are unsupervised machine learning algorithms; therefore, there is no need to specify the predicted or
target variable in the dataset. The next tables show the extra features available in MS SQL Server 2005.
Table 2. Clusters Profile
| Variables | States | Population (All), Size: 233 | Cluster 1, Size: 84 | Cluster 2, Size: 36 | Cluster 3, Size: 27 | Cluster 4, Size: 24 | Cluster 7, Size: 18 | Cluster 5, Size: 14 | Cluster 8, Size: 12 | Cluster 6, Size: 11 | Cluster 9, Size: 5 | Cluster 10, Size: 2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| B Ch | 3.27 +/- 2.37 | 3.27 +/- 2.37 | 1.90 +/- 0.79 | 5.76 +/- 2.15 | 1.78 +/- 0.80 | 6.55 +/- 2.31 | 2.19 +/- 1.01 | 5.03 +/- 2.01 | 1.78 +/- 0.84 | 3.31 +/- 2.34 | 2.41 +/- 0.80 | 4.00 +/- 1.41 |
| B Nu | 3.22 +/- 3.40 | 3.22 +/- 3.40 | 1.04 +/- 0.19 | 6.53 +/- 3.16 | 1.00 | 7.52 +/- 3.33 | 1.15 +/- 0.37 | 8.62 +/- 2.51 | 2.62 +/- 1.35 | 3.23 +/- 2.17 | 1.16 +/- 0.39 | 2.00 |
| Class | benign, malignant, missing | benign: 164; malignant: 69; missing: 0 | benign: 1.000; malignant: 0.000; missing: 0.000 | benign: 0.124; malignant: 0.876; missing: 0.000 | benign: 1.000; malignant: 0.000; missing: 0.000 | benign: 0.000; malignant: 1.000; missing: 0.000 | benign: 1.000; malignant: 0.000; missing: 0.000 | benign: 0.069; malignant: 0.931; missing: 0.000 | benign: 1.000; malignant: 0.000; missing: 0.000 | benign: 0.990; malignant: 0.010; missing: 0.000 | benign: 1.000; malignant: 0.000; missing: 0.000 | benign: 1.000; malignant: 0.000; missing: 0.000 |
| CT | 4.15 +/- 2.75 | 4.15 +/- 2.75 | 2.39 +/- 1.40 | 6.09 +/- 2.33 | 3.37 +/- 1.66 | 7.41 +/- 2.32 | 2.88 +/- 1.70 | 8.85 +/- 1.26 | 2.65 +/- 1.65 | 3.49 +/- 1.78 | 3.90 +/- 1.13 | 3.00 +/- 2.83 |
| M Adh | 2.63 +/- 2.65 | 2.63 +/- 2.65 | 1.00 | 4.88 +/- 2.49 | 1.67 +/- 0.83 | 6.47 +/- 3.02 | 1.00 +/- 0.02 | 4.76 +/- 3.21 | 2.98 +/- 2.90 | 1.26 +/- 0.63 | 2.07 +/- 1.22 | 2.00 |
| Mitoses | 1.52 +/- 1.61 | 1.52 +/- 1.61 | 1.00 | 1.00 | 1.00 | 4.72 +/- 3.02 | 1.00 | 1.93 +/- 0.61 | 1.60 +/- 1.89 | 1.00 | 1.00 | 2.00 |
| N Nuc | 2.65 +/- 2.83 | 2.65 +/- 2.83 | 1.00 | 6.04 +/- 3.24 | 1.13 +/- 0.34 | 6.46 +/- 3.11 | 1.74 +/- 0.77 | 4.69 +/- 2.24 | 1.00 | 1.00 | 1.87 +/- 0.36 | 2.50 +/- 0.71 |
| SECS | 3.03 +/- 2.08 | 3.03 +/- 2.08 | 1.93 +/- 0.37 | 4.89 +/- 2.05 | 2.00 | 6.70 +/- 2.60 | 2.00 | 3.22 +/- 1.05 | 2.00 +/- 1.03 | 2.29 +/- 0.82 | 2.64 +/- 0.94 | 2.00 |
| UC Sh | 2.91 +/- 2.81 | 2.91 +/- 2.81 | 1.00 | 6.14 +/- 2.27 | 1.93 +/- 0.92 | 7.66 +/- 2.54 | 1.40 +/- 0.74 | 4.19 +/- 1.75 | 1.10 +/- 0.32 | 2.47 +/- 1.63 | 1.19 +/- 0.43 | 1.50 +/- 0.71 |
| UCS | 2.81 +/- 2.86 | 2.81 +/- 2.86 | 1.00 | 5.89 +/- 2.52 | 1.11 +/- 0.32 | 8.02 +/- 2.31 | 2.01 +/- 0.86 | 3.94 +/- 1.64 | 1.00 | 1.88 +/- 1.08 | 1.31 +/- 0.50 | 2.50 +/- 0.71 |
Table 2 shows the profile of each cluster across all the attributes of the given dataset. The table also shows the size
of each cluster, i.e. the number of records per cluster. The attribute Class has only two values, benign and
malignant, and all the other attributes have integer values in the given dataset, but the MS clustering algorithm
shows two values for each attribute, which may confuse the user. The value of each attribute varies from
cluster to cluster. The interpretation of table 2 is somewhat difficult.
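One plausible reading of the two values shown per attribute is a mean followed by a standard deviation for each cluster. Under that assumption, a cluster profile can be sketched in Python as below; the function name and the sample values are hypothetical, and the use of the population standard deviation (rather than the sample one) is also an assumption.

```python
from collections import defaultdict
from statistics import mean, pstdev

def cluster_profile(values, labels):
    """Return {cluster: (size, mean, standard deviation)} for one attribute."""
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[label].append(value)
    return {label: (len(vs), mean(vs), pstdev(vs)) for label, vs in groups.items()}

# Hypothetical attribute values and cluster assignments.
profile = cluster_profile([1, 1, 2, 8, 10], ["c1", "c1", "c1", "c2", "c2"])
for label, (size, avg, sd) in sorted(profile.items()):
    print(f"{label} (size {size}): {avg:.2f} +/- {sd:.2f}")
```

Read this way, an entry without a second value corresponds to a cluster whose members all share the same value for that attribute, so the spread is zero.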
Table 3. Clusters Characterizing
| Variables | Values | Probability |
|---|---|---|
| Class | benign | 70.386% |
| Class | malignant | 29.614% |
| B Nu | 3.2 - 5.5 | 24.980% |
| B Ch | 3.3 - 4.9 | 24.980% |
| UC Sh | 2.9 - 4.8 | 24.980% |
| SECS | 1.6 - 3.0 | 24.980% |
| CT | 4.2 - 6.0 | 24.980% |
| CT | 2.3 - 4.2 | 24.980% |
| N Nuc | 2.7 - 4.6 | 24.980% |
| B Ch | 1.7 - 3.3 | 24.980% |
| UCS | 2.8 - 4.7 | 24.980% |
Table 3 shows the cluster characteristics: the attribute (variable), its value range in the clusters and the probability
of the variable. The value and the probability of the attributes SECS, MAdh, UCSh, UCS, CT and BNu are high
in some clusters compared to the rest of the attributes.
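The two Class probabilities in table 3 can be checked against the population counts given in the Class row of table 2 (164 benign and 69 malignant out of 233 records). The short Python check below is only a worked calculation; the variable names are illustrative.

```python
# Counts taken from the Class row of table 2.
benign, malignant = 164, 69
total = benign + malignant  # 233 records

p_benign = 100 * benign / total
p_malignant = 100 * malignant / total
print(f"benign: {p_benign:.3f}%  malignant: {p_malignant:.3f}%")
# prints "benign: 70.386%  malignant: 29.614%"
```

The result matches the reported 70.386% and 29.614%, so these two rows are simply the class proportions of the whole population.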
Table 4. Cluster Discrimination
| Variables | Values | Score (Favors Cluster 1 / Favors Complement of Cluster 1) |
|---|---|---|
| UCS | 1.0 | 0.000 |
| UC Sh | 1.0 | 0.069 |
| N Nuc | 1.0 | 0.288 |
| M Adh | 1.0 | 0.324 |
| Mitoses | 1.0 | 3.879 |
| B Nu | 1.0 - 1.5 | 28.032 |
| UC Sh | 1.0 - 10.0 | 51.050 |
| UCS | 1.0 - 10.0 | 52.941 |
| M Adh | 1.0 - 10.0 | 53.583 |
| N Nuc | 1.0 - 10.0 | 54.846 |
| B Nu | 1.5 - 10.0 | 59.424 |
| SECS | 1.3 - 2.5 | 61.108 |
| Mitoses | 1.0 - 10.0 | 64.856 |
| SECS | 2.5 - 10.0 | 76.031 |
| Class | benign | 79.337 |
| Class | malignant | 79.337 |
| B Ch | 1.0 - 2.8 | 80.075 |
| B Ch | 2.8 - 10.0 | 85.118 |
| CT | 1.0 - 3.3 | 90.353 |
| CT | 3.3 - 10.0 | 91.269 |
| SECS | 1.0 - 1.3 | 96.887 |
Table 4 shows the cluster discrimination; only the results for cluster 1 are shown. For each variable, the table
shows whether it favors cluster 1 or the complement of cluster 1. The results for the remaining clusters can be
displayed similarly. These are the three options available after applying the MS clustering algorithm.
2. The Results of MS Decision Tree Algorithm
Figure 6. The Decision Tree of the Breastcancer dataset
We apply the MS Decision Tree algorithm, which is an ID3 data mining algorithm, to the Breastcancer dataset.
Figure 6 depicts the structure of the decision tree. In our proposed UDMTool we produce rules instead of
a tree. In MS SQL Server, in order to get the decision rules, one has to apply MS Association Rules.
3. The Results of MS Association Rules
Table 5. The Association Rules
| Support | Size | Itemset |
|---|---|---|
| 196 | 1 | Mitoses < 1.1818008626 |
| 164 | 1 | Class = benign |
| 160 | 2 | Class = benign, Mitoses < 1.1818008626 |
| 154 | 1 | SECS < 2.352245034 |
| 148 | 2 | SECS < 2.352245034, Mitoses < 1.1818008626 |
| 147 | 1 | N Nuc < 1.4350025798 |
| 145 | 2 | SECS < 2.352245034, Class = benign |
| 144 | 2 | N Nuc < 1.4350025798, Mitoses < 1.1818008626 |
| 142 | 3 | SECS < 2.352245034, Class = benign, Mitoses < 1.1818008626 |
| 141 | 2 | N Nuc < 1.4350025798, Class = benign |
| 140 | 3 | N Nuc < 1.4350025798, Class = benign, Mitoses < 1.1818008626 |
| 140 | 1 | UCS < 1.6782988738 |
Table 5 shows the association rules of the Breastcancer dataset. We show only the itemsets with the highest support
values; the MS Association Rules algorithm otherwise produces a long list, which can leave the user unsure how
to select a specific value and get the required results. It is important to note that the MS Association algorithm has
to be applied in order to get the rules, because the decision tree in MS SQL Server does not produce decision rules.
The proposed UDMTool uses the C4.5 data mining algorithm for classification and produces only a few rules in
if-then-else form, which make it easy for the user to take a decision.
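To show how the support counts in table 5 can be turned into a rule, the Python sketch below derives the confidence of the rule "Mitoses < 1.18 implies Class = benign" using the standard definition confidence(A => B) = support(A and B) / support(A). The paper reports only support counts, so the confidence value is a derived illustration; the thresholds are abbreviated and the record total of 233 is taken from table 2.

```python
# Support counts copied from table 5; 233 is the population size in table 2.
N_RECORDS = 233
support = {
    frozenset({"Mitoses < 1.18"}): 196,
    frozenset({"Class = benign"}): 164,
    frozenset({"Class = benign", "Mitoses < 1.18"}): 160,
}

def confidence(antecedent, consequent):
    """confidence(A => B) = support(A and B) / support(A)."""
    both = frozenset(antecedent) | frozenset(consequent)
    return support[both] / support[frozenset(antecedent)]

rule_conf = confidence({"Mitoses < 1.18"}, {"Class = benign"})
relative_support = support[frozenset({"Class = benign", "Mitoses < 1.18"})] / N_RECORDS
print(f"confidence = {rule_conf:.3f}, relative support = {relative_support:.3f}")
# prints "confidence = 0.816, relative support = 0.687"
```

Ranking candidate rules by such a measure is one way to shorten the long itemset list into a handful of usable rules.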
Case 2: The Results of ODM 11g2
In ODM, unlike MS SQL Server, there is no option to save the results of each data mining process; therefore, the
results are saved as screen captures. Figure 7 depicts the workflow of the clustering model; the other data
mining models such as classification, association and feature selection are applied similarly.
Figure 7. The Workflow of the Clustering Model
ODM provides the user with a visual workflow for each model. Figure 7 shows the workflow of the
clustering model. The data source, which is an Oracle table or a dataset, is a required component. The other
component is explore data, which is basically a view of the dataset; we consider it an optional component. The last
component is a model, which is one of the data mining processes, such as clustering, classification, association or
feature selection, from the list provided by ODM. The user can apply only one model at a time, which is why we
refer to ODM as a single-step tool. A link is created between the data source and explore data, and between the data
source and the model. Finally, the model is built: ODM applies all the available data mining algorithms in the model,
and the user can compare the results of all the algorithms and also view the results of a particular
algorithm. The user can also store the results in a separate table.
1. The Enhanced k-means Clustering Algorithm
Figure 8. The Results of the K-means Clustering Model
We apply the enhanced k-means clustering algorithm of ODM. The algorithm uses the top-down, or divisive,
technique of hierarchical clustering. ODM offers an option to set the required parameters of the
algorithm; if the parameters are not set, ODM uses the defaults. We test the dataset with the default
parameters. ODM creates the clusters in a tree structure, as shown in figure 8. ODM also characterizes
each cluster, giving the centroids and the cluster rules separately, which gives the
user a better understanding of the cluster. On this basis we assume that ODM is unifying the clustering and
classification processes.
Figure 9. The Results of the K-means Clustering Model with Centroid
Figure 9 shows the values of the centroids of a cluster. The centroid values play no role in knowledge
extraction from a dataset.
Figure 10. The Results of the K-means Clustering Model with Cluster Rules
Figure 10 shows the rules of a cluster, which are a product of the classification data mining process. The rules of a
cluster, also known as decision rules, play a vital role in knowledge extraction from a dataset. Our proposed
UDMTool, on the other hand, provides the decision rules of each cluster in the next step by using C4.5,
a classification data mining algorithm. The user can apply these decision rules in simple queries for further
validation of the results.
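As a sketch of how a decision rule might be applied as a simple query for validation, the Python fragment below filters records with one if-then rule and measures how often the rule fires and how often it agrees with the class label. The rule, its thresholds and the four records are hypothetical and are not taken from the tools' output.

```python
# A hypothetical if-then rule; the thresholds are illustrative, not from the paper.
def rule_benign(record):
    return record["UCS"] <= 2 and record["BNu"] <= 3

# Hypothetical records using two of the attribute abbreviations from the dataset.
records = [
    {"UCS": 1, "BNu": 1, "Class": "benign"},
    {"UCS": 2, "BNu": 3, "Class": "benign"},
    {"UCS": 8, "BNu": 10, "Class": "malignant"},
    {"UCS": 1, "BNu": 2, "Class": "malignant"},
]

covered = [r for r in records if rule_benign(r)]
correct = [r for r in covered if r["Class"] == "benign"]
coverage = len(covered) / len(records)   # share of records the rule fires on
accuracy = len(correct) / len(covered)   # share of fired records labelled benign
```

The same filter could be expressed as a WHERE clause in a database query; the point is only that a rule is directly checkable against the data.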
2. The Results of Classification using the Decision Tree Algorithm
Figure 11. The Decision Tree Algorithm with Decision Rules
We apply the decision tree algorithm from the classification model of ODM; the results are shown in figure 11.
The algorithm creates a tree structure of clusters and characterizes each cluster in the
form of rules, surrogates and target values. Furthermore, the number of clusters produced by the decision tree
algorithm differs from that of the enhanced k-means clustering algorithm. The decision rules give the user a better
understanding of the cluster. On this basis we assume that ODM is unifying the clustering and classification
processes.
Figure 12. The Decision Tree Algorithm with Surrogates
Figure 12 shows the surrogates of a cluster. The surrogate values play no role in
knowledge extraction from a dataset.
Figure 13. The Decision Tree Algorithm with Target Values
Figure 13 shows the target values in a cluster. The percentage of the target values varies from cluster
to cluster. We can say that the target values play no role in knowledge extraction from a
dataset.
Remark: After applying the clustering and classification models of ODM, it is difficult for the user to select the
right model, because in both models the clusters are created first and the rules of each cluster are produced
afterwards, yet the output of the two models is not the same. In the UDMTool the first process is clustering, followed
by classification and visualization; therefore, this problem does not arise in the multi-step tool. We can say that the
results of the clustering model are accurate, because in the data mining process model the clusters are created first
and the rest of the processes are then applied to extract the useful information and knowledge.
Case 3: The Results of UDMTool
The UDMTool produces 2D scatter graphs as the final output for the Breastcancer dataset, which can be
interpreted as knowledge.
Figure 14. The Graph between UCSh and MAdh attributes of Breastcancer dataset
The graph in figure 14 can be divided into two regions: in the first region the values of the attributes Uniformity of
Cell Shape and Marginal Adhesion vary, and in the second region they are constant. The outcome of this
graph is that if the values of the attributes are variable then the patient has the malignant class of breast cancer, and
if they are constant the patient has the benign class.
Figure 15. The Graph between BCh and Mitoses attributes of Breastcancer dataset
The values of the attributes Mitoses and Bland Chromatin are almost constant throughout the graph in figure 15.
The graph can be divided into two main regions: the values of Bland Chromatin and Mitoses vary
in the first region and remain constant in the second region. The outcome of this graph is that if the values
of the attributes are variable then the patient has the malignant class of breast cancer, otherwise the benign class.
Table 6 below summarizes the results of the data mining processes of clustering, classification and visualization
using MS SQL Server, ODM and UDMTool.
Table 6. Summary of the output

Data Mining Process: Clustering
MS SQL Server:
1. Uses the MS Clustering algorithm and creates 10 clusters by default (the number of clusters is not optimized).
2. Provides further characterization of each cluster, such as the population per cluster and the probability of each
input variable.
3. Provides the bindings (weak or strong) among clusters.
Remark: The cluster population and probability alone are not sufficient to extract knowledge.
ODM:
1. Uses the K-means Clustering algorithm and creates 10 clusters by default (the number of clusters is not
optimized) in a hierarchical structure.
2. Provides further characterization of each cluster, such as centroids and cluster rules.
Remark: Cluster rules are basically the output of the classification data mining process. In this way ODM unifies the
clustering and classification data mining processes, which is a step forward towards knowledge extraction.
UDMTool:
1. Uses the K-means Clustering algorithm and creates 2 clusters according to the target values of the input variable
of the given dataset, because the number of clusters is optimized.
2. Provides further characterization of each cluster, such as the population per cluster.

Data Mining Process: Classification
MS SQL Server:
1. Uses the MS Decision Tree algorithm and creates a horizontal tree of the whole dataset. The tree has 8 nodes in
total.
2. The rules of the dataset can be created by another algorithm, MS Association. The list of rules is very long and
sometimes misleading, and it confuses the user in the selection of the important and best rules. In this way the user
has to apply two data mining algorithms to obtain the decision rules.
Remark: The nodes of the tree do not reflect the knowledge.
ODM:
1. Uses the Decision Tree algorithm and creates a hierarchical tree of the whole dataset. The tree has 7 nodes in
total.
2. Provides further characterization of each node by surrogates, decision rules and the percentage of the target
value in each node.
3. The decision rules are in (if-then-else) form, which can be deployed in a simple query.
Sometimes it appears that there is no real difference between the clustering and classification models in ODM; the
only difference is in the characterization. The decision rules vary from cluster to cluster.
Remark: There is still confusion in the selection of the results of these two data mining processes in ODM.
UDMTool:
1. Uses the output(s) of the clustering process as input, applies the C4.5 (Decision Tree) algorithm and classifies
each cluster by providing the decision rules as output.
2. The number of decision rules varies from cluster to cluster. The list of decision rules is not as long as in MS SQL
Server. This is also referred to as the characterization of classified clusters.
3. The output of this process is in (if-then-else) form, as in ODM, which can be deployed in a simple query.

Data Mining Process: Visualization
MS SQL Server:
No visualization model/algorithm is provided, although MS SQL Server provides a GUI in each process of data
mining. The user can save the results and use MS Excel as a visualization tool.
Remark: The data mining processes are not unified; rather, each process is carried out individually, and therefore it
is difficult to extract the knowledge.
ODM:
No visualization model/algorithm is provided, although ODM provides a GUI in each process of data mining. The
user can save the results and use MS Excel as a visualization tool.
Remark: The data mining processes clustering and classification are unified, which is a step forward in knowledge
extraction.
UDMTool:
Provides 2D graphs of each classified cluster, which help the user to visualize and then interpret the results and
finally extract the knowledge.
Remark: The data mining processes are unified, which makes it easier for the user to extract the knowledge.

Conclusion
MS SQL Server:
A single-step data mining tool where the selection of algorithms is ad-hoc and it is difficult to extract knowledge.
ODM:
A single-step (to some extent multi-step) data mining tool where the selection of algorithms is ad-hoc and
knowledge extraction is easier than in MS SQL Server.
UDMTool:
A multi-step data mining tool where the selection of algorithms is automatic (based on the values of the dataset) and
knowledge extraction is very simple.
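Table 6 notes that the (if-then-else) decision rules produced by ODM and UDMTool can be deployed in a simple query. The following is a minimal sketch of that idea in Python, using the standard sqlite3 module as a stand-in for the database. The table name patients, the columns ucsh and madh, and the threshold values in the rule are hypothetical and are not results of this study.

```python
import sqlite3

# Hypothetical rule for illustration only (not taken from the paper's results):
# IF ucsh <= 3 AND madh <= 3 THEN 'benign' ELSE 'malignant'
RULE_SQL = """
SELECT id,
       CASE WHEN ucsh <= 3 AND madh <= 3 THEN 'benign'
            ELSE 'malignant'
       END AS predicted_class
FROM patients
ORDER BY id
"""


def apply_rule(rows):
    """Load (id, ucsh, madh) rows into an in-memory table and apply the rule."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE patients (id INTEGER, ucsh INTEGER, madh INTEGER)"
        )
        conn.executemany("INSERT INTO patients VALUES (?, ?, ?)", rows)
        return conn.execute(RULE_SQL).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    # Hypothetical sample rows: (id, ucsh, madh)
    sample = [(1, 1, 1), (2, 8, 7), (3, 2, 5)]
    for row in apply_rule(sample):
        print(row)
```

The if-then-else rule maps directly onto a SQL CASE expression, so the rule can be evaluated where the data is stored without a separate scoring step.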
We tested Breastcancer, a medical dataset, on three different data mining tools: MS SQL Server, ODM and
UDMTool. The obtained results differ, although we used the same data mining algorithm in each process of data
mining. Firstly, in MS SQL Server and ODM some of the inputs of the data mining algorithms are not optimized,
whereas UDMTool uses optimized algorithms. Secondly, in MS SQL Server the data mining processes clustering,
classification and visualization are carried out individually and there is no relation between them; therefore, it is
difficult to extract knowledge. In ODM, clustering and classification are unified, which helps the user to extract
knowledge. In UDMTool the data mining processes are unified: the output of clustering is the input of classification,
and the output of classification is the input of visualization, which provides the user with knowledge.
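The chaining described above, where the output of each process is the input of the next, can be sketched as follows. This is a minimal illustration in Python and not the UDMTool implementation: the clustering step is a plain 2-means, the classification step derives one simple range rule per cluster as a simplified stand-in for C4.5, and the visualization step only prepares labelled point series for a 2D scatter graph. All function names and sample values are hypothetical.

```python
from collections import Counter
from statistics import mean

# Hypothetical records: (attribute x, attribute y, class label)
SAMPLE = [
    (1, 1, "benign"), (2, 1, "benign"), (1, 2, "benign"),
    (8, 7, "malignant"), (9, 8, "malignant"), (7, 9, "malignant"),
]


def cluster_records(records, iterations=10):
    """Step 1 (clustering): plain 2-means on the two numeric attributes."""
    points = [r[:2] for r in records]
    centroids = [min(points), max(points)]  # deterministic start for the sketch
    clusters = [[], []]
    for _ in range(iterations):
        clusters = [[], []]
        for r in records:
            dist = [(r[0] - cx) ** 2 + (r[1] - cy) ** 2 for cx, cy in centroids]
            clusters[dist.index(min(dist))].append(r)
        centroids = [
            (mean(r[0] for r in cl), mean(r[1] for r in cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return clusters


def derive_rules(clusters):
    """Step 2 (classification): one range rule per cluster (stand-in for C4.5)."""
    rules = []
    for members in clusters:
        if not members:
            continue  # an empty cluster yields no rule
        xs = [r[0] for r in members]
        ys = [r[1] for r in members]
        label = Counter(r[2] for r in members).most_common(1)[0][0]
        rules.append({"x": (min(xs), max(xs)), "y": (min(ys), max(ys)),
                      "label": label, "members": members})
    return rules


def plot_series(rules):
    """Step 3 (visualization): labelled (x, y) series, one per classified cluster."""
    return [(rule["label"], [(r[0], r[1]) for r in rule["members"]])
            for rule in rules]


if __name__ == "__main__":
    rules = derive_rules(cluster_records(SAMPLE))  # clustering feeds classification
    for rule in rules:
        print(f"IF {rule['x'][0]} <= x <= {rule['x'][1]} AND "
              f"{rule['y'][0]} <= y <= {rule['y'][1]} THEN {rule['label']}")
    for label, pts in plot_series(rules):          # classification feeds visualization
        print(label, pts)
```

Each function takes the return value of the previous one, so the user never has to save or reformat an intermediate result by hand.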
5. Conclusion
The conclusion is that in MS SQL Server the selection of the data mining algorithms, which are also called the MS
data algorithms, is easy, but the choice of the algorithm depends on the user, not on the data. The user has to select
different algorithms at each step of the data mining process to obtain knowledge, which is the primary goal of data
mining. In a single-step data mining tool like MS SQL Server, if one algorithm does not provide the required
results, the user has to choose another one. In ODM the processes of clustering and classification are unified, i.e. if
the user applies the clustering algorithm, it automatically produces the rules of each cluster. Similarly, if the user
chooses the classification algorithm, it first produces the clusters and then the rules of each cluster. This is to some
extent a step towards a multi-step knowledge extraction process, but again the choice of the algorithm depends on
the user, not on the data. ODM provides a workflow facility, which is helpful for the user. It is evident from the
above results that it is difficult for the user to extract knowledge from ODM, although the tool provides a lot of
statistical information about the given dataset. We conclude that no single algorithm can produce knowledge.
Because knowledge extraction is a multi-step process, it is not possible in single-step data mining tools like MS SQL
Server and ODM, and our proposed UDMTool is the ultimate choice.
Another issue in the single-step tools is that the data mining algorithm selected for a particular task takes the whole
dataset and produces the results. The produced results are not the inputs of other data mining tasks; therefore, it is
difficult to extract knowledge from the given dataset. This is because in single-step tools each data mining task is
carried out individually, instead of the tasks being unified. Unification is only possible if the output of one data
mining task is the input of the next, i.e. the output of the clustering task must be the input of the classification
process, which is not possible in the single-step tools. One possible solution is that the user saves the results of the
first step in a separate dataset and then applies the newly created dataset as input to the next step; we believe this is
a lengthy process, because saving the results and preparing a dataset is not very simple. It is evident from the
results of both single-step data mining tools, MS SQL Server and ODM, that no single algorithm can produce
knowledge, because knowledge extraction is a multi-step process, and our proposed UDMTool is the ultimate
choice.
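The manual workaround described above for single-step tools, saving the results of the first step in a separate dataset and applying that dataset as input to the next step, can be sketched as follows. This is a minimal illustration in Python that assumes a CSV file as the intermediate dataset; the file name, the column names and the cluster assignments are hypothetical.

```python
import csv
import os
import tempfile


def save_step_output(path, rows, fieldnames):
    """Write the result of one data mining step to a separate dataset (CSV)."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


def load_step_input(path, cluster_id):
    """Read the saved dataset back, keeping one cluster as input for the next step."""
    with open(path, newline="") as f:
        return [row for row in csv.DictReader(f)
                if row["cluster"] == str(cluster_id)]


def handoff(rows, cluster_id):
    """Save the clustering output, then reload one cluster for classification."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "clustered.csv")
        save_step_output(path, rows, ["ucsh", "madh", "cluster"])
        return load_step_input(path, cluster_id)


if __name__ == "__main__":
    # Hypothetical clustering output: attribute values plus assigned cluster
    clustered = [
        {"ucsh": 1, "madh": 1, "cluster": 0},
        {"ucsh": 8, "madh": 7, "cluster": 1},
    ]
    print(handoff(clustered, 1))
```

Note that the values come back from the file as strings, so the next step has to convert them again. This is a small instance of the dataset preparation effort that makes the manual hand-off lengthy.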
Acknowledgement
The authors are thankful to The Islamia University of Bahawalpur, Pakistan for providing financial assistance to
carry out this research activity under HEC project 6467/F II.
References
[1] Berry, M.J., Data Mining Techniques: For Marketing, Sales and Customer Relationship Management,
Hoboken, NJ, USA: John Wiley & Sons Incorporated, p. 35, 2004.
[2] Skrypnik, Irina, Terziyan, Vagan, Puuronen, Seppo, and Tsymbal, Alexey, Learning Feature Selection for
Medical Databases, CBMS 1999.
[3] Peng, Y., Kou, G., Shi, Y., Chen, Z., A Descriptive Framework for the Field of Data Mining and Knowledge
Discovery, International Journal of Information Technology and Decision Making, Vol. 7, Issue 4, pp. 639-682,
2008.
[4] Grossman, Robert, Kasif, Simon, Moore, Reagan, Rocke, David, and Ullman, Jeff, Data Mining Research:
Opportunities and Challenges, A Report of three NSF Workshops on Mining Large, Massive, and Distributed
Data, (Draft 8.4.5), January 21, 1998.
[5] Yang, Qiang, Wu, Xindong, 10 Challenging Problems in Data Mining Research, International Journal of
Information Technology & Decision Making, Vol. 5, No. 4, pp. 597-604, 2006.
[6] Wu, Xindong, Kumar, Vipin, Quinlan, J. Ross, et al., Top 10 algorithms in data mining, Survey Paper,
Knowl Inf Syst (2008) 14:1-37, 2008.
[7] Das, Somenath, "Unified data mining engine as a system of patterns", Master's Theses, Paper 3440,
http://scholarworks.sjsu.edu/etd_theses/3440, 2007.
[8] Singh, Shivanshu K., Eranti, Vijay Kumer, Fayad, M.E., Focus Group on Unified Data Mining Engine (UDME
2010): Addressing Challenges, Focus Group Proposal, 2010.
[9] CRISP-DM 1.0, Step-by-step data mining guide, at URL: http://www.crisp-dm.org/CRISPWP-0800.pdf
[10] Oracle Data Mining Concepts, 10g Release 2 (10.2), at URL:
http://docs.oracle.com/html/B14339_01/5dmtasks.htm
[11] US Census Bureau, Iris, Diabetes, Vote and Breast datasets, at URL: www.sgi.com/tech/mlc/db, visited 2009.
[12] Web site of Microsoft, at URL: http://msdn.microsoft.com/en-us/library/bb510508(v=sql.105).aspx
[13] Oracle Data Mining (ODM) Concepts, 10g Release 1 (10.1), Part Number B10698-01, at URL:
http://docs.oracle.com/cd/B12037_01/datamine.101/b10698/, 2003.
[14] Utley, Craig, Introduction to SQL Server 2005 Data Mining, at URL:
http://msdn.microsoft.com/en-us/library/ms345131(v=sql.90).aspx, 2005.