

Data Mining Tool for Effective Classification and Retrieval of Relevant User Data Using Fuzzy and BSO

1 Antony Rosewelt & 2 Arokia Renjit

1 Department of CSE, Stella Mary's College of Engineering, Nagercoil, India

2 Department of CSE, Jeppiaar Engineering College, Chennai, India

[email protected]; [email protected]

Abstract – Data mining techniques are now widely used to retrieve information from large data stores such as data warehouses, repositories and the World Wide Web. Huge volumes of user data are stored in cloud repositories, and the relevant information is stored and maintained on the Internet. The efficiency of a data mining tool can be judged by the volume of relevant data or information that it successfully retrieves from the source. The classification process also plays a major role in identifying the right data or information and categorizing it for retrieval, storage and maintenance. For this purpose, we propose a new data mining tool that retrieves data effectively by using pre-processing and classification. We introduce a new semantic based data pre-processing technique, and we propose a new classification algorithm that uses fuzzy rules and the Bees Swarm Optimization based Information Retrieval algorithm. In addition, the relevant data and web pages are grouped using the existing k-means clustering algorithm. During the retrieval process, the inter and intra coupling relationships between the data are analysed using the existing semantic model. Here, common terms are used for identifying the intra-relationship between the data, and the partial order relation is used for identifying the inter-relationship between the data. Finally, the proposed mining tool has been evaluated using the well-known repositories Web-docs and Wiki-links and the user feedback collected from users by Amazon.

Keywords - Information retrieval, Data mining, Bees Swarm Optimization algorithm, Fuzzy rules, Clustering, Classification, coupling inter-relationship, coupling intra-relationship.

International Journal of Pure and Applied Mathematics, Volume 119 No. 16 2018, 1239-1256, ISSN: 1314-3395 (on-line version), url: http://www.acadpubl.eu/hub/, Special Issue

1. INTRODUCTION

With the rapid growth of the internet and its data, information retrieval tools play a crucial role in extracting relevant data. Because of the large volume of data available, current internet users depend heavily on these tools to retrieve the information they need to improve their knowledge and business. Conventional information retrieval methods work on keywords, such as positive and negative keywords, which are useful for identifying related terms. Even so, the existing information retrieval methods do not fully satisfy internet users because of semantic challenges such as polysemy and synonymy. Researchers and academicians refer to these challenges as vocabulary mismatch or word mismatch (Furnas et al., 1987).

Researchers have made considerable efforts to address the word mismatch issue, notably through query expansion methods and the lattice based information retrieval approach to query refinement. Query expansion generates a new query by augmenting the original query with additional keywords of the same meaning, extracted from a dictionary such as WordNet or from relevance feedback (Carpineto and Romano, 2012). Alternatively, extra keywords can be taken from the original data sources to expand the query. The lattice based information retrieval technique can both refine and expand the query, and it supports navigational search by exploiting the specificity and generality relations of the lattice (Carpineto and Romano, 2005).
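As a rough illustration of dictionary based query expansion, the following Python sketch appends synonyms to the terms of a query. The synonym table, its entries and the function name `expand_query` are assumptions made for this example; they stand in for a thesaurus such as WordNet and are not taken from the cited works.

```python
# Toy synonym dictionary standing in for a thesaurus such as WordNet.
# The entries are illustrative assumptions, not real WordNet output.
SYNONYMS = {
    "car": ["automobile", "vehicle"],
    "cheap": ["inexpensive", "affordable"],
}


def expand_query(query, synonyms=SYNONYMS):
    """Return the original query terms followed by any new synonyms."""
    terms = query.lower().split()
    expanded = list(terms)
    for term in terms:
        for candidate in synonyms.get(term, []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


print(expand_query("cheap car"))
```

A query whose terms are absent from the dictionary is returned unchanged, which is the behaviour one would expect when no expansion is possible.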

Fuzzy logic is used to overcome uncertainty issues through the development of formal concept analysis. The standard uncertainty issues are data vagueness and implicit information in the queries and the related documents used for retrieving relevant data. Many fuzzy logic and lattice based techniques using formal concept analysis have been proposed to handle these issues (Poelmans et al., 2014; Kumar et al., 2015). Many existing methods adopt the partial order relation of the concepts available in the web, databases and repositories to compute the inter and intra relationships between concepts, related web documents and data available in repositories, and they return the related web documents or data for a given user query. However, all of these methods neglect the semantic data shared between concepts, such as common objects and common attributes. As a result, the data coupling relationship between concepts, which consists of the common objects, the common attributes and the partial order relationship of the concepts, is neglected. Yet the coupling relationship has been shown to be of significant value for improving existing analysis and learning tasks such as data clustering, data classification, recommendation systems, queries and outlier term detection (Pang et al 2016).

In this work, a new data mining tool is proposed for retrieving data effectively by using data pre-processing and data classification. A new semantic approach is proposed for data pre-processing, and a new classification algorithm is proposed that uses fuzzy rules together with the existing Bees Swarm Optimization based Information Retrieval method. The existing k-means clustering algorithm is used for grouping the data based on the relevancy score. Relevancy is measured through the inter and intra coupling relationships between the data, which are analysed using the existing semantic model. The partial order relationship is used for identifying the relationship level between the data. Finally, the proposed data mining tool has been tested with data collected from the well-known repositories Web-docs and Wiki-links and with user feedback.

The rest of this paper is organized as follows: Section 2 discusses the existing data mining tools developed by researchers in this direction. Section 3 explains the overall architecture of the proposed system. Section 4 describes the proposed semantic based data pre-processing, clustering and data classification process. Section 5 presents the results and discussion. Section 6 gives the conclusion and future work in this direction.

2. LITERATURE SURVEY

Many works on semantic based data pre-processing, information retrieval, data clustering and data classification have been carried out by researchers in the past (Arokia Renjit and Shanmuganathan 2010 and 2011). Among them, Youcef et al (2018) explored advances in data mining techniques for solving the basic document retrieval problem. Their technique discovers useful knowledge with data mining techniques and then uses that knowledge to explore the full document collection efficiently. They investigated techniques such as data pre-processing, clustering and Bees Swarm Optimization for exploring the clustered and grouped documents in depth. Their approach improved the quality of the retrieved relevant documents in less time. Shufeng et al (2018) introduced a new framework based on lattices and coupling relationship analysis. Their framework employs formal concepts, extracted using fuzzy formal concept analysis, to represent the queries and the documents. They also apply intra-concept and inter-concept coupling relationship analysis to rank the web documents.

Fabricio et al (2015) investigated the use of a bi-clustering method for capturing local coherence across subsets of records and the available fields. Their approach addressed the dimensionality problem, reduced the redundancy of correlated features, and improved separability and classification accuracy.

Thiago et al (2018) developed a new supervised classifier that applies the limiting probabilities of random walk theory to underlying networks constructed from the input labelled data. They also demonstrated with examples that their classifier combines low and high level attributes.

Fuji (2009) explored a few main areas of advanced information retrieval. The work concentrates on cross lingual, multimedia and semantic based information retrieval. Cross lingual information retrieval deals with raising queries in one language and retrieving the related web documents in various languages. Semantic based information retrieval goes beyond the surface level of the data by using the concepts represented in web documents and user queries to improve the retrieval process.

Antonio et al (2018) took as their starting point a new strategy based on the clustering process. They improved performance by addressing the major issues related to records located near the cluster boundaries by enlarging the cluster size, and they also considered the use of Deep Neural Networks for learning a suitable representation for the classification task. They achieved reasonable classification accuracy over eight different datasets.

Mao et al (2013) addressed problems such as finding relevant documents, complicated language use, language ambiguity and inaccurate results. They developed a new semantic based content mapping technique for the information retrieval model. Their model employs standard semantic features and an ontological structure to construct a new content map, and it improved the accuracy of relevant document or data retrieval.


Olga et al (2017) described an effective method called PolaritySim for determining the word level contextual polarity that uses readily available consumer rated reviews as the only external resource.

Preben et al (2005) investigated the expressions of collaborative activities within information searching and retrieval processes. They presented empirical results from a real-life information setting within the domain. They also categorised the collaborative activities and related them to the various stages of the information searching and retrieval process. Finally, they introduced an improved information retrieval model that accounts for collaborative aspects.

Rabia et al (2006) employed new algorithms for ranking documents automatically. They merged the retrieval results of multiple systems using various data fusion algorithms, treated the top-ranked documents as relevant, and used these documents to evaluate and rank the retrieval methods. They also introduced a new approach for selecting the information retrieval systems to be used for effective data fusion. Finally, the authors showed that their method performs better than the existing automatic ranking techniques.

Goran et al (2014) presented new methods for document retrieval and multi-document summarization. Their method measures the similarity between queries and web documents by combining graph kernels on event graphs. Their model achieved better clustering performance and more relevant multi-document summaries.

Antonio et al (2010) proposed a new algorithm for refining the ontologies used in relevant information retrieval tasks, with positive preliminary results. Andrea et al (2012) presented their experience in using X.MAS, a generic multi-agent architecture aimed at retrieving relevant information, filtering data and reorganizing the information according to user requests. Tatiana et al (2013) described in detail the basic theories of human development that explain the specifics of young users, such as their cognitive skills, fine motor skills, knowledge, memory and emotional states, in so far as they differ from those of adults.

Sairamesh et al (2015) proposed a new algorithm that infers user interests from user queries and past profile logs and provides relevant information based on user personalization. They also introduced a new classifier for classifying the data and applied a new ranking algorithm for categorizing the relevant data. Kulunchakov et al (2017) proposed a novel approach for constructing new ranking algorithms for effective retrieval of relevant information or data. Mehrbakhsh et al (2018) proposed a new recommendation system based on ontology and dimensionality reduction techniques to address the sparsity and scalability problems.

Obada et al (2017) proposed a novel method that uses fuzzy logic to model tasks, user profiles and documents in order to capture the user's relevant information searching behaviour. Feedback relevancy is calculated using a linear regression model that predicts web document relevancy from implicit relevance indicators. Fuzzy rule based summarisation is also used for integrating the profiles. The method was evaluated using the precision and recall metrics, and it showed significant improvements in retrieving relevant information for user queries.

3. SYSTEM ARCHITECTURE

The overall architecture of the proposed system for analysing web data and documents is shown in Figure 1. It consists of six modules: web documents/feedback data, a user interface module, an intelligent data mining tool, a rule manager, a rule base and results.

Figure 1. System Architecture

[Diagram blocks: Web Documents/Feedback Data; User Interface Module; Intelligent Data Mining Tool (Data Pre-processing, Document Clustering, Data Classification); Rule Manager; Rule Base; Result]


The web documents and feedback data module consists of a large volume of web documents and feedback data available on the Amazon website and in cloud repositories. This collection of web documents and web data, such as feedback data, is the input dataset in this work. The user interface module collects the necessary web documents and feedback data from business websites such as Amazon. The intelligent data mining tool consists of three sub-modules: data pre-processing, clustering and data classification. The data pre-processing sub-module removes noisy data, null data and meaningless data. The clustering sub-module groups the relevant data or web documents using the existing k-means clustering algorithm. The classification sub-module categorizes the data or documents by using intelligent fuzzy rules. The rule manager manages the fuzzy rules and interacts with the data mining tool and the rule base; it stores and retrieves the fuzzy rules in the knowledge base. The rule base stores all the fuzzy rules that are useful for categorizing the feedback data and for classifying the web documents. The proposed model refers to the rule base built around user queries. The result module holds the documents or feedback data returned for the user query.

4. PROPOSED WORK

This section discusses the proposed model, which combines data pre-processing, clustering and classification. A new semantic based data pre-processing technique is proposed for identifying the original and useful data for the analysis. An existing clustering algorithm is used for grouping the relevant data or web documents so that further analysis is quick. In addition, a new classifier is proposed for effective data or document classification. This section is divided into four subsections: semantic based data pre-processing, K-Means clustering, Fuzzy Rule and BSO based Classification, and the intelligent data mining tool for information retrieval.

4.1 Semantic based Data Pre-processing

The main aim of data pre-processing is to enhance the capability of the existing data mining tools used for extracting the relevant data or documents, which are later used by the proposed fuzzy rule and BSO based classifier. It removes unnecessary data such as null values from the input dataset, and it checks whether semantic content is available in the dataset. The proposed data pre-processing phase tokenizes the input content, checks the grammar and checks whether the content is semantically correct.
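The null removal and tokenization steps can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions: the function name `preprocess`, the lowercase word pattern and the sample reviews are hypothetical, and the grammar and semantic checks of the proposed technique are not modelled here.

```python
import re


def preprocess(records):
    """Drop null or blank records and tokenize the rest into lowercase words."""
    cleaned = []
    for record in records:
        if record is None or not record.strip():
            continue  # null or blank entry, treated as noise
        # Assumed token pattern: runs of letters, digits and apostrophes.
        tokens = re.findall(r"[a-z0-9']+", record.lower())
        if tokens:  # skip records with no word content, e.g. "!!!"
            cleaned.append(tokens)
    return cleaned


reviews = ["Great product!", None, "   ", "Fast delivery, works well."]
print(preprocess(reviews))
```

Records that survive this step would then be passed on to the later grammar and semantic checks.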

4.2 K-Means Clustering

The K-means algorithm is used in this work for grouping ‘n’ points into ‘k’ subsets $Sub_j$. Each subset (cluster) $Sub_j$ contains $ns_j$ data points. First, the data points are assigned randomly to the k clusters and the centroid of each cluster is calculated. Then, each data point is assigned to the cluster whose centroid is closest to it. These steps are repeated until no data point is reassigned to a different cluster. In this work, the K-means clustering algorithm is adapted in two steps. In the first step, a web data weightage is assigned to every document from the weightages of its individual words. In the second step, the term frequency and the relevancy score are calculated from the word weightages and the number of occurrences of each word in a document.
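The core loop of k-means, nearest-centroid assignment followed by centroid recomputation, can be sketched as follows. This is a generic textbook version in Python, not the adapted variant of this work: it uses squared Euclidean distance on numeric points, and, as an assumption made to keep the example deterministic, it starts from the first k points as centroids instead of a random assignment.

```python
def kmeans(points, k, max_iter=100):
    """Plain k-means; returns (cluster index per point, centroids)."""
    # Assumption: deterministic start from the first k points.
    centroids = [list(p) for p in points[:k]]
    assignment = [None] * len(points)
    for _ in range(max_iter):
        # Assign every point to the cluster with the nearest centroid.
        new_assignment = [
            min(range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the clustering is stable
        assignment = new_assignment
        # Recompute each centroid as the mean of its member points.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return assignment, centroids


points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (9.0, 8.0)]
labels, centers = kmeans(points, k=2)
print(labels, centers)
```

In a document setting, each point would be a vector of word weightages rather than a pair of coordinates.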

4.3 Fuzzy Rule and BSO based Classification

This subsection explains the new fuzzy rule and BSO based classifier that is incorporated into the proposed intelligent data mining tool, which is developed for effective retrieval of relevant information from repositories. The proposed classifier combines the existing BSO based classification algorithm developed by [1] with the rules necessary for making effective decisions during the retrieval process on web data.
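The paper does not list its fuzzy rules, so the following Python sketch only illustrates the general idea of fuzzy categorization. The three fuzzy sets, their triangular membership parameters and the function names are all hypothetical, and the BSO search itself is not modelled.

```python
def triangular(x, a, b, c):
    """Triangular membership: 0 outside (a, c), rising linearly to 1 at b."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


# Hypothetical fuzzy sets over a relevancy score in [0, 1].
FUZZY_SETS = {
    "low": (-0.5, 0.0, 0.5),
    "medium": (0.0, 0.5, 1.0),
    "high": (0.5, 1.0, 1.5),
}


def classify(score):
    """Return the label with the highest membership, plus all memberships."""
    memberships = {
        label: triangular(score, *abc) for label, abc in FUZZY_SETS.items()
    }
    best = max(memberships, key=memberships.get)
    return best, memberships


print(classify(0.9))
```

A score can belong partly to two neighbouring sets at once, which is what lets rules of the form "if relevancy is high then retrieve" fire by degree instead of at a hard threshold.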

4.4 Intelligent Data Mining Tool for Information Retrieval

In this work, a new intelligent data mining tool has been designed for retrieving relevant data from large databases and cloud repositories. It uses semantic based data pre-processing, clustering and classification techniques for effective data retrieval. The tool has three phases, one for each of these activities.


Input: Web documents

Output: Relevant documents

Phase 1: Pre-processing

Step 1: Read the documents one by one from the list DL = (d1, d2, d3, ..., dn).
Step 2: Read the first line from the document Di and check its tokens against the standard metadata.
Step 3: Apply the parser to the line (sentence) ‘l’.
Step 4: Call the syntax analyser for grammar checking.
Step 5: If the line has no grammatical error then
            Apply LSA (DLi, S, l)
        Else
            Correct the grammatical errors and go to Step 5.
Step 6: Create a semantic network for the line by calling the procedure semantic_network().
Step 7: Compare the semantic structure built for the line with the semantic metadata of the line, node by node, for the whole sentence.
Step 8: If the line matches the metadata semantically then
Step 9:     If the end of the data has not been reached then
                Display the semantic analysis results
            Else
                Go to Step 13.
Step 10: Apply the procedure Pragmatic_Analysis().
Step 11: If the anaphora is resolved then go to Step 9
Step 12: Else check the file status.
Step 13: If EOF then stop
Step 14: Else go to Step 1.

Phase 2: Clustering

Step 1: Set the number of clusters ‘k’.
Step 2: Select ‘k’ initial center points, one for each of the ‘k’ groups.
Step 3: Assign weightages to all the words in the document, as a word representation, using the expert guidelines stored in the database.
Step 4: Read the first document from the set of documents.
Step 5: Find the cosine similarity for the words belonging to the first group and store it in m.
Step 6: Check the cosine similarity of the words of each document.
Step 7: If the cosine similarity of the words is less than m then it is taken as the minimum cosine value of the whole data.
Step 8: If no document word changes the average score of a group then
            Stop the process and exit
        Else
            Find the new center point of each group in the cluster.
Step 9: Return the clustered document set.

Phase 3: Classification

Step 1: Accept the user request from the user queries.
Step 2: Apply the existing classifier, the Bees Swarm Optimization based Information Retrieval algorithm, to the clustered documents in the document list.
Step 3: Provide the relevant content or data to the user and apply the fuzzy rules.
Step 4: Map the semantic fuzzy rules to the nodes of the newly constructed semantic tree.
Step 5: If the nodes and rules match then
            Produce all the relevant contents.
        End if
Step 6: Call the procedure for retrieving the exact contents.

The proposed data mining tool thus performs three actions: semantic based pre-processing, k-means clustering for grouping the data and documents, and fuzzy rule based BSO-IR for effectively retrieving relevant data from the databases.
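The clustering phase of the algorithm above compares documents by cosine similarity. A minimal Python sketch of that measure is given below, under the assumption that each document is represented by raw term frequencies; the expert-assigned word weightages used in this work are not modelled, and the function name is hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity of two token lists using raw term frequencies."""
    tf_a, tf_b = Counter(tokens_a), Counter(tokens_b)
    # Dot product over the terms shared by both documents.
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


print(cosine_similarity(["good", "phone"], ["good", "camera"]))
```

Two documents with no shared words score 0, and identical documents score 1 (up to floating point rounding), so the value can be used directly as a grouping criterion.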

5. RESULTS AND DISCUSSION


This section describes the test bed used to evaluate the proposed data mining tool for retrieving relevant data or documents from the web or repositories. Standard performance metrics are used to measure its performance. The experiments were conducted using web documents containing product reviews, which serve as feedback about a product or company, and a CSV file containing user feedback about Amazon products. The data mining tool was implemented in Java. The prediction accuracy over the documents or user data is calculated using precision, recall and F-measure, which are defined below:

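$$\text{PRECISION} = \frac{\text{Number of retrieved relevant documents}}{\text{Number of documents retrieved}} \tag{1}$$

$$\text{RECALL} = \frac{\text{Number of retrieved relevant documents}}{\text{Total number of relevant documents in the collection}} \tag{2}$$

$$\text{F-MEASURE} = \frac{2 \times \text{PRECISION} \times \text{RECALL}}{\text{PRECISION} + \text{RECALL}} \tag{3}$$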
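As a worked example of these three metrics, the Python sketch below computes them for one query. It follows the standard definitions, in which recall is divided by the number of relevant documents in the collection. The document identifiers and the function name are made up for illustration, and the code is not the authors' Java implementation.

```python
def retrieval_metrics(retrieved, relevant):
    """Return (precision, recall, f_measure) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # retrieved documents that are relevant
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical query: 4 documents retrieved, 3 relevant in the collection.
p, r, f = retrieval_metrics(["d1", "d2", "d3", "d4"], ["d1", "d2", "d5"])
print(p, r, f)
```

Here two of the four retrieved documents are relevant, so precision is 2/4, recall is 2/3 and the F-measure is their harmonic mean, 4/7.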

Five experiments were conducted to evaluate the proposed algorithm over the web documents. Figure 2 shows the performance analysis of the proposed data mining tool for retrieving relevant information from the documents available in the database. The experiments considered sets of 600, 700, 800, 900 and 1000 web documents.


Figure 2. Performance Analysis of the data mining tool over the web documents

Figure 2 shows the performance of the proposed data mining tool, which combines data pre-processing, clustering and classification. The accuracy of the proposed tool changes significantly with the number of sentences considered in the experiments.

Figure 3 shows the relevancy score analysis between the proposed IRA and the proposed IRA with semantic indexing. Here, document sets of different sizes, namely 50, 100, 150, 200, 250 and 300 documents, are considered for the experiments. These sets contain documents of different levels.

Figure 3. Relevancy Score Analysis between IRA and IRA with Semantic Indexed table

[Figure 2 plot, "Performance Analysis": x-axis No. of Web Documents (600 to 1000); y-axis Accuracy (%) (97 to 100); series: Without Semantic Indexing, With Semantic Indexing]

[Figure 3 plot, "Relevancy Score Analysis": x-axis No. of documents (100 to 500); y-axis Relevancy Score (%) (98 to 99); series: IRA, Semantic Indexed table+IRA]


From Figure 3, it can be observed that the IRA with the proposed semantic index table performs better than the IRA without it.

Figure 4 shows the relevancy score analysis between the existing FTCM algorithm and the proposed system, which combines IRA and FTCM with the semantic index table. Here, document sets of 50, 100, 150, 200, 250 and 300 documents are considered for the experiments.

Figure 4. Relevancy score Analysis between the proposed system and FTCM

From figure 4, it can be observed that the performance of the proposed system is better than

the existing FTCM algorithm.

Table 1 shows the performance of the proposed system and the existing systems. It lists the precision, recall, F-measure and accuracy values for the proposed system and the existing systems.

Table 1. Performance Analysis

Method Name                      Precision   Recall   F-Measure   Accuracy (%)
FCM                              93.21       92.98    93.98       93.41
FTCM                             98.24       98.52    98.75       98.56
FTCM+IRA                         98.93       98.96    99.02       98.98
FTCM+IRA+Semantic Index Table    99.45       99.65    99.78       99.72

[Plot for Figure 4: Relevancy Score Analysis. x-axis: No. of documents (100 to 500); y-axis: Relevancy Score (%), 95 to 100; series: FTCM, Semantic Indexed table+IRA+FTCM]


From Table 1, it is observed that the precision, recall, F-measure and accuracy of the proposed system are higher than those of the existing systems. This is because the proposed system provides a semantic index table, which is useful for performing syntax and semantic oriented ordering of documents.
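The four measures reported in Table 1 can be derived from counts of correct and incorrect retrieval decisions. The sketch below uses the standard definitions, with the F-measure taken as the harmonic mean of precision and recall. The source does not state which F-measure variant or averaging scheme was used, so that choice is an assumption. The `Confusion` class and the example counts are hypothetical and are not taken from the experiments.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Hypothetical container for retrieval decision counts."""
    tp: int  # relevant documents that were retrieved
    fp: int  # non-relevant documents that were retrieved
    fn: int  # relevant documents that were missed
    tn: int  # non-relevant documents that were correctly left out


def precision(c: Confusion) -> float:
    """Fraction of retrieved documents that are relevant."""
    retrieved = c.tp + c.fp
    return c.tp / retrieved if retrieved else 0.0


def recall(c: Confusion) -> float:
    """Fraction of relevant documents that were retrieved."""
    relevant = c.tp + c.fn
    return c.tp / relevant if relevant else 0.0


def f_measure(c: Confusion) -> float:
    """Harmonic mean of precision and recall (assumed F1 variant)."""
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r) if (p + r) else 0.0


def accuracy(c: Confusion) -> float:
    """Fraction of all decisions that were correct."""
    total = c.tp + c.fp + c.fn + c.tn
    return (c.tp + c.tn) / total if total else 0.0


# Illustrative counts only; they do not come from the paper's data sets.
example = Confusion(tp=90, fp=10, fn=5, tn=95)
for name, fn in [("precision", precision), ("recall", recall),
                 ("f_measure", f_measure), ("accuracy", accuracy)]:
    print(f"{name}: {100 * fn(example):.2f}%")
```

Under this definition the F-measure always lies between the precision and the recall, which gives a quick sanity check when reading such a table.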

Figure 5 shows the accuracy analysis between the proposed recommendation system and the existing model. Here, document sets of different sizes, namely 50, 100, 150, 200, 250 and 300 documents, are considered for the experiments.

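Figure 5. Accuracy Analysis between the proposed data mining tool and the existing model

[Plot for Figure 5: Recommendation Accuracy Analysis. x-axis: No. of Documents (50 to 300); y-axis: Accuracy (%), 90 to 100; series: Existing Model, Proposed Recommendation System]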

From Figure 5, it can be observed that the accuracy of the proposed recommendation system is better than that of the existing model by more than 2%. This is due to the use of effective semantic-based pre-processing, effective document clustering and the fuzzy rule-based BSO-IR.
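The document clustering step credited here is, according to the paper's own summary, the existing k-means algorithm. A minimal sketch of standard k-means (Lloyd's iteration) is given below for orientation only. The function name `kmeans`, the two-dimensional term-count vectors, the value of `k` and the random seeding are all assumptions, since the source does not specify the feature representation or the parameters used.

```python
import math
import random


def kmeans(points, k, iterations=100, seed=0):
    """Lloyd's k-means on equal-length numeric vectors.

    Returns (centroids, labels). Initial centroids are k distinct
    input points chosen with a seeded random generator.
    """
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Stop once the assignment no longer changes.
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels


# Hypothetical documents described by counts of two terms each.
docs = [[5, 0], [4, 1], [0, 5], [1, 4]]
centroids, labels = kmeans(docs, k=2)
print(labels)
print(centroids)
```

Documents dominated by the same term end up with the same label. Which cluster receives index 0 depends on the initial choice of centroids.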

6. CONCLUSION AND FUTURE WORKS

A new data mining tool has been developed in this work for retrieving data effectively by using pre-processing and classification. A new semantic-based data pre-processing technique has been proposed and implemented for effective data pre-processing. Moreover, a new classification algorithm has been proposed and implemented for effective data classification using fuzzy rules and a Bees Swarm Optimization based information retrieval method. In addition, the relevant data and web pages are grouped using the existing k-means clustering algorithm. During the retrieval process, the inter and intra coupling relationships between the data are analysed by using the existing semantic model. Here, the common terms are used for identifying the intra-relationship between the data, and the partial order relation is used for identifying the inter-relationship between the data. Finally, the proposed mining tool has been evaluated by using the well-known repositories Web-docs and Wiki-links and the user feedback collected by Amazon. Future work in this direction could be the introduction of a new document clustering algorithm for enhancing the performance of the classifier which is used for retrieving the relevant documents from web databases.


AUTHORS BIOGRAPHY

Antony Rosewelt completed his undergraduate degree in Electronics and Communication Engineering and his postgraduate degree in Computer Science and Engineering at Anna University. He is currently working as an Assistant Professor in the Department of Computer Science and Engineering, Stella Mary's College of Engineering. His research interests are Data Mining, Soft Computing and Ad-hoc Networks.

J. Arokia Renjith completed his undergraduate degree in Electrical and Electronics Engineering at Bharathiyar University, his postgraduate degree in Computer Science and Engineering at Anna University and his PhD in Computer Science and Engineering at Sathyabama University. He has been in the teaching field for the past 16 years. He is currently working as Professor and Head of the Department of Computer Science and Engineering, Jeppiaar Engineering College. His research interests are Data Mining, Cloud Computing and Network Security.
