

EFFICIENT FREQUENT PATTERN SEARCHING USING AMOEBA AND DECISION TREE TECHNIQUE

B.Mahalakshmi, A.T.Nandhini

K.S.Rangasamy College of Technology

Abstract

Data mining plays a vital role as technologies improve, extracting hidden information or patterns from a huge database or a large collection of data. This paper is concerned with finding frequent patterns of words in a collected dataset using the amoeba model. A new algorithm, called AMOEBA, is used to find the chain of possible frequent patterns. All documents in the dataset are analyzed by reading the files. The semantic words in each document are scanned with the help of the WordNet tool for further processing. The AMOEBA model is then used to cluster documents and words simultaneously. The generated model is optimized by the AMOEBA algorithm for efficiency. Finally, the optimized word is chosen using a decision tree, which helps to give a clear result for the user's search. The algorithm is therefore claimed to reduce space and time complexity in comparison with Apriori and FP-Growth.

Introduction

Frequent pattern mining by pattern fragment growth is an efficient and scalable method for mining the complete set of frequent patterns. Frequent patterns are itemsets, subsequences or substructures that appear in a data set with a frequency not less than a user-specified threshold. Association rule learning is a rule-based machine learning method for discovering interesting relations between variables in large databases. Let I = {i1, i2, i3, …, in} be a set of binary attributes called items. Let D = {t1, t2, t3, …, tm} be a set of transactions called the database. Each transaction in D has a unique transaction ID and contains a subset of the items in I. A rule is defined as an implication of the form X → Y, where X, Y ⊆ I. In order to select interesting rules from the set of all possible rules, constraints on various measures of significance and interest are used. The best-known constraints are minimum thresholds on support and confidence. Let X be an itemset, X → Y an association rule and T a set of transactions of a given database.

Support: Support is an indication of how frequently the itemset appears in the dataset,

$$\mathrm{Support}(X) = \frac{|\{t \in T : X \subseteq t\}|}{|T|}.$$

Confidence: Confidence is an indication of how often the rule has been found to be true. The confidence value of a rule X → Y, with respect to a set of transactions T, is the proportion of the transactions containing X that also contain Y,

$$\mathrm{conf}(X \rightarrow Y) = \frac{\mathrm{Support}(X \cup Y)}{\mathrm{Support}(X)}.$$
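The two measures can be illustrated with a minimal Python sketch. The function names and the toy transaction database below are hypothetical and are not taken from the paper; the code only follows the definitions of support and confidence given above.

```python
from typing import FrozenSet, Iterable, List

Transaction = FrozenSet[str]


def support(itemset: Iterable[str], transactions: List[Transaction]) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(x: Iterable[str], y: Iterable[str],
               transactions: List[Transaction]) -> float:
    """Support(X ∪ Y) / Support(X); defined as 0.0 when X never occurs."""
    support_x = support(x, transactions)
    if support_x == 0.0:
        return 0.0
    return support(frozenset(x) | frozenset(y), transactions) / support_x


# Hypothetical toy database of four transactions.
transactions = [
    frozenset({"milk", "bread"}),
    frozenset({"milk", "butter"}),
    frozenset({"milk", "bread", "butter"}),
    frozenset({"bread"}),
]

milk_support = support({"milk"}, transactions)                  # 3 of 4 transactions
milk_to_bread = confidence({"milk"}, {"bread"}, transactions)   # 2 of the 3 milk transactions
```

In this toy database, milk appears in three of the four transactions, and two of those three also contain bread, so the rule milk → bread has confidence 2/3.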

International Journal of Pure and Applied Mathematics, Volume 119, No. 10, 2018, 1921-1926. ISSN: 1311-8080 (printed version); ISSN: 1314-3395 (on-line version). url: http://www.ijpam.eu (Special Issue)

The FP-Growth algorithm mines the complete set of frequent patterns using pattern fragment growth and is an efficient and scalable method. To store compressed, crucial information about frequent patterns, it uses an extended prefix-tree structure called the frequent-pattern tree (FP-tree). Its performance is better than that of Apriori. The algorithm finds frequent itemsets without candidate generation and follows a divide-and-conquer strategy. Apriori is one of the most commonly used association mining algorithms for finding frequent patterns. It uses an assigned support threshold and a calculated confidence factor to find frequent patterns.
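As an illustration of the prefix-tree idea, the following Python sketch builds an FP-tree with the usual two scans of the database. It covers only the tree construction, not the recursive mining step, and the class name, function name and toy data are assumptions made for this example.

```python
from collections import Counter
from typing import Dict, List, Optional


class FPNode:
    """One node of the prefix tree: an item and how many transactions pass through it."""

    def __init__(self, item: Optional[str], parent: Optional["FPNode"] = None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children: Dict[str, "FPNode"] = {}


def build_fp_tree(transactions: List[List[str]], min_count: int) -> FPNode:
    # First scan: count in how many transactions each item occurs.
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {item: c for item, c in counts.items() if c >= min_count}

    root = FPNode(None)
    # Second scan: insert each transaction with its frequent items ordered by
    # descending count (ties broken alphabetically), so that transactions
    # sharing frequent items also share a prefix of the tree.
    for t in transactions:
        ordered = sorted((i for i in set(t) if i in frequent),
                         key=lambda i: (-frequent[i], i))
        node = root
        for item in ordered:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
            child.count += 1
            node = child
    return root


# Hypothetical toy database.
toy_db = [["a", "b"], ["b", "c", "d"], ["a", "b", "d"], ["a", "b", "c"]]
tree = build_fp_tree(toy_db, min_count=2)
```

Here item b occurs in all four transactions, so every transaction is stored under the single branch that starts at b, which is the compression the FP-tree is meant to provide.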

A new algorithm named AMOEBA is proposed, based on the characteristics of the unicellular organism amoeba. The algorithm is designed to avoid the pre-calculations required by some association mining algorithms.

Overview of the literature survey

Nobuo Suzuki et al. discussed obtaining more frequency resources for radio systems through frequency sharing, which is one of the critical techniques. The characteristics of a series of data can be captured using frequent sequence mining technology.

Dinesh J. Prajapati et al. describe how consistent and inconsistent association rules can be identified from sales data stored in a distributed environment. This is performed with a MapReduce algorithm to provide useful knowledge to the domain expert.

Songfeng Lu et al. use an FP-growth algorithm for mining frequent itemsets. EFP-growth (Enhanced Frequent Pattern growth) is used to get the best out of FP-growth and to discover frequent patterns in a transaction database. With this method, execution time is reported to improve as the minimum support is lowered.

Roshni Chandran et al. discussed how knowledge discovery in a real-time data stream is improved by the time-efficient Hadoop CanTree-GTree algorithm. It mines the complete set of frequent itemsets from real-time transactions with the help of a sliding window technique.

Methodology

Amoeba is a unicellular organism of irregular shape that belongs to the phylum Protozoa. The name "amibe" was given to it by Bory de Saint-Vincent, from the Greek amoibe, meaning change. Amoeba moves by means of pseudopodia, or "false feet". Many hypotheses have been proposed to explain the mechanism of amoeba movement, but its exact mechanism remains a mystery. These characteristics of amoeba guided the development of a new association mining algorithm, AMOEBA. Amoeba moves along a route that is not fixed in advance, owing to its false feet. This characteristic motivated the new algorithm and can be termed attribute value determining. The determination is achieved using functional dependency, i.e. determining one attribute value from another attribute value. It also includes the percentage at which an attribute value determines the distinct values of another attribute. The algorithm works mainly on two principles:

1. Determining another attribute value in a data set using an attribute value, or determining the attribute value in a data set that determined the given attribute value.

2. The probability of an attribute value being determined by an attribute value.
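The paper does not give pseudocode for these principles, so the following Python sketch is only one possible reading of them. It assumes discrete records in which every record has every attribute, measures "determination" as a conditional relative frequency, and grows a chain greedily from an initial attribute value. The function names, the greedy rule, the `min_prob` cut-off and the toy records are all assumptions made for this example.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

Record = Dict[str, str]
AttrValue = Tuple[str, str]


def determination(records: List[Record], given: AttrValue,
                  target_attr: str) -> Dict[str, float]:
    """Share of the records matching `given` that take each value of `target_attr`."""
    attr, value = given
    matching = [r for r in records if r[attr] == value]
    if not matching:
        return {}
    counts = Counter(r[target_attr] for r in matching)
    return {v: c / len(matching) for v, c in counts.items()}


def greedy_chain(records: List[Record], start: AttrValue,
                 min_prob: float) -> List[Tuple[AttrValue, float]]:
    """Grow a chain of attribute values, each determined by the chain so far."""
    chain: List[Tuple[AttrValue, float]] = [(start, 1.0)]
    rows = [r for r in records if r[start[0]] == start[1]]
    used = {start[0]}
    attrs = sorted({a for r in records for a in r})
    while rows:
        best: Optional[Tuple[AttrValue, float]] = None
        for a in attrs:
            if a in used:
                continue
            for v, c in sorted(Counter(r[a] for r in rows).items()):
                p = c / len(rows)
                if best is None or p > best[1]:
                    best = ((a, v), p)
        # Stop when no attribute is left or the best determination is too weak.
        if best is None or best[1] < min_prob:
            break
        chain.append(best)
        used.add(best[0][0])
        rows = [r for r in rows if r[best[0][0]] == best[0][1]]
    return chain


# Hypothetical toy records with discrete values.
records = [
    {"outlook": "sunny", "wind": "weak", "play": "no"},
    {"outlook": "sunny", "wind": "strong", "play": "no"},
    {"outlook": "rain", "wind": "weak", "play": "yes"},
    {"outlook": "sunny", "wind": "weak", "play": "yes"},
]

play_given_sunny = determination(records, ("outlook", "sunny"), "play")
chain = greedy_chain(records, ("outlook", "sunny"), min_prob=0.6)
```

In the toy data, outlook = sunny determines play = no in two of three matching records, so the chain is extended by play = no. No remaining value reaches the 0.6 cut-off, so the chain stops there. No support counts or candidate itemsets are computed at any point.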

Extraction of documents

The constraint used to cluster documents is created automatically using a named entity (NE) extractor. Each document is parsed to identify named entities. The NE extractor extracts entities from the documents provided by the user. If two documents have overlapping NEs and the number of overlapping NEs is large enough, an entity constraint is added for document clustering. Named-entity-based document constraints can be combined with additional lexical constraints derived from existing knowledge sources to further improve clustering results.

Figure: Document constraint using NE extractor
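The overlap rule can be sketched in Python as follows. The sketch assumes the named entities of each document have already been extracted into sets. The function name, the `min_overlap` threshold and the example entities are hypothetical, since the paper does not state how large the overlap must be.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def ne_overlap_constraints(doc_entities: Dict[str, Set[str]],
                           min_overlap: int) -> List[Tuple[str, str, Set[str]]]:
    """Return (doc_a, doc_b, shared entities) for every pair of documents
    that share at least `min_overlap` named entities."""
    constraints = []
    for a, b in combinations(sorted(doc_entities), 2):
        shared = doc_entities[a] & doc_entities[b]
        if len(shared) >= min_overlap:
            constraints.append((a, b, shared))
    return constraints


# Hypothetical output of an NE extractor for three documents.
doc_entities = {
    "d1": {"IBM", "New York", "Watson"},
    "d2": {"IBM", "Watson", "Chicago"},
    "d3": {"Paris"},
}

doc_constraints = ne_overlap_constraints(doc_entities, min_overlap=2)
```

With a threshold of two shared entities, only the pair d1 and d2 produces a document constraint.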

Mining Semantic Words

The constraint used to cluster words is created automatically using WordNet, a lexical database for English. The semantic relatedness between words can be measured based on the word hierarchies in WordNet. The document is parsed and each word is compared with WordNet to create the constraint. Furthermore, since word knowledge can be transferred to the document side during co-clustering, additional word constraints make it possible to further improve document clustering as well.

Figure: Mining semantic words with the WordNet tool
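A hierarchy-based relatedness measure can be illustrated with a small Python sketch. To stay self-contained it uses a hypothetical toy hypernym hierarchy rather than real WordNet data, and it scores two words by the inverse of one plus the length of the shortest path between them in that hierarchy. The paper does not say which WordNet measure or threshold it uses, so both are assumptions.

```python
from itertools import combinations
from typing import Dict, List, Optional, Tuple

# Hypothetical toy hierarchy (word -> parent); not real WordNet data.
HYPERNYM: Dict[str, Optional[str]] = {
    "entity": None,
    "animal": "entity",
    "vehicle": "entity",
    "dog": "animal",
    "cat": "animal",
    "car": "vehicle",
}


def path_to_root(word: str) -> List[str]:
    """The word followed by its ancestors, up to the root of the hierarchy."""
    path = [word]
    parent = HYPERNYM[word]
    while parent is not None:
        path.append(parent)
        parent = HYPERNYM[parent]
    return path


def path_similarity(w1: str, w2: str) -> float:
    """1 / (1 + number of edges on the shortest path between the two words)."""
    steps_from_w2 = {w: i for i, w in enumerate(path_to_root(w2))}
    for steps_from_w1, ancestor in enumerate(path_to_root(w1)):
        if ancestor in steps_from_w2:
            return 1.0 / (1 + steps_from_w1 + steps_from_w2[ancestor])
    return 0.0


def word_constraints(words: List[str], threshold: float) -> List[Tuple[str, str]]:
    """Pairs of words related closely enough to be linked during clustering."""
    return [(a, b) for a, b in combinations(sorted(words), 2)
            if path_similarity(a, b) >= threshold]


related_pairs = word_constraints(["dog", "cat", "car"], threshold=0.3)
```

In the toy hierarchy, dog and cat meet at animal after one step each, giving a score of 1/3, while dog and car only meet at the root, giving 1/5. With a threshold of 0.3 only the pair (cat, dog) becomes a word constraint.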

Retrieval of clustered word

The document constraint is extracted from NE overlapping and the word constraint is created from WordNet. AMOEBA is modeled for both documents and words so that they are clustered simultaneously. AMOEBA is used to formulate the prior information for both document and word latent labels.

Figure labels (document constraint diagram): document, NE extractor, NE overlapping, document constraint.

Figure labels (word constraint diagram): document, word extractor, WordNet, word constraint.

Figure labels (co-clustering diagram): document constraint, word constraint, co-clustering, AMOEBA model.

Figure: Retrieval of clustered word

Determining the probability

The model generated by AMOEBA is optimized by the AMOEBA algorithm for efficiency. The EM algorithm optimizes the latent labels in the model. There are two steps in the AMOEBA algorithm:

1. Determining the attribute value in a data set that determined the given attribute value.

2. The probability of an attribute value being determined by an attribute value.

Figure: Optimizing model based on co-clustering (EM algorithm; box labels: cluster data, E-step, M-step, cluster model)
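The paper does not specify the probabilistic model that EM optimizes, so the Python sketch below only illustrates the generic alternation of E-step and M-step on latent cluster labels. It swaps in a deliberately simple model: a one-dimensional mixture of two Gaussians with a fixed, assumed variance. The function names, the data and the initial means are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def normal_pdf(x: float, mean: float, var: float) -> float:
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_two_means(data: Sequence[float], init_means: Tuple[float, float],
                 iterations: int = 25, var: float = 1.0) -> Tuple[float, float, float]:
    """EM for a two-component Gaussian mixture with known, shared variance.
    Returns the two estimated means and the weight of the first component."""
    mu1, mu2 = init_means
    weight1 = 0.5
    for _ in range(iterations):
        # E-step: posterior probability that each point has latent label 1.
        resp: List[float] = []
        for x in data:
            a = weight1 * normal_pdf(x, mu1, var)
            b = (1.0 - weight1) * normal_pdf(x, mu2, var)
            total = a + b
            resp.append(a / total if total > 0.0 else 0.5)
        # M-step: re-estimate the parameters from the soft labels.
        n1 = sum(resp)
        n2 = len(data) - n1
        if n1 == 0.0 or n2 == 0.0:
            break
        mu1 = sum(r * x for r, x in zip(resp, data)) / n1
        mu2 = sum((1.0 - r) * x for r, x in zip(resp, data)) / n2
        weight1 = n1 / len(data)
    return mu1, mu2, weight1


# Hypothetical data with two well-separated groups.
points = [0.8, 1.0, 1.2, 4.8, 5.0, 5.2]
mu1, mu2, weight1 = em_two_means(points, init_means=(0.0, 6.0))
```

On this data the estimated means settle close to 1 and 5, with roughly half of the weight on each component. A co-clustering model would replace the Gaussian likelihood with one defined over documents and words, while keeping the same E-step and M-step structure.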

Conclusion

The AMOEBA algorithm does not require the construction of a transaction data set, the calculation of factors such as support and confidence, or the assembly of frequent-pattern trees. The restriction of AMOEBA is that the input data set must be discretized, because determination through probability is defined on discrete values. The probability of a frequent itemset decreases as the size of the itemset increases. The choice of the initial attribute value influences the evolution of the frequent items chain in AMOEBA. If the probability with which the initial attribute value determines the values of other attributes is very low or zero, that initial attribute value is of no use for finding a frequent items chain. Selecting an initial attribute value whose probability of determining other attribute values is zero instead identifies the infrequent items in a data set; such an attribute value cannot be included in a frequent itemset. A decision tree is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes and utility. It helps to identify the strategy most likely to reach a goal. The Apriori and FP-Growth algorithms cost more than AMOEBA in terms of disk usage.
