INVESTIGATIONS ON MODERN ALGORITHM FOR UTILITY MINING
1Nandhini S S, 2Kannimuthu S
1Assistant Professor/CSE
Bannari Amman Institute of Technology
Erode, TamilNadu, India
2Associate Professor/CSE
Karpagam College of Engineering
Coimbatore, TamilNadu, India
[email protected]
Abstract
Data mining can generate millions of patterns from a given data set. The irony is that
the resulting patterns themselves need to be mined again. Most patterns identified by
traditional data mining algorithms are either already known to the group that owns the data,
or have no commercial value and bring no profit to the group. Because the data is uncertain,
the resulting patterns are cluttered with non-profitable, unwanted, infrequent and
uninteresting patterns. Some of these unwanted patterns can be removed by applying frequent
itemset mining. To clear the remaining clutter, high utility itemset mining can be applied,
which yields patterns that are both frequent and profitable. This survey reviews various
algorithms proposed for mining high utility itemsets from uncertain databases and compares
them by domain, data structure used and data set used. It also gives strategies for selecting
an appropriate algorithm for an application and identifies opportunities for further
development in utility mining.
1. Introduction
This article surveys popular algorithms for utility mining and their further development.
Data mining algorithms mine patterns, and frequent itemset mining, an extension of data
mining, produces frequent patterns. In the age of Big Data, uncertainty is very common in
data. Data is constantly growing in volume, variety, velocity and uncertainty. Uncertain data
is found in abundance today on the web, in sensor networks, and within enterprises in both
structured and unstructured sources. Mining such uncertain data is important for discovering
interesting, highly profitable itemsets. As one of the most fundamental issues of uncertain
data mining, the problem of mining uncertain frequent itemsets has attracted much attention in
the database and data mining communities. Although some efficient approaches for mining
uncertain frequent itemsets have been proposed, most of them only consider each item in
a transaction as a random variable and ignore the utility of each item in real scenarios.
International Journal of Pure and Applied Mathematics, Volume 119 No. 16 2018, 4451-4460, ISSN: 1314-3395 (on-line version), url: http://www.acadpubl.eu/hub/ Special Issue
Frequent pattern mining is a popular problem in data mining that consists of finding
frequent patterns in transaction databases. Many well-known algorithms are available to
discover frequent itemsets, such as Apriori, FP-Growth, LCM and Eclat. Given a minimum support
threshold (minsup), these algorithms return all itemsets that appear in at least minsup
transactions.
For example, consider the following transactional database, in which each transaction lists
items with their purchased quantities, together with the profit per unit of each item.

Transaction ID    Items with quantities
P1                a(1), b(5), c(1), d(3), e(1)
P2                b(4), c(3), d(3), e(1)
P3                a(1), c(1), d(1)
P4                a(2), c(6), e(2)
P5                b(2), c(2), e(1)

Item    Profit per unit
a       5
b       2
c       1
d       2
e       3

For the sample transactions above with a minsup value of 2, c is identified as the
most frequent item, since it is present in all the transactions. Frequent itemset mining,
however, ignores quantities and profits. A high-utility itemset mining algorithm takes the
quantities and profits of the items into account and outputs all the high-utility itemsets,
that is, the itemsets that generate at least "minutil" profit. For example, suppose that
"minutil" is set to 25 by the user. The result of a high utility itemset mining algorithm
would be the following.
High utility itemsets: {a,c}:28, {a,b,c,d,e}:25, {b,c,d}:34, {b,c,e}:37, {b,d,e}:36, {c,e}:27,
{a,c,e}:31, {b,c}:28, {b,c,d,e}:40, {b,d}:30, {b,e}:31.
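The list of high utility itemsets in this example can be checked by brute force. The following is a minimal sketch that enumerates every itemset over the five items and keeps those whose total profit reaches minutil. It is meant only for verifying the example: none of the surveyed algorithms enumerate itemsets this way, and the function names are illustrative.

```python
from itertools import combinations

# Sample database from the example: transaction ID -> {item: quantity}.
transactions = {
    "P1": {"a": 1, "b": 5, "c": 1, "d": 3, "e": 1},
    "P2": {"b": 4, "c": 3, "d": 3, "e": 1},
    "P3": {"a": 1, "c": 1, "d": 1},
    "P4": {"a": 2, "c": 6, "e": 2},
    "P5": {"b": 2, "c": 2, "e": 1},
}
profit = {"a": 5, "b": 2, "c": 1, "d": 2, "e": 3}


def utility(itemset, db, unit_profit):
    """Total profit of an itemset over the transactions that contain all its items."""
    total = 0
    for quantities in db.values():
        if all(item in quantities for item in itemset):
            total += sum(quantities[item] * unit_profit[item] for item in itemset)
    return total


def high_utility_itemsets(db, unit_profit, minutil):
    """Enumerate every non-empty itemset and keep those with utility >= minutil."""
    items = sorted(unit_profit)
    result = {}
    for size in range(1, len(items) + 1):
        for itemset in combinations(items, size):
            u = utility(itemset, db, unit_profit)
            if u >= minutil:
                result[itemset] = u
    return result


huis = high_utility_itemsets(transactions, profit, minutil=25)
for itemset, u in sorted(huis.items()):
    print("{" + ",".join(itemset) + "}:" + str(u))
```

With minutil set to 25 this yields the same eleven itemsets and utility values as listed in the example.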
So, the limitation of frequent itemset mining is that itemsets with high actual profit
may not be discovered as frequent or interesting, while some of the frequent itemsets it does
find are not interesting. Ultimately, frequent itemset mining may miss rare patterns that
are highly profitable in a transaction database.
To address these limitations, the problem of frequent itemset mining has been redefined
as the problem of high-utility itemset mining. To prune the search space, frequent
itemset mining uses the Apriori property, which says that if an itemset is infrequent then all
of its supersets are also infrequent. This property does not hold for utility: a superset of a
low-utility itemset can still be a high-utility itemset. Hence high-utility itemset mining is
harder and more interesting than frequent itemset mining.
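The difference between the two measures can be seen directly on the sample database of the introduction. The sketch below, with illustrative helper names, compares an itemset with one of its supersets: support never increases when an item is added, whereas utility can increase.

```python
# Sample database from the introduction: {item: quantity} per transaction.
transactions = [
    {"a": 1, "b": 5, "c": 1, "d": 3, "e": 1},
    {"b": 4, "c": 3, "d": 3, "e": 1},
    {"a": 1, "c": 1, "d": 1},
    {"a": 2, "c": 6, "e": 2},
    {"b": 2, "c": 2, "e": 1},
]
profit = {"a": 5, "b": 2, "c": 1, "d": 2, "e": 3}


def support(itemset):
    """Number of transactions that contain every item of the itemset."""
    return sum(all(i in t for i in itemset) for t in transactions)


def utility(itemset):
    """Sum of quantity * unit profit over the transactions containing the itemset."""
    total = 0
    for t in transactions:
        if all(i in t for i in itemset):
            total += sum(t[i] * profit[i] for i in itemset)
    return total


small, large = ("e",), ("b", "e")

# Support can only stay equal or shrink when an item is added ...
print("support:", support(small), "->", support(large))
# ... but utility can grow, so a low-utility itemset cannot simply be pruned.
print("utility:", utility(small), "->", utility(large))
```

With minutil set to 25, {e} is not a high utility itemset but its superset {b,e} is, so pruning {e} in the Apriori manner would lose a valid result.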
2. Algorithms to mine high utility itemsets
2.1 Algorithm 1: A multi-objective evolutionary algorithm for mining frequent and high
utility itemsets [6]
This algorithm aims at mining itemsets that are both frequent and of high utility. Existing
algorithms such as FP-Growth, HUI-Miner, HUIM-ACS and TKU-Miner are quoted
in relation to a weight parameter θ, which decides the importance
of utility over support and can be set by the user. The multi-objective algorithm has
two objectives, support and utility. As an evolutionary algorithm [5], it is posed as a
maximization problem, since the ultimate aim is to find itemsets with maximum support and
maximum utility. The irony is that the two measures (support and utility) conflict with each
other: an itemset with high support tends to have low utility, and an itemset with high
utility often has low support. Hence the algorithm is framed as an optimization over the two
measures.
This can be represented as,
Maximize F(X) = (supp(X), util(X))^T
where F(X) is the vector of the two objective functions, X represents the itemset, and the
superscript T denotes the transpose.
Note that, unlike the other algorithms mentioned, this algorithm does not need min_sup and
min_util values. Those algorithms mine itemsets whose support and utility are greater than or
equal to the min_sup and min_util threshold values specified by the user.
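To illustrate what maximizing support and utility together means without thresholds, the sketch below computes the set of non-dominated itemsets (the Pareto front) of the introduction's sample database. It is not MOEA-FHUI: the evolutionary search is replaced here by exhaustive enumeration, which is feasible only because the example has five items, and all function names are illustrative.

```python
from itertools import combinations

# Sample database from the introduction: {item: quantity} per transaction.
transactions = [
    {"a": 1, "b": 5, "c": 1, "d": 3, "e": 1},
    {"b": 4, "c": 3, "d": 3, "e": 1},
    {"a": 1, "c": 1, "d": 1},
    {"a": 2, "c": 6, "e": 2},
    {"b": 2, "c": 2, "e": 1},
]
profit = {"a": 5, "b": 2, "c": 1, "d": 2, "e": 3}


def support_and_utility(itemset):
    """Return (support, utility) of an itemset over the whole database."""
    sup, util = 0, 0
    for t in transactions:
        if all(i in t for i in itemset):
            sup += 1
            util += sum(t[i] * profit[i] for i in itemset)
    return sup, util


def dominates(p, q):
    """p dominates q if it is no worse on both objectives and differs from q."""
    return p[0] >= q[0] and p[1] >= q[1] and p != q


# Score every itemset that occurs at least once.
items = sorted(profit)
scores = {}
for size in range(1, len(items) + 1):
    for itemset in combinations(items, size):
        score = support_and_utility(itemset)
        if score[0] > 0:
            scores[itemset] = score

# Keep the itemsets that no other itemset dominates.
front = {
    x: s for x, s in scores.items()
    if not any(dominates(other, s) for other in scores.values())
}
for itemset, (sup, util) in sorted(front.items(), key=lambda kv: -kv[1][0]):
    print(itemset, "support =", sup, "utility =", util)
```

On this data the front contains four itemsets, ranging from {c} (highest support, lowest utility) to {b,c,d,e} (lowest support, highest utility), which shows the conflict between the two objectives described above.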
Two metrics are used to evaluate the quality of the itemsets recommended by this
algorithm: Hypervolume (HV) and Coverage (Cov), which measure the convergence and diversity
of the recommended itemsets. The algorithm has been applied to twelve real data sets, and the
authors plot the comparison results.
According to the reported results, this algorithm works better than the other algorithms
considered at recommending itemsets with comparatively high support and high utility. The
other algorithms produce itemsets with either high support and low utility or low
support and high utility when compared to the itemsets produced by MOEA-FHUI.
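For two objectives, hypervolume has a simple geometric reading: it is the area dominated by the recommended itemsets, measured from a reference point. The sketch below computes this generic two-dimensional hypervolume for maximization. The reference point (0, 0) and the absence of normalization are assumptions, the surveyed paper may define HV differently, and Cov is not shown.

```python
def hypervolume_2d(points, reference=(0.0, 0.0)):
    """Area dominated by `points` (both objectives maximized) above `reference`."""
    # Shift to the reference point and drop points that do not improve on it.
    shifted = [
        (x - reference[0], y - reference[1])
        for x, y in points
        if x > reference[0] and y > reference[1]
    ]
    # Sweep from the largest first objective downwards; each point adds the
    # strip of second-objective values that is not covered yet.
    shifted.sort(key=lambda p: p[0], reverse=True)
    area, covered_height = 0.0, 0.0
    for width, height in shifted:
        if height > covered_height:
            area += width * (height - covered_height)
            covered_height = height
    return area


# (support, utility) of {c}, {c,e}, {b,c,e} and {b,c,d,e} in the
# sample database of the introduction.
recommended = [(5, 13), (4, 27), (3, 37), (2, 40)]
print(hypervolume_2d(recommended))
```

A larger value means that the recommended itemsets jointly cover more of the support and utility space. Adding an itemset that is dominated by one already in the list leaves the value unchanged.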
2.2 Algorithm 2: RUP/FRUP-Growth: An efficient algorithm for mining high utility
itemsets [3]
This algorithm is designed to mine frequent and high utility itemsets. The authors propose an
improvement to the UP-Growth algorithm, called RUP-Growth, which is then developed into the
FRUP-Growth algorithm. FRUP-Growth considers both minimum support and minimum utility
threshold values. Many existing algorithms mine such itemsets, but their performance is
determined by the number of candidate itemsets to be examined. The number of candidate
itemsets grows as the minimum utility decreases and as the number of long transactions
increases.
Here, the utility of an item is defined as the product of its internal utility and its
external utility. The internal utility of an item is the quantity of the item within a
transaction. The external utility is the profit value of the item, which is not recorded in
the transactions.
Utility is represented as,
u(i,t) = p(i) × q(i,t)
where u(i,t) is the utility of item i in transaction t, p(i) (external utility) is the profit
of item i irrespective of the transaction, and q(i,t) (internal utility) is the quantity of
item i in transaction t. This is extended to the utility of an itemset X in a transaction t by
adding the utilities of all the items of X in that transaction. The utility of an itemset X in
the given database is calculated by adding the utility of X over all the transactions that
contain it.
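Written out in one place, the three levels of utility described above are as follows. The symbol D for the database and the notation u(X,t) and u(X) are introduced here for illustration and do not appear in the text.

```latex
\begin{align*}
u(i,t) &= p(i) \times q(i,t) \\
u(X,t) &= \sum_{i \in X} u(i,t), \qquad X \subseteq t \\
u(X)   &= \sum_{t \in D,\; X \subseteq t} u(X,t)
\end{align*}
```

As a worked example on the sample database of the introduction, u(b,P1) = 2 × 5 = 10 and u(d,P1) = 2 × 3 = 6, so u({b,d},P1) = 16. Likewise u({b,d},P2) = 8 + 6 = 14. Since P1 and P2 are the only transactions containing both b and d, u({b,d}) = 16 + 14 = 30.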
The approach is divided into two phases. First, the UP-Growth algorithm is improved, and
the result is referred to as the RUP-Growth algorithm. Then, by adopting minimum support and
minimum utility threshold values to mine frequent and high utility itemsets, RUP-Growth
evolves into the FRUP-Growth algorithm.
Collectively, these two improved approaches have three steps:
(i) Construct a UP-Tree
(ii) Mine candidates for frequent and high utility itemsets based on the tree from (i)
(iii) Identify the actual frequent and high utility itemsets
The approach concludes that the number of candidate itemsets should be reduced before the
actual high utility itemsets are identified. According to the results quoted, RUP-Growth
outperforms the earlier algorithm.
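The candidate-then-verify structure of steps (ii) and (iii) can be sketched without the UP-Tree itself. The code below is not RUP-Growth or FRUP-Growth: it replaces the tree with a plain level-wise search, and it generates candidates with the transaction-weighted utility (TWU), a standard overestimate of utility that, like support, can only shrink when items are added. The database is the sample from the introduction; the thresholds and function names are illustrative assumptions.

```python
# Sample database from the introduction: {item: quantity} per transaction.
transactions = [
    {"a": 1, "b": 5, "c": 1, "d": 3, "e": 1},
    {"b": 4, "c": 3, "d": 3, "e": 1},
    {"a": 1, "c": 1, "d": 1},
    {"a": 2, "c": 6, "e": 2},
    {"b": 2, "c": 2, "e": 1},
]
profit = {"a": 5, "b": 2, "c": 1, "d": 2, "e": 3}
MIN_SUP, MIN_UTIL = 2, 30  # illustrative thresholds


def containing(itemset):
    """Transactions that contain every item of the itemset."""
    return [t for t in transactions if all(i in t for i in itemset)]


def twu(itemset):
    """Transaction-weighted utility: total utility of the containing transactions."""
    return sum(q * profit[i] for t in containing(itemset) for i, q in t.items())


def utility(itemset):
    """Exact utility: quantity * profit of the itemset's own items only."""
    return sum(t[i] * profit[i] for t in containing(itemset) for i in itemset)


def mine_candidates():
    """Level-wise search keeping itemsets whose support and TWU pass the thresholds."""
    items = sorted(profit)
    level = [(i,) for i in items]
    candidates = []
    while level:
        kept = [
            x for x in level
            if len(containing(x)) >= MIN_SUP and twu(x) >= MIN_UTIL
        ]
        candidates.extend(kept)
        # Only kept itemsets are extended; a rejected itemset cannot have a
        # superset that passes, because support and TWU never increase.
        level = [x + (i,) for x in kept for i in items if i > x[-1]]
    return candidates


# Step (ii): candidates. Step (iii): exact check of each candidate.
candidates = mine_candidates()
frequent_huis = {x: utility(x) for x in candidates if utility(x) >= MIN_UTIL}
print(len(candidates), "candidates ->", frequent_huis)
```

The fewer candidates the first stage produces, the fewer exact utility computations the second stage needs, which is the motivation given above for reducing the number of candidates.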
2.3 Algorithm 3: High utility-itemset mining and privacy-preserving utility mining [2]
High utility itemset mining (HUIM) is the task of mining high utility itemsets from the
candidate itemsets of a given database. Its drawback is that the mined high utility itemsets
may disclose private or secure data. To overcome this, privacy-preserving utility mining
(PPUM) is used to hide the private high utility itemsets that were mined. The authors propose
two evolutionary algorithms, one to find the high utility itemsets and the other to perform
PPUM [3].
The evolutionary algorithm for mining high utility itemsets consists of four processes:
pre-processing, particle encoding, fitness evaluation and updating. Similarly,
the proposed evolutionary algorithm for PPUM hides the sensitive private high utility
itemsets identified by the first evolutionary algorithm. The approach is reported to
outperform the HUPEumu-GRAM algorithm in runtime.
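The first three processes can be illustrated with a small sketch. It assumes a binary particle encoding (one bit per retained item), a fitness equal to the utility of the encoded itemset, and a pre-processing step that keeps only items whose transaction-weighted utility reaches the threshold. These are assumptions made for illustration rather than details taken from the surveyed paper, and the updating process that moves the particles is omitted.

```python
# Sample database from the introduction: {item: quantity} per transaction.
transactions = [
    {"a": 1, "b": 5, "c": 1, "d": 3, "e": 1},
    {"b": 4, "c": 3, "d": 3, "e": 1},
    {"a": 1, "c": 1, "d": 1},
    {"a": 2, "c": 6, "e": 2},
    {"b": 2, "c": 2, "e": 1},
]
profit = {"a": 5, "b": 2, "c": 1, "d": 2, "e": 3}
MIN_UTIL = 25  # illustrative threshold


def preprocess(min_util):
    """Pre-processing (assumed rule): keep items whose TWU reaches min_util."""
    kept = []
    for item in sorted(profit):
        twu = sum(
            sum(q * profit[i] for i, q in t.items())
            for t in transactions
            if item in t
        )
        if twu >= min_util:
            kept.append(item)
    return kept


def decode(particle, items):
    """Particle encoding: bit k set to 1 means items[k] belongs to the itemset."""
    return tuple(item for bit, item in zip(particle, items) if bit)


def fitness(particle, items):
    """Fitness evaluation: utility of the itemset that the particle encodes."""
    itemset = decode(particle, items)
    if not itemset:
        return 0
    total = 0
    for t in transactions:
        if all(i in t for i in itemset):
            total += sum(t[i] * profit[i] for i in itemset)
    return total


items = preprocess(MIN_UTIL)
particle = [0, 1, 0, 1, 0]  # encodes {b, d} over the items a..e
print(decode(particle, items), fitness(particle, items))
```

A particle whose fitness reaches the minimum utility threshold corresponds to a high utility itemset; in this example the particle for {b,d} has fitness 30.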
2.4 Algorithm 4: Efficiently mining of Effective web traversal patterns with average utility
[7]
This algorithm finds web traversal patterns with high average utility. Existing
algorithms calculate the transaction weighted utility of a pattern by adding the utility of
all the transactions in which the pattern occurs, without considering the prefix of the
transaction. The proposed algorithm addresses this issue: the transaction weighted utility is
calculated only from the projected sequence.
Usually, utility is calculated as the product of the internal and external utility. Here,
only the internal utility of the transaction is considered. Also, because utility increases
with pattern length, a longer pattern with low utility per page may reach utility values as
high as those of a short pattern with high utility per page. Choosing patterns by high average
utility is therefore a more effective way to find interesting web traversal patterns, since it
takes pattern length into account. Ultimately, the algorithm reduces the search space for
finding effective web traversal patterns.
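The effect of averaging over the pattern length can be shown with a small sketch. The sessions below are made-up data, the utility of a page visit is taken to be the seconds spent on it, and a pattern is matched as a contiguous run of pages; these are illustrative assumptions, not details of the surveyed algorithm.

```python
# Hypothetical web sessions: ordered (page, seconds spent) pairs.
sessions = [
    [("home", 5), ("news", 25)],
    [("home", 4), ("news", 28)],
    [("home", 6), ("shop", 10), ("cart", 8), ("pay", 8)],
    [("home", 5), ("shop", 12), ("cart", 9), ("pay", 6)],
]


def occurrence_utility(pattern, session):
    """Seconds spent on the first contiguous occurrence of pattern, or 0 if absent."""
    pages = [page for page, _ in session]
    n = len(pattern)
    for start in range(len(pages) - n + 1):
        if tuple(pages[start:start + n]) == tuple(pattern):
            return sum(seconds for _, seconds in session[start:start + n])
    return 0


def utility(pattern):
    """Total utility of a traversal pattern over all sessions."""
    return sum(occurrence_utility(pattern, s) for s in sessions)


def average_utility(pattern):
    """Utility divided by pattern length, so long patterns get no automatic advantage."""
    return utility(pattern) / len(pattern)


short = ("home", "news")
long_ = ("home", "shop", "cart", "pay")
print(utility(short), average_utility(short))
print(utility(long_), average_utility(long_))
```

By total utility the longer pattern ranks higher, while by average utility the shorter pattern does, because each of its pages holds the visitor's attention for longer.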
2.5 Algorithm 5: Mining of high utility itemsets of size-2 with pruning strategies [1]
The MHUIS-2wPS algorithm uses the transactional records of retail stores and
outputs size-2 clubs of items, that is, high utility itemsets of size 2. The MHUI-NIV
algorithm caters for items with negative item values. The work applies various pruning
strategies to the discovery of high utility itemsets; pruning avoids forming unnecessary low
utility extensions.
The proposed MHUIS-2wPS algorithm follows a sequential approach to finding high utility
itemsets. The high utility itemsets are found using the utility list. Then, by applying
the pruning concepts of EUCS and PUCS, the itemsets are reduced to a minimal set of high
utility itemsets. The algorithm builds the data structures and parameters needed for
processing and starts finding the clubs of items. Later, it checks the remaining itemset
clubs that can be classified at this stage as high utility or not. Lastly, the formed clubs
are validated using the decisions of EUCS and PUCS.
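The pruning idea can be sketched for size-2 itemsets as follows, assuming that EUCS refers to the Estimated Utility Co-occurrence Structure known from the FHM algorithm, which stores the transaction-weighted utility of every pair of items. PUCS is not shown. The database is the sample from the introduction, and the threshold and function names are illustrative.

```python
from itertools import combinations

# Sample database from the introduction: {item: quantity} per transaction.
transactions = [
    {"a": 1, "b": 5, "c": 1, "d": 3, "e": 1},
    {"b": 4, "c": 3, "d": 3, "e": 1},
    {"a": 1, "c": 1, "d": 1},
    {"a": 2, "c": 6, "e": 2},
    {"b": 2, "c": 2, "e": 1},
]
profit = {"a": 5, "b": 2, "c": 1, "d": 2, "e": 3}
MIN_UTIL = 30  # illustrative threshold


def build_eucs():
    """Map each co-occurring item pair to the summed utility of its transactions."""
    eucs = {}
    for t in transactions:
        transaction_utility = sum(q * profit[i] for i, q in t.items())
        for pair in combinations(sorted(t), 2):
            eucs[pair] = eucs.get(pair, 0) + transaction_utility
    return eucs


def pair_utility(pair):
    """Exact utility of a size-2 itemset."""
    x, y = pair
    return sum(
        t[x] * profit[x] + t[y] * profit[y]
        for t in transactions
        if x in t and y in t
    )


eucs = build_eucs()
# A pair whose stored value is below the threshold cannot be a high utility
# itemset, and neither can any itemset that contains it.
pruned = sorted(pair for pair, bound in eucs.items() if bound < MIN_UTIL)
size2_huis = {
    pair: pair_utility(pair)
    for pair, bound in eucs.items()
    if bound >= MIN_UTIL and pair_utility(pair) >= MIN_UTIL
}
print("pruned:", pruned)
print("size-2 high utility itemsets:", size2_huis)
```

The stored value is an upper bound on the exact utility, so the check is safe: it may let through pairs that later fail the exact test, but it never discards a true high utility itemset.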
3. A Comparative study on the algorithms
S.No. 1
Author: Lei Zhang, Guanglong Fu, Fan Cheng, Jianfeng Qiu, Yansen Su
Name of the algorithm: MOEA-FHUI (Multi-Objective Evolutionary Algorithm for mining Frequent and High Utility Itemsets)
Objective: To mine both frequent and high utility itemsets (a maximization problem)
Parameters considered: Hypervolume, Coverage, Support, Utility
Data set utilized: 12 real data sets (USCensus_10%, BMS-Web-View-1, etc.)
Advantages: a. No need for minimum support and minimum utility threshold values. b. Only one run is required for multiple itemset recommendation.
Disadvantages: a. It is not compared with algorithms that have a similar objective. b. Only frequency and quantity are considered as measures.

S.No. 2
Author: Jue Jin, Shui Wang
Name of the algorithm: RUP/FRUP-Growth: An efficient algorithm for mining high utility itemsets
Objective: To mine frequent and high utility itemsets
Parameters considered: Minimum support, minimum utility, support, utility
Data set utilized: Chain-store dataset (California)
Advantages: a. Frequency, quantity and profit are considered as measures. b. Reduces the number of candidates for high utility itemsets.
Disadvantages: a. It requires the user to fix threshold values for minimum support and minimum utility. b. Support is not directly dealt with in the approach.

S.No. 3
Author: Jerry Chun-Wei Lin, Wensheng Gan, Philippe Fournier-Viger, Lu Yang, Qiankun Liu, Jaroslav Frnda, Lukas Sevcik, Miroslav Voznak
Name of the algorithm: High utility-itemset mining and privacy-preserving utility mining
Objective: To mine high utility itemsets and hide the sensitive high utility itemsets in PPUM
Parameters considered: Minimum utility, utility
Data set utilized: Chess dataset, synthetic T10I4D100K dataset
Advantages: a. Privacy in the high utility itemsets is preserved by hiding sensitive itemsets.
Disadvantages: a. Frequent itemsets are not mined.

S.No. 4
Author: Thilagu M, Nadarajan R
Name of the algorithm: Efficiently mining of Effective web traversal patterns with average utility
Objective: To produce high average utility web traversal patterns
Parameters considered: Time spent on a traversal, pattern length, minimum average utility
Data set utilized: CTI, kosarak
Advantages: a. Both longer patterns with less page utility and shorter patterns with high page utility are considered. b. Pattern length is considered as a parameter.
Disadvantages: a. External utility is not considered. b. All pages are considered to have equal significance. c. Traversal patterns with backward references are not considered.

S.No. 5
Author: Gaurav Gahlot, Nagamma Patil
Name of the algorithm: Mining of high utility itemsets of size-2 with pruning strategies
Objective: To find high utility itemsets by pruning
Parameters considered: Transaction weighted utility, minimum utility
Data set utilized: Synthetic dataset
Advantages: a. Pruning is applied in identifying the high utility itemsets.
Disadvantages: a. The comparison plot is based on only 9 transactions.
4. Conclusion
Mining high utility itemsets from real-world data sets is gaining importance today. Because
the utility of an item affects the interestingness of the resulting itemsets, utility mining
emerged from data mining. In that context, itemsets with both high utility and high support
deliver the interestingness that users expect. Many algorithms have been proposed to mine
frequent itemsets, and since utility mining emerged, many more have been proposed that use
quantity and profit to mine high utility itemsets. Here, we have analyzed a broad category of
algorithms that compute frequent and high utility itemsets. All the algorithms are reported to
outperform their respective reference algorithms, either in running time or in finding better
frequent high utility itemsets with comparatively high support and high utility among the
candidate itemsets. Going forward, in mining frequent high utility itemsets, the various
interestingness measures used by these algorithms could be used together to get better
results. The interestingness measures used here include HV, Cov, support, utility, internal
utility, transactional utility, transaction weighted utility, profit, quantity and time. The
combination can be extended further by giving weightage factors to the interestingness
measures, so that the importance of each measure can be changed depending on the application
domain and the user's needs.
References
1. Gaurav Gahlot, Nagamma Patil, Mining of high utility itemsets of size-2 with pruning
strategies and negative item values for B2C companies based on experiential marketing
approach, Perspectives in Science, 8, 2016, 712-714.
2. Jerry Chun-Wei Lin, Wensheng Gan, Philippe Fournier-Viger, Lu Yang, Qiankun Liu,
Jaroslav Frnda, Lukas Sevcik, Miroslav Voznak, High utility-itemset mining and privacy-
preserving utility mining, Perspectives in Science, 7, 2016, 74-80.
3. Jue Jin, Shui Wang, RUP/FRUP-Growth: An efficient algorithm for mining high utility
itemsets, Procedia Engineering, 174, 2017, 895-903.
4. Kannimuthu S, Premalatha K, A fast perturbation algorithm using tree structure
for privacy preserving utility mining, Expert Systems with Applications, 42 (3), 2015, 1149-1165.
5. Kannimuthu S, Premalatha K, Discovery of high utility itemsets using genetic
algorithm with ranked mutation, Applied Artificial Intelligence, 28 (4), 2014, 337-359.
6. Lei Zhang, Guanglong Fu, Fan Cheng, Jianfeng Qiu, Yansen Su, MOEA-FHUI (Multi-
Objective Evolutionary Algorithm for mining Frequent and High Utility Itemsets),
Applied Soft Computing, 62, 2018, 974-986.
7. Thilagu M, Nadarajan R, Efficiently mining of Effective web traversal patterns with
average utility, Procedia Technology, 6, 2012, 444-451.
8. Vinod kumar, Ramjeevan Singh Thakur, High Fuzzy Utility Strategy Based Webpages
sets mining from weblog database, International Journal of Intelligent Engineering and
Systems, 2017.