International Journal of Pure and Applied Mathematics
Volume 119 No. 16 2018, 4451-4460
ISSN: 1314-3395 (on-line version)
url: http://www.acadpubl.eu/hub/
Special Issue

INVESTIGATIONS ON MODERN ALGORITHM FOR UTILITY MINING

1 Nandhini S S, 2 Kannimuthu S

1 Assistant Professor/CSE, Bannari Amman Institute of Technology, Erode, TamilNadu, India

2 Associate Professor/CSE, Karpagam College of Engineering, Coimbatore, TamilNadu, India

[email protected]

Abstract

Data mining can generate millions of patterns from a given data set. Ironically, the resulting patterns themselves need to be mined again. Most patterns identified by traditional data mining algorithms are either already known to the group that owns the data or have no commercial value and bring no profit to the group. As a result, the output is cluttered with non-profitable, unwanted, infrequent and uninteresting patterns, since the data is uncertain. These unwanted, uninteresting patterns can be removed by applying frequent itemset mining, and the remaining clutter can be cleared by applying high utility itemset mining, which yields patterns that are both frequent and profitable. This survey reviews various algorithms proposed for mining high utility itemsets from uncertain databases and compares them based on the domain, the data structure used and the data sets used for evaluation. It also gives strategies for selecting an appropriate algorithm for an application and identifies opportunities for further development in utility mining.

1. Introduction

This article surveys popular algorithms for utility mining and their further development. Data mining algorithms mine patterns, and frequent itemset mining, an extension of data mining, produces frequent patterns. In the age of Big Data, uncertainty is very common in data. Data is constantly growing in volume, variety, velocity and uncertainty. Uncertain data is found in abundance today on the web, in sensor networks, and within enterprises in both their structured and unstructured sources. Mining such uncertain data is important for discovering interesting, highly profitable itemsets. As one of the most fundamental issues of uncertain data mining, the problem of mining uncertain frequent itemsets has attracted much attention in the database and data mining communities. Although some efficient approaches for mining uncertain frequent itemsets have been proposed, most of them only consider each item in a transaction as a random variable and ignore the utility of each item in real scenarios.

Frequent pattern mining is a popular problem in data mining, which consists of finding frequent patterns in transaction databases. Many well-known algorithms are available to discover frequent itemsets, such as Apriori, FP-Growth, LCM and Eclat. Given a minimum support threshold, these algorithms return all the itemsets that appear in at least the specified minimum number of transactions.

For example, consider the following transactional database, in which each transaction lists items with their purchased quantities, together with the profit per unit of each item.

| Transaction ID | Items with quantities |
|---|---|
| P1 | a(1), b(5), c(1), d(3), e(1) |
| P2 | b(4), c(3), d(3), e(1) |
| P3 | a(1), c(1), d(1) |
| P4 | a(2), c(6), e(2) |
| P5 | b(2), c(2), e(1) |

| Item | Profit per unit |
|---|---|
| a | 5 |
| b | 2 |
| c | 1 |
| d | 2 |
| e | 3 |

For the sample transactions above with minsup set to 2, c is identified as the most frequent item since it is present in all the transactions. Frequent itemset mining, however, ignores the quantities and profits of the items. A high-utility itemset mining algorithm takes them into account and outputs all the high-utility itemsets, that is, the itemsets that generate at least "minutil" profit. For example, suppose that "minutil" is set to 25 by the user. The result of a high utility itemset mining algorithm would be the following.

High utility itemsets: {a,c}:28, {a,b,c,d,e}:25, {b,c,d}:34, {b,c,e}:37, {b,d,e}:36, {c,e}:27, {a,c,e}:31, {b,c}:28, {b,c,d,e}:40, {b,d}:30, {b,e}:31.
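The example can be checked with a short brute-force sketch. It is not one of the surveyed algorithms: it simply enumerates every itemset over the five items and sums profit times quantity over the transactions that contain the whole itemset. The names `PROFIT`, `DB`, `utility`, `support` and `high_utility_itemsets` are illustrative choices, not taken from the paper.

```python
from itertools import combinations

# Sample data copied from the two tables in the text.
PROFIT = {"a": 5, "b": 2, "c": 1, "d": 2, "e": 3}
DB = {
    "P1": {"a": 1, "b": 5, "c": 1, "d": 3, "e": 1},
    "P2": {"b": 4, "c": 3, "d": 3, "e": 1},
    "P3": {"a": 1, "c": 1, "d": 1},
    "P4": {"a": 2, "c": 6, "e": 2},
    "P5": {"b": 2, "c": 2, "e": 1},
}


def support(itemset):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for q in DB.values() if all(i in q for i in itemset))


def utility(itemset):
    """Sum of profit * quantity over the transactions containing the whole itemset."""
    total = 0
    for quantities in DB.values():
        if all(item in quantities for item in itemset):
            total += sum(PROFIT[item] * quantities[item] for item in itemset)
    return total


def high_utility_itemsets(minutil):
    """Exhaustively test all non-empty itemsets (only feasible for tiny examples)."""
    items = sorted(PROFIT)
    found = {}
    for size in range(1, len(items) + 1):
        for itemset in combinations(items, size):
            u = utility(itemset)
            if u >= minutil:
                found[itemset] = u
    return found


if __name__ == "__main__":
    for itemset, u in sorted(high_utility_itemsets(25).items()):
        print("{" + ",".join(itemset) + "}:", u)
```

With minutil set to 25, this enumeration yields the same eleven itemsets and utilities listed in the text. Exhaustive enumeration grows exponentially with the number of items, which is why the surveyed algorithms rely on pruning.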


So, the limitation of frequent itemset mining is that itemsets with actually high profit may not be discovered as interesting or frequent itemsets, while some of the frequent itemsets that are found are not interesting. Ultimately, frequent itemset mining may miss some rare patterns that are highly profitable in a transaction database.

To address these limitations, the problem of frequent itemset mining has been redefined as the problem of high-utility itemset mining. In frequent itemset mining, the Apriori property is used to prune the search space: if an itemset is infrequent, then all its supersets are also infrequent. This property does not hold for utility, since a superset of a low-utility itemset can have high utility. In the sample database, {a} has utility 20, below a minutil of 25, while its superset {a,c} has utility 28. Hence high-utility itemset mining is more challenging, and more interesting, than frequent itemset mining.
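The difference between the two measures can be checked directly on the sample database. The sketch below is only an illustration (the helper name `closure_violations` is hypothetical): it lists every pair of a subset and a superset where the superset scores strictly higher on a given measure. For support the list is empty, which is the Apriori property; for utility it is not.

```python
from itertools import combinations

# Sample database from the introduction: quantities per transaction, profit per item.
PROFIT = {"a": 5, "b": 2, "c": 1, "d": 2, "e": 3}
DB = [
    {"a": 1, "b": 5, "c": 1, "d": 3, "e": 1},
    {"b": 4, "c": 3, "d": 3, "e": 1},
    {"a": 1, "c": 1, "d": 1},
    {"a": 2, "c": 6, "e": 2},
    {"b": 2, "c": 2, "e": 1},
]


def support(itemset):
    """Number of transactions containing every item of the itemset."""
    return sum(1 for t in DB if all(i in t for i in itemset))


def utility(itemset):
    """Total profit * quantity of the itemset over the transactions containing it."""
    return sum(
        sum(PROFIT[i] * t[i] for i in itemset)
        for t in DB
        if all(i in t for i in itemset)
    )


def closure_violations(measure):
    """Return (subset, superset) pairs where the superset scores strictly higher."""
    items = sorted(PROFIT)
    itemsets = [c for n in range(1, len(items) + 1) for c in combinations(items, n)]
    return [
        (x, y)
        for x in itemsets
        for y in itemsets
        if set(x) < set(y) and measure(y) > measure(x)
    ]


if __name__ == "__main__":
    print("support violations:", len(closure_violations(support)))
    print("utility violations:", len(closure_violations(utility)))
```

Because utility is not downward closed, utility mining algorithms cannot prune on utility itself and typically prune on an upper bound instead.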

2. Algorithms to mine high utility itemset

2.1 Algorithm 1: A multi-objective evolutionary algorithm for mining frequent and high utility itemsets [6]

This algorithm aims at mining itemsets that are both frequent and of high utility. The authors relate many existing algorithms, such as FP-Growth, HUI-Miner, HUIM-ACS and TKU-Miner, through a weight parameter θ, which decides the importance of utility over support and can be set by the user. The multi-objective algorithm has two objectives, support and utility. As an evolutionary algorithm [5], it works as a maximization problem, since the ultimate aim is to find itemsets with maximum support and maximum utility. The two measures, however, conflict with each other: an itemset with high support often has low utility, and an itemset with high utility often has low support. Hence the algorithm is framed as an optimization over the two measures. This can be represented as,

$\text{Maximize } F(X) = (\mathrm{supp}(X), \mathrm{util}(X))^{T}$

where F(X) is the vector of objectives to be optimized, X represents the itemset, and the superscript T denotes the transpose of the objective vector.

Note that min_sup and min_util values are not needed here, unlike in the other algorithms mentioned, which mine itemsets whose support and utility are greater than or equal to the min_sup and min_util thresholds specified by the user.
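To make the two-objective view concrete, the sketch below computes the set of non-dominated itemsets (the Pareto front) under support and utility for the sample database of the introduction. It is not MOEA-FHUI: there is no population, crossover or mutation, only an exhaustive dominance filter, and the names `dominates` and `pareto_front` are illustrative. It does show why no min_sup or min_util threshold is needed in this formulation.

```python
from itertools import combinations

# Sample database from the introduction.
PROFIT = {"a": 5, "b": 2, "c": 1, "d": 2, "e": 3}
DB = [
    {"a": 1, "b": 5, "c": 1, "d": 3, "e": 1},
    {"b": 4, "c": 3, "d": 3, "e": 1},
    {"a": 1, "c": 1, "d": 1},
    {"a": 2, "c": 6, "e": 2},
    {"b": 2, "c": 2, "e": 1},
]


def objectives(itemset):
    """Return the objective vector (support, utility) of an itemset."""
    covering = [t for t in DB if all(i in t for i in itemset)]
    util = sum(PROFIT[i] * t[i] for t in covering for i in itemset)
    return len(covering), util


def dominates(f, g):
    """f dominates g if it is at least as good on both objectives and better on one."""
    return f[0] >= g[0] and f[1] >= g[1] and f != g


def pareto_front():
    """Exhaustively keep the itemsets whose objective vector nobody dominates."""
    items = sorted(PROFIT)
    scored = {
        c: objectives(c)
        for n in range(1, len(items) + 1)
        for c in combinations(items, n)
    }
    return {
        x: fx
        for x, fx in scored.items()
        if not any(dominates(fy, fx) for fy in scored.values())
    }


if __name__ == "__main__":
    for itemset, (sup, util) in sorted(pareto_front().items(), key=lambda kv: -kv[1][0]):
        print(itemset, "support =", sup, "utility =", util)
```

On this data the front trades support against utility step by step, from {c} (present in every transaction) to {b,c,d,e} (highest utility), which mirrors the conflict between the two measures described above.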


Two further measures are used to evaluate the quality of the itemsets recommended by this algorithm: Hypervolume (HV) and Coverage (Cov), which measure the convergence and diversity of the recommended itemsets. The algorithm has been applied to twelve real data sets, and the authors have plotted the comparison results.

According to the reported results, this algorithm (MOEA-FHUI) performs better than the other algorithms considered in recommending itemsets with comparatively high support and high utility. The other algorithms, in contrast, produce itemsets with either high support and low utility or low support and high utility when compared with the itemsets produced by MOEA-FHUI.

2.2 Algorithm 2: RUP/FRUP-Growth: An efficient algorithm for mining high utility itemsets [3]

This algorithm is designed to mine frequent and high utility itemsets. The authors propose RUP-Growth as an improvement to the UP-Growth algorithm and then develop it into the FRUP-Growth algorithm, which considers both minimum support and minimum utility thresholds. Many existing algorithms mine such itemsets, but their performance is determined by the number of candidate itemsets to be examined. The number of candidate itemsets increases as the minimum utility decreases and as the number of long transactions increases.

Here, the utility of an item is defined as the product of its internal utility and its external utility. The internal utility of an item is the quantity of the item within a transaction. The external utility is the profit value of the item, which is not recorded in the transactions themselves.

Utility is represented as,

$u(i,t) = p(i) \times q(i,t)$

where u(i,t) is the utility of item i in transaction t, p(i) (external utility) is the profit of item i irrespective of the transaction, and q(i,t) (internal utility) is the quantity of item i in transaction t. This is extended to compute the utility of an itemset X in a transaction T by adding the utilities of all the items of X in T. The utility of an itemset X in the given database is calculated by adding the utility of the itemset over all the transactions that contain it.
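Written with the same symbols, the two extensions described in words above take the following form. The symbol D for the database is an assumption introduced here for illustration, and the worked line uses the sample database from the introduction, where {a,c} occurs in P1, P3 and P4.

```latex
u(X,T) = \sum_{i \in X} u(i,T) = \sum_{i \in X} p(i) \times q(i,T)

u(X) = \sum_{T \in D,\; X \subseteq T} u(X,T)

u(\{a,c\}) = (5 \times 1 + 1 \times 1) + (5 \times 1 + 1 \times 1) + (5 \times 2 + 1 \times 6)
           = 6 + 6 + 16 = 28
```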

This approach is divided into two phases. First, the UP-Growth algorithm is improved, giving the RUP-Growth algorithm. Then, by adopting minimum support and minimum utility thresholds to mine frequent and high utility itemsets, the FRUP-Growth algorithm is obtained.

Collectively, these two improved approaches have three steps:

(i) Construct a UP-Tree

(ii) Mine candidates for frequent and high utility itemsets based on the tree from (i)

(iii) Identify the actual frequent and high utility itemsets

The approach concludes that the number of candidate itemsets should be reduced before the actual high utility itemsets are identified. According to the reported results, RUP-Growth outperforms the earlier algorithm.
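The candidate-then-verify pattern behind steps (ii) and (iii) can be sketched without a UP-Tree. The code below is not RUP-Growth or FRUP-Growth: it replaces the tree with a plain level-wise search and prunes candidates with the transaction-weighted utility (TWU), a standard upper bound on utility, together with the minimum support. All function names are hypothetical, and the data is the sample database from the introduction.

```python
# Sample database from the introduction.
PROFIT = {"a": 5, "b": 2, "c": 1, "d": 2, "e": 3}
DB = [
    {"a": 1, "b": 5, "c": 1, "d": 3, "e": 1},
    {"b": 4, "c": 3, "d": 3, "e": 1},
    {"a": 1, "c": 1, "d": 1},
    {"a": 2, "c": 6, "e": 2},
    {"b": 2, "c": 2, "e": 1},
]


def transaction_utility(t):
    """Total utility of all items in one transaction."""
    return sum(PROFIT[i] * q for i, q in t.items())


def measures(itemset):
    """Return (support, exact utility, TWU) of an itemset."""
    support = utility = twu = 0
    for t in DB:
        if all(i in t for i in itemset):
            support += 1
            utility += sum(PROFIT[i] * t[i] for i in itemset)
            twu += transaction_utility(t)
    return support, utility, twu


def mine_frequent_high_utility(min_sup, min_util):
    """Level-wise candidate generation, then exact verification."""
    items = sorted(PROFIT)
    level = [(i,) for i in items]
    candidates = []
    # Stand-in for step (ii): keep itemsets whose support and TWU pass the thresholds.
    # Both measures can only shrink when items are added, so pruning here is safe.
    while level:
        kept = []
        for x in level:
            sup, _, twu = measures(x)
            if sup >= min_sup and twu >= min_util:
                kept.append(x)
        candidates.extend(kept)
        level = sorted({x + (i,) for x in kept for i in items if i > x[-1]})
    # Step (iii): compute the exact utility of each surviving candidate.
    result = {}
    for x in candidates:
        sup, util, _ = measures(x)
        if util >= min_util:
            result[x] = (sup, util)
    return candidates, result


if __name__ == "__main__":
    cands, res = mine_frequent_high_utility(min_sup=2, min_util=25)
    print(len(cands), "candidates,", len(res), "frequent high utility itemsets")
```

The fewer candidates survive the first loop, the less exact verification is needed, which is the point the approach makes about reducing candidates before identifying the actual itemsets.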

2.3 Algorithm 3: High utility-itemset mining and privacy-preserving utility mining [2]

High utility itemset mining (HUIM) is the task of mining high utility itemsets from the candidate itemsets of a given database. Its drawback is that the mined high utility itemsets may disclose private or sensitive data. To overcome this, privacy-preserving utility mining (PPUM) is used to hide the sensitive high utility itemsets. The authors propose two evolutionary algorithms, one to find the high utility itemsets and the other to perform PPUM [3].

The evolutionary algorithm for mining high utility itemsets consists of four processes: pre-processing, particle encoding, fitness evaluation and updating. The proposed evolutionary algorithm for PPUM then hides the sensitive high utility itemsets identified by the first algorithm. It outperforms the HUPEumu-GRAM algorithm in runtime.
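The survey does not describe the encoding or the update rule, so the following is only a plausible sketch of the particle encoding and fitness evaluation steps. It assumes a binary encoding with one bit per item and takes the utility of the encoded itemset as the fitness; both are assumptions made for illustration, as are the names `decode` and `fitness`. The data is the sample database from the introduction.

```python
# Sample database from the introduction.
PROFIT = {"a": 5, "b": 2, "c": 1, "d": 2, "e": 3}
DB = [
    {"a": 1, "b": 5, "c": 1, "d": 3, "e": 1},
    {"b": 4, "c": 3, "d": 3, "e": 1},
    {"a": 1, "c": 1, "d": 1},
    {"a": 2, "c": 6, "e": 2},
    {"b": 2, "c": 2, "e": 1},
]
ITEMS = sorted(PROFIT)  # fixed item order: bit k of a particle refers to ITEMS[k]


def decode(particle):
    """Map a 0/1 vector to the itemset it encodes (assumed binary encoding)."""
    return tuple(item for item, bit in zip(ITEMS, particle) if bit)


def fitness(particle):
    """Utility of the encoded itemset in the database; 0 for the empty itemset."""
    itemset = decode(particle)
    if not itemset:
        return 0
    return sum(
        sum(PROFIT[i] * t[i] for i in itemset)
        for t in DB
        if all(i in t for i in itemset)
    )


if __name__ == "__main__":
    particle = [0, 1, 1, 1, 1]  # encodes {b, c, d, e}
    print(decode(particle), fitness(particle))
```

Under this encoding, a particle is a high utility itemset whenever its fitness reaches the minimum utility threshold, so the updating step only has to flip bits and re-evaluate the fitness.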

2.4 Algorithm 4: Efficiently mining of Effective web traversal patterns with average utility [7]

This algorithm deals with finding web traversal patterns with high average utility. The issue it addresses in existing algorithms is that they calculate the transaction weighted utility of a pattern by adding the utilities of all the transactions in which the pattern occurs, without considering the prefix of each transaction.

Usually, the utility is calculated from both the internal and the external utility. Here, only the internal utility of the transaction is considered. Also, the utility value increases with the pattern length, so a longer pattern made of low-utility pages may reach a total as high as a short pattern made of high-utility pages. Choosing patterns by high average utility therefore takes the pattern length into account and is more effective for finding interesting web traversal patterns. Ultimately, this algorithm reduces the search space for finding effective web traversal patterns. In addition, the transaction weighted utility is calculated only over the projected sequence, and not by adding the utilities of all the transactions in which the pattern occurs, which resolves the issue in the existing algorithms noted above.
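The effect of pattern length can be illustrated with a small sketch. It assumes the common definition of average utility as total utility divided by pattern length, and it reuses the itemset database from the introduction rather than real web traversal sequences, so it ignores page order and the projected-sequence calculation. The name `average_utility` is illustrative.

```python
# Sample database from the introduction (itemsets stand in for traversal patterns).
PROFIT = {"a": 5, "b": 2, "c": 1, "d": 2, "e": 3}
DB = [
    {"a": 1, "b": 5, "c": 1, "d": 3, "e": 1},
    {"b": 4, "c": 3, "d": 3, "e": 1},
    {"a": 1, "c": 1, "d": 1},
    {"a": 2, "c": 6, "e": 2},
    {"b": 2, "c": 2, "e": 1},
]


def utility(pattern):
    """Total utility of a pattern over the transactions that contain it."""
    return sum(
        sum(PROFIT[i] * t[i] for i in pattern)
        for t in DB
        if all(i in t for i in pattern)
    )


def average_utility(pattern):
    """Total utility divided by the number of items in the pattern."""
    return utility(pattern) / len(pattern)


if __name__ == "__main__":
    for pattern in [("b", "d"), ("b", "e"), ("b", "c", "d", "e")]:
        print(pattern, utility(pattern), average_utility(pattern))
```

The longest pattern has the highest total utility but the lowest average utility of the three, so ranking by average utility removes the advantage that length alone gives a pattern.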

2.5 Algorithm 5: Mining of high utility itemsets of size-2 with pruning strategies [1]

The MHUIS-2wPS algorithm utilizes the transactional experiences of retail stores and outputs size-2 clubs, that is, itemsets of two items. The MHUI-NIV algorithm caters for items with negative item values. The work applies various pruning strategies for the discovery of high utility itemsets. This pruning helps to avoid the unnecessary formation of low utility extensions.

The proposed MHUIS-2wPS algorithm follows a sequential approach to finding high utility itemsets. It builds the necessary data structures and parameters for the processing and initiates the search for clubs of items. High utility itemsets are found using the utility list, and the pruning concepts of EUCS and PUCS are then applied to keep the set of itemsets minimal. Later, the algorithm checks whether further itemset clubs can be evaluated as high utility or not in the same pass. Lastly, the formed clubs are validated using the decisions of EUCS and PUCS.
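A minimal sketch of pair pruning in the spirit of EUCS follows. It assumes EUCS in its usual sense, a structure that stores the transaction-weighted utility (TWU) of every pair of co-occurring items. PUCS, utility lists and negative item values are not modelled, and the function names are hypothetical, so this is not the MHUIS-2wPS algorithm itself. The data is the sample database from the introduction.

```python
from itertools import combinations

# Sample database from the introduction.
PROFIT = {"a": 5, "b": 2, "c": 1, "d": 2, "e": 3}
DB = [
    {"a": 1, "b": 5, "c": 1, "d": 3, "e": 1},
    {"b": 4, "c": 3, "d": 3, "e": 1},
    {"a": 1, "c": 1, "d": 1},
    {"a": 2, "c": 6, "e": 2},
    {"b": 2, "c": 2, "e": 1},
]


def build_eucs():
    """Map each co-occurring item pair to the summed utility of its transactions."""
    eucs = {}
    for t in DB:
        tu = sum(PROFIT[i] * q for i, q in t.items())
        for pair in combinations(sorted(t), 2):
            eucs[pair] = eucs.get(pair, 0) + tu
    return eucs


def size2_high_utility(minutil):
    """Prune pairs by TWU, then compute the exact utility of the survivors."""
    result = {}
    for (x, y), twu in build_eucs().items():
        if twu < minutil:
            continue  # TWU is an upper bound, so this pair cannot reach minutil
        exact = sum(
            PROFIT[x] * t[x] + PROFIT[y] * t[y] for t in DB if x in t and y in t
        )
        if exact >= minutil:
            result[(x, y)] = exact
    return result


if __name__ == "__main__":
    print(size2_high_utility(25))
```

With minutil set to 30, for example, the pair {a,b} is discarded from its TWU of 25 alone, without computing its exact utility.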

3. A Comparative study on the algorithms

| S.No. | Author | Name of the algorithm | Objective | Parameters considered | Data set utilized | Advantages | Disadvantages |
|---|---|---|---|---|---|---|---|
| 1. | Lei Zhang, Guanglong Fu, Fan Cheng, Jianfeng Qiu, Yansen Su | MOEA-FHUI (Multi-Objective Evolutionary Algorithm for mining Frequent and High Utility Itemsets) | To mine both frequent and high utility itemsets (a maximization problem) | Hypervolume, Coverage, Support, Utility | 12 real data sets are used (USCensus_10%, BMS-Web-View-1, etc.) | a. No need for minimum support and minimum utility threshold values. b. Only one run is required for multiple itemset recommendation. | a. It is not compared with algorithms that have a similar objective. b. Only frequency and quantity are considered as measures. |
| 2. | Jue Jin, Shui Wang | RUP/FRUP-Growth: An efficient algorithm for mining high utility itemsets | To mine frequent and high utility itemsets | Minimum support, minimum utility, support, utility | Chain-store dataset (California) | a. Frequency, quantity and profit are considered as measures. b. Reduces the number of candidates for high utility itemsets. | a. It requires the user to fix threshold values for minimum support and minimum utility. b. Support is not directly dealt with in the approach. |
| 3. | Jerry Chun-Wei Lin, Wensheng Gan, Philippe Fournier-Viger, Lu Yang, Qiankun Liu, Jaroslav Frnda, Lukas Sevcik, Miroslav Voznak | High utility-itemset mining and privacy-preserving utility mining | To mine high utility itemsets and hide the sensitive high utility itemsets in PPUM | Minimum utility, utility | Chess dataset, synthetic T10I4D100K dataset | a. Privacy in the high utility itemsets is preserved and sensitive itemsets are hidden. | a. Frequent itemsets are not mined. |
| 4. | Thilagu M, Nadarajan R | Efficiently mining of Effective web traversal patterns with average utility | To produce high average utility web traversal patterns | Time spent on a traversal, pattern length, minimum-average-utility | CTI, kosarak | a. Both longer patterns with less page utility and shorter patterns with high page utility are considered. b. Pattern length is considered as a parameter. | a. External utility is not considered. b. All pages are considered to have equal significance. c. Traversal patterns with backward references are not considered. |
| 5. | Gaurav Gahlot, Nagamma Patil | Mining of high utility itemsets of size-2 with pruning strategies | To find high utility itemsets by pruning | Transaction weighted utility, minimum utility | Synthetic dataset | a. Pruning is applied in identifying the high utility itemsets. | a. The comparison plot is based on only 9 transactions. |

4. Conclusion

Mining high utility itemsets from real-world datasets is gaining importance today. Because the utility of an item affects the interestingness of the resulting itemsets, utility mining emerged from data mining. In that context, itemsets with both high utility and high support provide the expected interestingness. Many algorithms have been proposed to mine frequent itemsets, and after utility mining emerged, many more algorithms were proposed to mine high utility itemsets based on quantity and profit. Here, we have analyzed a broad category of algorithms that compute frequent and high utility itemsets. Each of the algorithms is reported to outperform its reference algorithm, either in running time or in finding better frequent high utility itemsets with comparatively high support and high utility among the candidate itemsets. In future work on mining frequent high utility itemsets, the various interestingness measures used by these algorithms can be combined to get better results. The interestingness measures used here include HV, Cov, support, utility, internal utility, transactional utility, transaction weighted utility, profit, quantity and time. The combination can be further extended by assigning weightage factors to the interestingness measures, so that the importance of each measure can be changed depending on the application domain and user flexibility.


References

1. Gaurav Gahlot, Nagamma Patil, Mining of high utility itemsets of size-2 with pruning strategies and negative item values for B2C companies based on experiential marketing approach, Perspectives in Science, 8, 2016, 712-714.

2. Jerry Chun-Wei Lin, Wensheng Gan, Philippe Fournier-Viger, Lu Yang, Qiankun Liu, Jaroslav Frnda, Lukas Sevcik, Miroslav Voznak, High utility-itemset mining and privacy-preserving utility mining, Perspectives in Science, 7, 2016, 74-80.

3. Jue Jin, Shui Wang, RUP/FRUP-Growth: An efficient algorithm for mining high utility itemsets, Procedia Engineering, 174, 2017, 895-903.

4. Kannimuthu S., Premalatha K., A fast perturbation algorithm using tree structure for privacy preserving utility mining, Expert Syst. Appl., 42 (3), 2015, 1149-1165.

5. Kannimuthu S., Premalatha K., Discovery of high utility itemsets using genetic algorithm with ranked mutation, Appl. Artif. Intell., 28 (4), 2014, 337-359.

6. Lei Zhang, Guanglong Fu, Fan Cheng, Jianfeng Qiu, Yansen Su, MOEA-FHUI (Multi-Objective Evolutionary Algorithm for mining Frequent and High Utility Itemsets), Applied Soft Computing, 62, 2018, 974-986.

7. Thilagu M, Nadarajan R, Efficiently mining of Effective web traversal patterns with average utility, Procedia Technology, 6, 2012, 444-451.

8. Vinod Kumar, Ramjeevan Singh Thakur, High Fuzzy Utility Strategy Based Webpages sets mining from weblog database, International Journal of Intelligent Engineering and Systems, 2017.
