
Dr. G. Mathivanan* et al. International Journal Of Pharmacy & Technology

IJPT| June-2016 | Vol. 8 | Issue No.2 | 12088-12099 Page 12088

ISSN: 0975-766X CODEN: IJPTFI

Available Online through www.ijptonline.com

Research Article

MINING CLOSED HIGH UTILITY DATASETS FOR CONCISE AND LOSSLESS TRANSACTIONS

R. Divya1, Dr. G. Mathivanan*2

1PG Student, Department of Information Technology, Sathyabama University, Chennai.

2Head of the Department, Department of Information Technology, Sathyabama University, Chennai.

Email: [email protected]

Received on 29-04-2016 Accepted on 29-05-2016

Abstract

Data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives, and frequent itemset mining is one of its fundamental research topics. Utility mining is an important data mining task. Although several studies have been carried out, current methods may present too many high utility itemsets to users, which degrades the performance of the mining task in terms of execution time and memory efficiency. This paper introduces a novel framework for analyzing frequent and non-frequent itemsets using the MinEx algorithm and pruning techniques. The framework finds the high utility itemsets among the transactions using the MinEx algorithm, which explores the itemset lattice level-wise, starting from the empty set and stopping at the level of the largest frequent free-sets. Free-sets can be extracted efficiently, even on dense datasets. The framework also mines closed high utility itemsets, which serve as a compact and lossless representation of high utility itemsets. Using the efficient algorithms DAHU (Derive All High Utility itemsets) and MinEx, the approach outperforms the state-of-the-art algorithm.

Keyword: Utility mining; frequent item set; closed+ high utility item set; lossless and concise representation.

I. Introduction

Frequent itemset mining (abbreviated as FIM) is a fundamental research topic in data mining. One of its popular applications is market basket analysis, which refers to the discovery of sets of items (itemsets) that are frequently purchased together by customers. However, in this application, the traditional model of FIM may discover a large number of frequent itemsets with low profit and lose the information on valuable itemsets having low selling frequencies. These problems are caused by the facts that (1) FIM treats all items as having the same importance/unit profit/weight and (2) it assumes that every item in a transaction appears in a binary form, i.e., an item can be either present or absent in a transaction, which does not indicate its purchase quantity in the transaction. Hence, FIM cannot satisfy the requirement of users who want to discover itemsets with high utilities such as high profits.

To address these problems, utility mining emerges as an important topic in data mining. In utility mining, each item has a weight (e.g. unit profit) and can appear more than once in each transaction. The utility of an itemset represents its importance, which can be measured in terms of weight, profit, cost, quantity or other information depending on the user preference. An itemset is called a high utility itemset (abbreviated as HUI) if its utility is no less than a user-specified minimum utility threshold. Utility mining has a wide range of applications such as website click stream analysis [5], cross-marketing analysis [6] and biomedical domains.

However, mining HUIs is not an easy task since the downward closure property [1] in FIM does not hold in utility mining. The search space cannot be directly pruned to find HUIs as in FIM because a superset of a low utility itemset can be a high utility itemset. Several studies [2] were proposed for mining HUIs, but they often present a large number of high utility itemsets to users, so that comprehension of the results becomes difficult. Meanwhile, the algorithms become inefficient in terms of time and memory requirement. In particular, the performance of the mining task decreases greatly under low minimum utility thresholds or dense databases.
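The utility definition and the failure of downward closure can be illustrated with a minimal Python sketch. The toy database, the unit profits, the threshold of 12 and the names `PROFIT`, `DATABASE`, `utility` and `high_utility_itemsets` are assumptions made for this illustration only; they do not come from the paper, and the brute-force enumeration is not one of the algorithms it discusses.

```python
from itertools import combinations

# Hypothetical toy data, not taken from the paper.
PROFIT = {"a": 5, "b": 2, "c": 1, "d": 3}   # unit profit of each item
DATABASE = [                                 # item -> purchase quantity
    {"a": 1, "b": 2, "c": 1},
    {"a": 2, "c": 1},
    {"b": 1, "c": 3},
    {"a": 1, "b": 1, "c": 1, "d": 1},
]


def utility(itemset):
    """Quantity * unit profit of the itemset's items, summed over every
    transaction that contains the whole itemset."""
    return sum(
        transaction[item] * PROFIT[item]
        for transaction in DATABASE
        if all(i in transaction for i in itemset)
        for item in itemset
    )


def high_utility_itemsets(min_utility):
    """Brute-force enumeration of all HUIs (exponential, illustration only)."""
    items = sorted(PROFIT)
    result = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            value = utility(combo)
            if value >= min_utility:
                result[frozenset(combo)] = value
    return result


MIN_UTILITY = 12
HUIS = high_utility_itemsets(MIN_UTILITY)

# {b} and {c} are low utility, yet their superset {b, c} is a HUI,
# so a low utility itemset cannot be used to prune its supersets.
print(utility({"b"}), utility({"c"}), utility({"b", "c"}))
```

On this toy data, the itemsets {b} and {c} fall below the threshold while {b, c} reaches it, which is the situation that prevents support-style pruning.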

To reduce the computational cost in FIM while presenting fewer and more important patterns to users, many studies developed concise representations, such as free sets [3], non-derivable sets [4], maximal itemsets and closed itemsets. These representations successfully reduce the number of itemsets found, but they were developed for frequent itemset mining instead of high utility itemset mining. Therefore, an important research question is “Is it possible to conceive a compact and lossless representation of high utility itemsets inspired by these representations to address the same problems in HUI mining?”

Answering this question positively is not easy. Developing a concise and complete representation of HUIs poses several challenges:

1. Integrating concepts of concise representations from FIM into HUI mining may produce a lossy representation of all HUIs or a representation that is not meaningful to the users.

2. The representation may not achieve a significant reduction in the number of extracted patterns to justify using the representation.

3. Algorithms for extracting the representation may not be efficient. They may be slower than the best algorithms for mining all HUIs.

4. It may be hard to develop an efficient method for recovering all HUIs from the representation.

In this paper, we address all of these challenges by proposing a condensed and meaningful representation of HUIs named Closed+ High Utility Itemsets (Closed+ HUIs), which integrates the concept of closed itemset into HUI mining. Our contributions are four-fold, corresponding to the four challenges mentioned previously:

The proposed representation is lossless, thanks to a new structure named utility unit array that allows recovering all HUIs and their utilities efficiently.

The proposed representation is also compact. Experiments show that it reduces the number of itemsets by several orders of magnitude, especially for datasets containing long HUIs (up to 800 times).

We propose an efficient algorithm, named CHUD (Closed+ High Utility itemset Discovery), to find this representation. It includes three novel strategies named REG, RML and DCM that greatly enhance its performance. Results show that CHUD is much faster than the current best methods for mining all HUIs [2].

We propose a top-down method named DAHU (Derive All High Utility itemsets) for recovering all HUIs from the set of Closed+ HUIs. The combination of CHUD and DAHU provides a new way to obtain all HUIs and it outperforms UP Growth [4], the state-of-the-art algorithm for mining HUIs.

The remainder of this paper is organized as follows. In Section II, we introduce the background for compact representations and utility mining. Section III defines the representation of closed+ HUIs and presents our methods.

Existing System

In the existing system, many studies were proposed for mining HUIs, but they often present a large number of high utility itemsets to users. The system used two efficient one-pass algorithms, MHUI-BIT and MHUI-TID, for mining high utility itemsets from data streams within a transaction-sensitive sliding window. Two effective representations of item information and an extended lexicographical tree-based summary data structure were developed to improve the efficiency of mining high utility itemsets. Concise representations developed earlier successfully reduce the number of itemsets found, but they were developed for FIM instead of HUI mining.

Proposed System

This system introduces a novel framework for mining closed+ high utility itemsets (CHUIs), which serve as a compact and lossless representation of HUIs. Further, a method called DAHU (Derive All High Utility itemsets) is proposed to recover all high utility itemsets from the set of closed+ high utility itemsets without accessing the original database. Results of experiments on real and synthetic datasets show that CHUD and DAHU are very efficient and achieve a massive reduction in the number of high utility itemsets. In addition, when all high utility itemsets are recovered by DAHU, the approach combining CHUD and DAHU also outperforms the state-of-the-art algorithms in mining high utility itemsets.
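The idea of recovering HUIs without touching the database can be sketched as follows. This is a simplified brute-force illustration and not the DAHU procedure itself. It assumes that a closed+ HUI is an itemset that is closed with respect to support and has utility no less than min_utility, and that each closed+ HUI carries a utility unit array holding, per item, the utility of that item summed over the transactions containing the itemset. The values in `CLOSED_HUIS` and the name `derive_all_huis` are hypothetical.

```python
from itertools import combinations

# Hypothetical closed+ HUIs of a toy database (min_utility = 12), each with
# an assumed utility unit array: item -> utility of that item summed over
# the transactions that contain the whole closed itemset.
CLOSED_HUIS = {
    frozenset("ac"): {"a": 20, "c": 3},
    frozenset("bc"): {"b": 8, "c": 5},
    frozenset("abc"): {"a": 10, "b": 6, "c": 2},
}


def derive_all_huis(closed_huis, min_utility):
    """Recover HUIs and their utilities from closed+ HUIs only.

    For a subset Y of a closed itemset X, summing X's unit-array entries
    over the items of Y gives the utility of Y inside the transactions of X,
    which never exceeds the true utility of Y and equals it when X is the
    closure of Y. Taking the maximum over all closed supersets therefore
    yields the exact utility of every HUI.
    """
    best = {}
    for closed, unit_array in closed_huis.items():
        for size in range(1, len(closed) + 1):
            for subset in combinations(sorted(closed), size):
                value = sum(unit_array[item] for item in subset)
                key = frozenset(subset)
                best[key] = max(best.get(key, 0), value)
    return {itemset: u for itemset, u in best.items() if u >= min_utility}


ALL_HUIS = derive_all_huis(CLOSED_HUIS, min_utility=12)
for itemset, value in sorted(ALL_HUIS.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(itemset), value)
```

The sketch enumerates every subset of every closed itemset, so it is exponential in the itemset length; a top-down method such as DAHU is meant to avoid that cost.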

Related Works

In this section, we introduce the preliminaries related to high utility itemset mining and compact representations.

A. Closed Itemset Mining

In this subsection, we introduce definitions and properties related to closed itemsets and mention relevant methods.

Mining frequent closed itemsets refers to the discovery of all the closed itemsets whose supports are no less than a user-specified threshold. It is widely recognized that the number of frequent closed itemsets is much smaller than the number of frequent itemsets for real-life databases, and that mining frequent closed itemsets can also be much faster and more memory efficient than mining frequent itemsets. The set of closed itemsets is lossless since all frequent itemsets and their supports can be easily derived from it by Property 4 without scanning the original database [5-7]. Many efficient methods were proposed for mining frequent closed itemsets, such as A-Close, CLOSET+, CHARM and DCI-Closed. However, these methods do not consider the utility of itemsets. Therefore, they may present a large number of closed itemsets with low utilities to users and omit several high utility itemsets.
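Why closed itemsets are lossless can be shown with a small Python sketch. It relies on the standard fact that the support of a frequent itemset equals the largest support among its closed supersets. The toy transactions and the names `support`, `closed_itemsets` and `derived_support` are assumptions for illustration; the enumeration is brute force and is not one of the algorithms cited above.

```python
from itertools import combinations

# Hypothetical toy transactions (quantities are irrelevant for support).
TRANSACTIONS = [
    {"a", "b", "c"},
    {"a", "c"},
    {"b", "c"},
    {"a", "b", "c", "d"},
]
ITEMS = sorted(set().union(*TRANSACTIONS))


def support(itemset):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in TRANSACTIONS if set(itemset) <= t)


def all_itemsets():
    """Every non-empty itemset over ITEMS (exponential, illustration only)."""
    for size in range(1, len(ITEMS) + 1):
        for combo in combinations(ITEMS, size):
            yield frozenset(combo)


def closed_itemsets(min_support):
    """Frequent itemsets that have no proper superset with the same support."""
    frequent = {x: support(x) for x in all_itemsets() if support(x) >= min_support}
    return {
        x: s
        for x, s in frequent.items()
        if not any(x < y and s == sy for y, sy in frequent.items())
    }


def derived_support(itemset, closed):
    """Support recovered from the closed itemsets alone, without a scan."""
    return max((s for c, s in closed.items() if set(itemset) <= c), default=0)


CLOSED = closed_itemsets(min_support=1)
print(len(list(all_itemsets())), "itemsets,", len(CLOSED), "closed")
print(derived_support({"a", "b"}, CLOSED), support({"a", "b"}))
```

On this toy data the closed itemsets are far fewer than the frequent itemsets, yet every support can still be read back from them.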

B. Compact Representations of High Utility Itemsets

To present representative HUIs to users, some concise representations of HUIs were proposed. Chan et al. introduced the concept of utility frequent closed patterns [7]. However, it is based on a definition of high utility itemset that is different from [3] and our work.

Shie et al. proposed a compact representation of high utility itemsets, called maximal high utility itemsets, and the GUIDE algorithm for mining it [7]. A HUI is said to be maximal if it is not a subset of any other HUI. For example, when min_utility = 10, the set of maximal HUIs is {, }. Although this representation reduces the number of extracted HUIs, it is not lossless.

The reason is that the utilities of the subsets of a maximal HUI cannot be known without scanning the database. Besides, recovering all HUIs from maximal HUIs can be very inefficient because many subsets of a maximal HUI can be low utility. Another problem is that the GUIDE algorithm cannot capture the complete set of maximal HUIs.

Calculate the TU and Find TWU

In this phase, the transaction utilities are calculated using the AprioriCH algorithm techniques. First, find the absolute utility of each item in a transaction: multiply the quantity of the item by its unit profit, and the result is the absolute utility of that item in the transaction. The Transaction Utility (TU) is calculated as the sum of the absolute utilities of the items in the transaction. This value gives the total utility of the items in the transaction.

TU can then be used to find and analyze the transaction-weighted utilization (TWU) of all the transactions.
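The TU and TWU calculations described above can be sketched in a few lines of Python. The toy profits and quantities and the names `absolute_utility`, `transaction_utility` and `twu` are assumptions for illustration. The sketch uses the usual definition of TWU as the sum of the transaction utilities of the transactions that contain an itemset.

```python
# Hypothetical toy data, not taken from the paper.
PROFIT = {"a": 5, "b": 2, "c": 1, "d": 3}   # unit profit of each item
DATABASE = [                                 # item -> purchase quantity
    {"a": 1, "b": 2, "c": 1},
    {"a": 2, "c": 1},
    {"b": 1, "c": 3},
    {"a": 1, "b": 1, "c": 1, "d": 1},
]


def absolute_utility(item, transaction):
    """Quantity of the item in the transaction times its unit profit."""
    return transaction[item] * PROFIT[item]


def transaction_utility(transaction):
    """TU: sum of the absolute utilities of all items in the transaction."""
    return sum(absolute_utility(item, transaction) for item in transaction)


def twu(itemset):
    """TWU: sum of the TUs of the transactions that contain the itemset."""
    return sum(
        transaction_utility(t)
        for t in DATABASE
        if all(item in t for item in itemset)
    )


TU = [transaction_utility(t) for t in DATABASE]
TWU = {item: twu({item}) for item in PROFIT}
print(TU)
print(TWU)
```

Because a transaction's TU counts all of its items, TWU never underestimates the utility of an itemset, which is why it can serve as a pruning bound.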

Fig.1 System Design. (Diagram blocks: Server; Discovery of Item Set; Transaction Discovery; Transaction Analysis; TU Analysis; Find Absolute Utility; Find Transaction Utility; Find HUI; Closed Itemset; Find Closed High Utility Itemset; Profit Analysis; Graphical Notation.)



CLOSED+ HIGH UTILITY ITEMSET MINING

In this section, we incorporate the concept of closed itemset with high utility itemset mining to develop a representation named closed+ high utility itemset. We theoretically prove that this new representation is meaningful, lossless and not larger than the set of all HUIs.

A. Pushing Closed Property into HUI Mining

The first point that we should discuss is how to incorporate the closed constraint into high utility itemset mining. First, we can define the closure on the utility of itemsets. In this case, a high utility itemset is said to be closed if it has no proper superset having the same utility. However, this definition is unlikely to achieve a high reduction in the number of extracted itemsets, since not many itemsets have exactly the same utility as their supersets in real datasets. For example, there are seven HUIs in Example 1 and only one itemset is non-closed, since u() = u() = 12.

A second possibility is to define the closure on the supports of itemsets. In this case, there are two possible definitions, depending on the join order between the closed constraint and the utility constraint.
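The difference between the two ways of defining closure can be made concrete with a brute-force Python sketch. It assumes that the support-based option keeps a HUI only if no proper superset occurs in exactly the same transactions. The toy database, the threshold of 12 and all names below are hypothetical and serve only as an illustration.

```python
from itertools import combinations

# Hypothetical toy data, not taken from the paper.
PROFIT = {"a": 5, "b": 2, "c": 1, "d": 3}
DATABASE = [
    {"a": 1, "b": 2, "c": 1},
    {"a": 2, "c": 1},
    {"b": 1, "c": 3},
    {"a": 1, "b": 1, "c": 1, "d": 1},
]
ITEMS = sorted(PROFIT)


def all_itemsets():
    """Every non-empty itemset over ITEMS (exponential, illustration only)."""
    for size in range(1, len(ITEMS) + 1):
        for combo in combinations(ITEMS, size):
            yield frozenset(combo)


def tidset(itemset):
    """Indices of the transactions that contain the whole itemset."""
    return frozenset(
        tid for tid, t in enumerate(DATABASE) if all(i in t for i in itemset)
    )


def utility(itemset):
    return sum(
        DATABASE[tid][item] * PROFIT[item]
        for tid in tidset(itemset)
        for item in itemset
    )


def is_support_closed(itemset):
    """True if no proper superset occurs in exactly the same transactions."""
    return not any(
        itemset < other and tidset(other) == tidset(itemset)
        for other in all_itemsets()
    )


MIN_UTILITY = 12
HUIS = {x: utility(x) for x in all_itemsets() if utility(x) >= MIN_UTILITY}

# Option 1: closure on utility (drop a HUI only if a superset has equal utility).
UTILITY_CLOSED = {
    x: u for x, u in HUIS.items()
    if not any(x < y and u == HUIS[y] for y in HUIS)
}
# Option 2: closure on support, combined with the utility constraint.
CLOSED_PLUS_HUIS = {x: u for x, u in HUIS.items() if is_support_closed(x)}

print(len(HUIS), len(UTILITY_CLOSED), len(CLOSED_PLUS_HUIS))
```

On this toy data the utility-based closure removes nothing, while the support-based closure keeps three of the five HUIs, in line with the argument above.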

B. Efficient Discovery of Closed+ High Utility Itemsets

In this subsection, we present an efficient algorithm named CHUD (Closed+ High Utility itemset Discovery) for mining closed+ HUIs. CHUD is an extension of DCI-Closed [4], one of the best methods for mining closed itemsets, and it also integrates the TWU model and effective strategies to prune low utility itemsets. CHUD consists of two phases. In Phase I, CHUD discovers candidates for closed+ HUIs. In Phase II, the closed+ HUIs are identified from the set of candidates found in Phase I and their utility unit arrays are computed by scanning the database once.

Similar to the DCI-Closed algorithm, CHUD adopts an IT-Tree (Itemset-Tidset pair Tree) to find closed+ HUIs. In an IT-Tree, each node N(X) consists of an itemset X, its Tidset g(X), and two ordered sets of items named PREV-SET(X) and POST-SET(X). The IT-Tree is recursively explored by the CHUD algorithm until all closed itemsets that are HTWUIs (high transaction-weighted utilization itemsets) are generated. Different from the DCI-Closed algorithm, each node N(X) of the IT-Tree is attached with an estimated utility value EstU(X).

A data structure called TU-Table (Transaction Utility Table) [3] is adopted for storing the transaction utilities of transactions. It is a list of pairs <R, TU(TR)>, where the first value is a TID R and the second value is the transaction utility of TR. Given a TID R, the value TU(TR) can be efficiently retrieved from the TU-Table. Given a node N(X) with its Tidset g(X) and a TU-Table TU, the estimated utility of the itemset X can be efficiently calculated by the procedure shown in Figure 1.
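A minimal sketch of such a procedure is given below. It assumes that the estimated utility of X is the sum of the transaction utilities of the transactions listed in g(X), which matches the TWU model mentioned above; the referenced figure is not reproduced in this text, so the exact procedure may differ. The function name `estimated_utility` and the table contents are hypothetical.

```python
from typing import Dict, Iterable


def estimated_utility(tidset: Iterable[int], tu_table: Dict[int, int]) -> int:
    """EstU(X): add up TU(T_R) for every TID R in the Tidset g(X)."""
    return sum(tu_table[tid] for tid in tidset)


# Hypothetical TU-Table: TID -> transaction utility.
TU_TABLE = {1: 10, 2: 11, 3: 5, 4: 11}

# Hypothetical Tidset of an itemset X that occurs in transactions 1 and 4.
G_X = {1, 4}
print(estimated_utility(G_X, TU_TABLE))
```

Storing the TU-Table as a mapping keyed by TID makes each lookup constant time, so the cost of the estimate is linear in the size of the Tidset.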

The main procedure of CHUD is named Main and is shown in Figure 2. It takes as parameters a database D and the min_utility threshold. CHUD first scans D once to convert D into a vertical database. At the same time, CHUD computes the transaction utility of each transaction TR and calculates the TWU of items. When a transaction is retrieved, its Tid and transaction utility are loaded into a global TU-Table named GTU. An item is called a promising item if its estimated utility (e.g. its TWU) is no less than min_utility. After the first scan of the database, promising items are collected into an ordered list O = <a1, a2, …, an>, sorted according to a fixed order ≺ such as increasing order of support. Only promising items are kept in O since supersets of unpromising items are low utility itemsets. According to [7], the utilities of unpromising items are removed from the GTU table. This step is performed at line 2 of the Main procedure. Then, CHUD generates candidates in a recursive manner, starting from candidates containing a single promising item and recursively joining items to them to form larger candidates. To do so, CHUD takes advantage of the fact that, by using the total order ≺, the complete set of itemsets can be divided into n non-overlapping subspaces, where the k-th subspace is the set of itemsets containing the item ak but no item ai ≺ ak [4]. For each item ak ∈ O, CHUD creates a node N({ak}) and puts items a1 to ak-1 into PREV-SET({ak}) and items ak+1 to an into POST-SET({ak}). Then CHUD calls the CHUD Phase-I procedure for each node N({ak}) to produce all the candidates containing the item ak but no item ai ≺ ak. Finally, the Main procedure performs Phase II on these candidates to obtain all closed+ HUIs.
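The first scan and the creation of the initial nodes can be sketched as follows. This is an illustrative reading of the description above, not the authors' implementation: the toy data, the tie-breaking rule for items with equal support, and the names `first_scan` and `initial_nodes` are assumptions.

```python
# Hypothetical toy data, not taken from the paper.
PROFIT = {"a": 5, "b": 2, "c": 1, "d": 3}
DATABASE = [
    {"a": 1, "b": 2, "c": 1},
    {"a": 2, "c": 1},
    {"b": 1, "c": 3},
    {"a": 1, "b": 1, "c": 1, "d": 1},
]


def first_scan(database, profit, min_utility):
    """Build the vertical database, the global TU-Table and the ordered
    list O of promising items in one pass over the transactions."""
    vertical = {}   # item -> set of TIDs (Tidset)
    gtu = {}        # TID -> transaction utility
    for tid, transaction in enumerate(database, start=1):
        gtu[tid] = sum(qty * profit[item] for item, qty in transaction.items())
        for item in transaction:
            vertical.setdefault(item, set()).add(tid)

    twu = {item: sum(gtu[tid] for tid in tids) for item, tids in vertical.items()}

    # Remove the utilities of unpromising items from the global TU-Table.
    for item, tids in vertical.items():
        if twu[item] < min_utility:
            for tid in tids:
                gtu[tid] -= database[tid - 1][item] * profit[item]

    # Increasing support, ties broken alphabetically (an assumption).
    order = sorted(
        (item for item in vertical if twu[item] >= min_utility),
        key=lambda item: (len(vertical[item]), item),
    )
    return {item: vertical[item] for item in order}, gtu, order


def initial_nodes(order):
    """One node per promising item a_k, with PREV-SET = a_1..a_(k-1)
    and POST-SET = a_(k+1)..a_n."""
    return [
        {"itemset": (item,), "prev_set": order[:k], "post_set": order[k + 1:]}
        for k, item in enumerate(order)
    ]


VERTICAL, GTU, ORDER = first_scan(DATABASE, PROFIT, min_utility=12)
NODES = initial_nodes(ORDER)
print(ORDER, GTU)
```

Each node would then be handed to the Phase I search, which is not sketched here.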

Pseudocode: HUIMining

Ck: candidate itemsets of size k

Lk: frequent itemsets of size k

L1 = {frequent items};

for (k = 1; Lk != ∅; k++) do begin

    Ck+1 = candidates generated from Lk;

    for each transaction t in database do

        increment the count of all candidates in Ck+1 that are contained in t

    Lk+1 = candidates in Ck+1 with min_support

end

return ∪k Lk;
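The level-wise loop above can be written as a short runnable Python sketch. It follows the pseudocode's support counting and therefore finds frequent itemsets; it does not compute utilities. The toy transactions and the function name `apriori` are assumptions for illustration.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Level-wise frequent itemset mining: generate C(k+1) from L(k),
    count each candidate in the database, keep those with min_support."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    singletons = {frozenset([item]) for t in transactions for item in t}
    current = {c: s for c, s in count(singletons).items() if s >= min_support}
    frequent = dict(current)
    k = 1
    while current:
        # Join step: unions of two frequent k-itemsets that have size k + 1.
        candidates = {a | b for a in current for b in current if len(a | b) == k + 1}
        # Prune step: every k-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k))
        }
        current = {c: s for c, s in count(candidates).items() if s >= min_support}
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical toy transactions.
TRANSACTIONS = [
    {"a", "b", "c"},
    {"a", "c"},
    {"b", "c"},
    {"a", "b", "c", "d"},
]
FREQUENT = apriori(TRANSACTIONS, min_support=2)
for itemset, sup in sorted(FREQUENT.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)
```

The prune step relies on the downward closure property of support, which is the property that does not carry over to utility.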

In this section, we compare the performance of CHUD and DAHU with UP Growth [2], which is, to the best of our knowledge, the state-of-the-art method for high utility itemset mining. Although CHUD and UP Growth produce different results, both of them consist of two phases. In Phase I, CHUD and UP Growth respectively generate candidates for CHUIs and HUIs. In Phase II, CHUD and UP Growth respectively identify CHUIs and HUIs from the candidates produced in their Phase I. The combination of CHUD and DAHU is denoted as CHUD+DAHU, which first applies CHUD to find all closed+ high utility itemsets and then uses DAHU to derive all high utility itemsets from the set of closed+ high utility itemsets generated by CHUD. The process of CHUD+DAHU in Phase I is the same as that of CHUD. In Phase II, CHUD+DAHU first identifies CHUIs from the set of candidates and then uses CHUIs to derive all HUIs.

Experiments were performed on a desktop computer with an Intel® Core 2 Quad Processor @ 2.66 GHz running Windows XP and 2 GB of RAM. CHUD and DAHU were implemented in Java. The implementation of UP Growth was obtained from Tseng et al. [2], and is also implemented in Java. All memory measurements were done using the Java API. The real datasets Mushroom and BMSWebView1 were obtained from the FIMI Repository [2]. Foodmart is a real dataset obtained from the Microsoft foodmart 2000 database. Except for the Foodmart dataset, the external and internal utilities of each item were generated with the settings used in [2]. Foodmart already contains unit profits and purchase quantities of items. The total utility of Foodmart is 120,160.84. Table IV shows the characteristics of the above datasets. Mushroom is a real-life dense dataset, each transaction containing 23 items. Foodmart is a real-life sparse dataset from a retail store, with real utility values. BMSWebView1 is a real-life sparse dataset of click-stream data with a mix of short and long transactions (up to 267 items). T10I8D200K is a large sparse dataset with an average transaction length of 10.



A. Experiments on Mushroom Dataset

The first experiment consisted of running UP Growth, CHUD, and DAHU on the Mushroom dataset, while varying min_utility from 100% to 1%. The execution time of UP Growth, CHUD, and CHUD+DAHU is shown in Figure 8 for Phase I and Phase II. Results show that CHUD outperforms UP Growth for both phases, and the performance gap increased as min_utility was set lower. For example, when min_utility = one hundred and twenty fifth, CHUD is 50 times faster than UP Growth for Phase I and 63 times faster for Phase II. Moreover, when CHUD is combined with DAHU to discover all high utility itemsets, the combination largely outperforms UP Growth and is only slightly slower than CHUD. The smaller number of candidates generated by CHUD in Phase I is what makes CHUD perform better than UP Growth in Phase II and for the total execution time (because Phase II is much more costly than Phase I [20]). Lastly, we measured the reduction achieved by the representation of closed+ high utility itemsets generated by CHUD compared to the set of all high utility itemsets generated by UP Growth. As shown in Table V, a massive reduction is obtained (up to 796 times). Moreover, by running DAHU, it is possible to recover all high utility itemsets.

B. Experiments on Foodmart Dataset

The second experiment consists of running UP Growth, CHUD and DAHU on the Foodmart dataset, while varying min_utility from 0.10% to 0.005% of the total utility in the database. Execution times for Phase I and Phase II are shown in Figure 9. The total execution time of UP Growth is initially less than that of CHUD. But as the min_utility threshold becomes smaller, CHUD becomes faster (up to 2 times faster than UP Growth).

The reason why the performance gap between CHUD and UP Growth is smaller for Foodmart than for Mushroom is that Foodmart is a sparse dataset. As a consequence, the reduction achieved by mining closed+ high utility itemsets is smaller. Note that achieving a smaller reduction for sparse datasets is a well-known phenomenon in frequent closed itemset mining. A similar phenomenon occurs in closed+ HUI mining. Besides, when DAHU was combined with CHUD, the execution time of CHUD+DAHU was up to 2 times faster than UP Growth for low minimum utility thresholds and slightly slower than CHUD.

Fig.2 Mining closed High U…



C. Experiments on BMSWebView1 Dataset

The third experiment consists of running UP Growth, CHUD and CHUD+DAHU on BMSWebView1 while varying min_utility from 100% to 1% of the total utility of the database. For min_utility = 2%, UP Growth cannot terminate within the time limit of 100,000 seconds and it generates more than 1,000,000 candidates in Phase I, whereas CHUD terminates in 80 seconds and produces only 7 closed+ HUIs from 32 candidates. The reason why CHUD performs so well is that it achieves a massive reduction in the number of candidates by only generating a few long itemsets containing up to 149 items, whereas UP Growth has to consider a huge number of redundant subsets (for a closed itemset of 149 items, there can be up to 2^149 - 2 non-empty proper subsets that are redundant). DAHU also suffers from the fact that there are too many HUIs. It runs out of memory for min_utility < 2% when trying to recover all HUIs because it has to generate too many subsets.

D. Experiments on Synthetic Dataset

The fourth experiment is to run the algorithms on T12I8D200K with min_utility varied from 0.1% to 0.02% of the total utility of the database. Results are presented in Figure 11 and Table VIII. For this dataset, CHUD is faster than UP Growth for the total execution time. Although the reduction on this synthetic dataset is not as good (since it produced the same result as UP Growth), CHUD is faster because it generates about three times fewer candidates in Phase I. CHUD takes more time to generate candidates in Phase I. However, the total execution time of CHUD is less than that of UP Growth because Phase II is more costly than Phase I. CHUD+DAHU also outperforms UP Growth, since DAHU only spends one second to derive all HUIs.

Conclusion

In this paper, we addressed the problem of redundancy in high utility itemset mining by proposing a compact representation of all high utility itemsets named closed+ high utility itemsets. To our knowledge, this is the first study on a compact and lossless representation of high utility itemsets. To mine this new type of itemsets, we proposed an efficient algorithm named CHUD. Three effective strategies named REG, RML and DCM were further proposed to enhance the performance of CHUD. To efficiently recover all high utility itemsets from this representation, we proposed a top-down method named DAHU. Real and synthetic datasets with varied characteristics were used to perform a thorough performance evaluation. Results show that the proposed representation achieves a massive reduction in the number of high utility itemsets (e.g. a reduction of up to 800 times for Mushroom and 32 times for Foodmart datasets). Besides, CHUD outperforms UP Growth, the current best algorithm, by several orders of magnitude under low minimum utility thresholds (e.g. CHUD terminates in 80 seconds on BMSWebView1 for min_utility = 2%, whereas UP Growth cannot terminate within 24 hours). The combination of CHUD and DAHU is also faster than UP Growth when DAHU can be applied.

References

1. R. Agrawal and R. Srikant, “Fast algorithms for mining association rules,” in Proc. 20th Int. Conf. Very Large Data Bases, 1994, pp. 487–499.

2. C. F. Ahmed, S. K. Tanbeer, B.-S. Jeong, and Y.-K. Lee, “Efficient tree structures for high utility pattern mining in incremental databases,” IEEE Trans. Knowl. Data Eng., vol. 21, no. 12, pp. 1708–1721, Dec. 2009.

3. J.-F. Boulicaut, A. Bykowski, and C. Rigotti, “Free-sets: A condensed representation of Boolean data for the approximation of frequency queries,” Data Mining Knowl. Discovery, vol. 7, no. 1, pp. 5–22, 2003.

4. T. Calders and B. Goethals, “Mining all non-derivable frequent itemsets,” in Proc. Int. Conf. Eur. Conf. Principles Data Mining Knowl. Discovery, 2002, pp. 74–85.

5. K. Chuang, J. Huang, and M. Chen, “Mining top-k frequent patterns in the presence of the memory constraint,” VLDB J., vol. 17, pp. 1321–1344, 2008.

6. R. Chan, Q. Yang, and Y. Shen, “Mining high utility itemsets,” in Proc. IEEE Int. Conf. Data Min., 2003, pp. 19–26.

7. A. Erwin, R. P. Gopalan, and N. R. Achuthan, “Efficient mining of high utility itemsets from large datasets,” in Proc. Int. Conf. Pacific-Asia Conf. Knowl. Discovery Data Mining, 2008, pp. 554–561.

Corresponding Author:

R. Divya*,

Email: [email protected]
