Comprehensive Study on High Utility Itemsets Mining with Various Approaches and its Applications

1M. Suneetha and 2M.V.P. Chandra Sekhara Rao

1Research Scholar, Acharya Nagarjuna University. Assistant Professor, Department of IT, GMRIT, Rajam, Srikakulam, Andhra Pradesh, India.

2Professor, Department of CSE, RVR & JC College of Engineering, Guntur, AP, India.

Abstract

Discovering high utility itemsets in transaction databases is a popular data mining task. High utility itemset mining (HUIM) addresses the limitations of frequent itemset mining by introducing measures of interestingness that reflect the significance of an itemset beyond its frequency of occurrence. The HUIM problem uses the internal and external utilities of items to discover interesting patterns in a given transactional database. This paper provides a survey of HUIM algorithms, including Apriori-based, tree-based, projection-based and hybrid approaches. Experimental evaluations on both dense and sparse datasets show the performance of the HUIM algorithms FHM, Two-Phase and HUI-Miner in terms of execution time, memory usage and number of candidates. Finally, the characteristics and limitations of the approaches are highlighted.

Key Words: minutil, itemsets, high utility itemset.

International Journal of Pure and Applied Mathematics, Special Issue, Volume 119 No. 15 2018, 7-17. ISSN: 1314-3395 (on-line version). url: http://www.acadpubl.eu/hub/

1. Introduction

In real-world applications, data mining (DM) techniques are used to discover hidden and previously unknown knowledge in databases in order to support strategic decisions. A basic example is the supermarket, where DM techniques identify products that are frequently purchased together so that they can be promoted. Two fundamental tasks are Frequent Itemset Mining (FIM) and Association Rule Mining (ARM) [4]. FIM finds the itemsets whose number of occurrences is not less than a given threshold, and so helps reveal hidden associations among the items in a database. The basic FIM algorithms use a candidate generation strategy to enumerate the possible frequent itemsets and scan the database to determine which of them are frequent. Many extensions have been proposed. One of them is the pattern-growth approach [6], which takes two database scans to discover frequent itemsets without generating candidate itemsets. To address the issue of uninteresting patterns, constraint-based pattern mining algorithms [13] were introduced. Although traditional FIM algorithms are popular, they can only reveal knowledge in binary databases, where an item is either present in a transaction or not. Quantitative Association Rule Mining (QARM) was proposed as an alternative, and extensions of it have also been proposed [7]. FIM, ARM and QARM do not consider whether the itemsets are profitable. Frequent Pattern Mining (FPM) algorithms successfully extract patterns from transactional databases, but they have limitations: they assume that every item occurs with a frequency of one, whereas real applications involve more than one attribute as well as the order among the items. FPM techniques use a support-based framework to derive frequent patterns.

However, the value of a pattern to the organization, which is what matters in practice, is not derived in FPM. For example, FIM-based algorithms may miss products such as diamonds and cars, which are sold rarely but yield high profit. As another example, a sales analyst doing retail research who needs to find out which itemsets earn the maximum sales revenue for the stores will define the utility of an itemset as the monetary profit that the store earns by selling each unit of that itemset.

To address these limitations of FIM, utility is introduced into FPM to mine patterns whose utility (for example, profit) is high. This task is called High Utility Pattern Mining [16].

2. High Utility Itemset Mining (HUIM)

High Utility Itemset Mining takes a transaction database as input, in which each item in a transaction is associated with a quantity, called its internal utility, and each item is also associated with a quality or profit value, called its external utility. Utility can be measured in terms of cost, quality, profit or other expressions of user preference. The utility of an item is the product of its quantity and its profit in the database.


The utility of an itemset is the sum of the utilities of its items in the database. The main task of High Utility Itemset Mining (HUIM) is to discover the itemsets whose utility is not less than a user-specified utility threshold. To illustrate HUIM, consider the retail store example presented in Table 1. The database contains four transactions. The first column gives the transaction id (TID), and the second column lists the purchased items, drawn from {a, b, c, d}, each paired with the quantity purchased. Table 2 shows the profit value of each item. For example, the first column of Table 2 shows that item a is associated with a profit of 5.

Table 1: Sample Transaction Database

TID Items with Quantity

1 (a,2)(b,1)(d,2)

2 (b,2)(c,1)

3 (a,1)(b,2)(c,3)

4 (b,1)(c,1)(d,2)

Table 2: External Utility

Item a b c d

Profit 5 4 2 3

Utility Framework

This section starts with the problem statement and then defines the terminology, techniques and algorithms that are used for deriving high utility itemsets.

Problem Statement:

Let $I = \{i_1, i_2, \ldots, i_n\}$ be a set of distinct items. An itemset is a nonempty set that contains one or more items, denoted $X = \{x_1, x_2, \ldots, x_k\}$ where $x_j \in I$ for all $j = 1, 2, \ldots, k$. The size of an itemset, denoted $|X|$, is the number of items in it. For simplicity, the braces are omitted when an itemset contains a single item, and the items of an itemset are listed in lexicographical order.

Each item i in a transaction of the database is associated with a quantity, called its internal utility. Each item is also associated with a quality or per-unit profit, called its external utility. A k-itemset is an itemset with k items. The utility of an item i is the product of its internal utility and its external utility. The utility of an itemset X is the sum of the utility values of the items of X. An itemset X is a high utility itemset if the utility of X is not less than minutil. The main task of High Utility Itemset Mining is to find all high utility itemsets.

The definitions and notation for the basic terminology are as follows.

The Internal Utility IU(i, T) is the quantity associated with item i in a transaction T. For example, item a in TID 1 of Table 1 is associated with quantity 2, hence IU(a, T1) = 2. The External Utility EU(i) is the profit value associated with item i in the utility table, Table 2. For example, item a in Table 2 is associated with profit 5, hence EU(a) = 5.



The Utility of an Item UI(i, T) is the product of the internal and external utility values of item i in transaction T. For example, for item b in T2, UI(b, T2) = IU(b, T2) × EU(b) = 2 × 4 = 8. The utility of an itemset X in a transaction T is denoted UT(X, T) and defined as the sum of the utility values of all the items of X in T. For example, for itemset <ab> in T1, UT(<ab>, T1) = UI(a, T1) + UI(b, T1) = 10 + 4 = 14.
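As a minimal sketch, the item and itemset utility definitions can be checked in Python against Tables 1 and 2. The dictionary layout and the names `DB`, `PROFIT`, `item_utility` and `itemset_utility_in_transaction` are illustrative assumptions, not part of the paper.

```python
# Table 1: transaction id -> {item: purchased quantity (internal utility)}
DB = {
    1: {"a": 2, "b": 1, "d": 2},
    2: {"b": 2, "c": 1},
    3: {"a": 1, "b": 2, "c": 3},
    4: {"b": 1, "c": 1, "d": 2},
}
# Table 2: item -> unit profit (external utility)
PROFIT = {"a": 5, "b": 4, "c": 2, "d": 3}


def item_utility(item, tid):
    """UI(i, T): quantity of the item in transaction tid times its unit profit."""
    return DB[tid].get(item, 0) * PROFIT[item]


def itemset_utility_in_transaction(itemset, tid):
    """UT(X, T): sum of the item utilities, or 0 if X is not contained in T."""
    if not set(itemset).issubset(DB[tid]):
        return 0
    return sum(item_utility(item, tid) for item in itemset)


print(item_utility("b", 2))                     # 2 * 4 = 8
print(itemset_utility_in_transaction("ab", 1))  # 10 + 4 = 14
```

Treating an itemset that is absent from a transaction as having utility 0 is a convention of this sketch; it keeps sums over all transactions simple.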

The utility of X in the database is denoted UT(X, db) and defined as the sum of the utility values of the itemset over the transactions that contain it: $UT(X, db) = \sum_{X \subseteq T \wedge T \in db} UT(X, T)$. For example, the utility of a in db is UT({a}, db) = UT(a, T1) + UT(a, T2) + UT(a, T3) + UT(a, T4) = 2 × 5 + 0 + 1 × 5 + 0 = 15. HUIM is defined as the process of deriving the itemsets whose utility is not less than the minutil given by the user. Unlike the support measure used in FIM, utility satisfies neither the downward closure property nor the anti-monotonic property. Hence a larger search space is needed to maintain all possible itemsets.
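The lack of downward closure can be seen directly on the sample database. The sketch below (function and variable names are assumptions made for illustration) computes UT(X, db) and shows that the utility can rise and then fall as items are added to an itemset.

```python
# Table 1 (quantities) and Table 2 (unit profits) of the running example.
DB = {
    1: {"a": 2, "b": 1, "d": 2},
    2: {"b": 2, "c": 1},
    3: {"a": 1, "b": 2, "c": 3},
    4: {"b": 1, "c": 1, "d": 2},
}
PROFIT = {"a": 5, "b": 4, "c": 2, "d": 3}


def utility(itemset):
    """UT(X, db): total utility of X over the transactions that contain X."""
    items = set(itemset)
    total = 0
    for quantities in DB.values():
        if items.issubset(quantities):
            total += sum(quantities[i] * PROFIT[i] for i in items)
    return total


print(utility("a"))    # 15
print(utility("ab"))   # 14 (T1) + 13 (T3) = 27
print(utility("abc"))  # 19 (T3 only)
```

Here {a, b} has a higher utility than its subset {a}, while {a, b, c} has a lower utility than {a, b}, so a low-utility itemset cannot be used to discard its supersets.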

To reduce the search space, the utility of a transaction is used to derive an upper bound on the utility of an itemset in a database db. The utility of a transaction is denoted UT(T) and defined as the sum of the utilities of all the items present in transaction T. In other words, $UT(T) = \sum_{j=1}^{m} UI(i_j, T)$, where $i_1, \ldots, i_m$ are the items of T. For example, UT(T1) = UI(a, T1) + UI(b, T1) + UI(d, T1) = 2 × 5 + 1 × 4 + 2 × 3 = 20. The revised transaction database with transaction utilities is presented in Table 3.

Table 3: Sample Transaction Database with Transaction Utility

TID Items with Quantity Transaction Utility

1 (a,2)(b,1)(d,2) 20

2 (b,2)(c,1) 10

3 (a,1)(b,2)(c,3) 19

4 (b,1)(c,1)(d,2) 12
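The Transaction Utility column of Table 3 can be reproduced with a few lines of Python. This is only a sketch over the sample data; `transaction_utility` is a name chosen here for illustration.

```python
# Table 1 (quantities) and Table 2 (unit profits) of the running example.
DB = {
    1: {"a": 2, "b": 1, "d": 2},
    2: {"b": 2, "c": 1},
    3: {"a": 1, "b": 2, "c": 3},
    4: {"b": 1, "c": 1, "d": 2},
}
PROFIT = {"a": 5, "b": 4, "c": 2, "d": 3}


def transaction_utility(transaction):
    """UT(T): sum of quantity * unit profit over all items of the transaction."""
    return sum(qty * PROFIT[item] for item, qty in transaction.items())


for tid, transaction in DB.items():
    print(tid, transaction_utility(transaction))
# 1 20
# 2 10
# 3 19
# 4 12
```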

The utility of the database db is denoted UT(db) and defined as the sum of the utilities of all the transactions in the database. In other words, $UT(db) = \sum_{T \in db} UT(T, db)$, where UT(T, db) is the utility of transaction T in db. For example, UT(db) = UT(T1, db) + UT(T2, db) + UT(T3, db) + UT(T4, db) = 20 + 10 + 19 + 12 = 61. An itemset X is called a high utility itemset iff it satisfies the condition $UT(X, db) \geq minutil \times UT(db)$. In other words, an itemset is high if its utility in the database is not less than the threshold obtained from minutil, which is given as a fraction of the database utility.

For example, for minutil = 10% the threshold is 0.1 × 61 = 6.1, and UT({a}, db) = UT(a, T1) + UT(a, T3) = 10 + 5 = 15 > 6.1, hence {a} is high.
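To make the definition concrete, the following brute-force sketch enumerates every itemset of the sample database and keeps those that meet the threshold. It is exponential in the number of items and is meant only to illustrate the problem, not any of the surveyed algorithms. The function names and the 40% setting are assumptions chosen for the example.

```python
from itertools import combinations

# Table 1 (quantities) and Table 2 (unit profits) of the running example.
DB = {
    1: {"a": 2, "b": 1, "d": 2},
    2: {"b": 2, "c": 1},
    3: {"a": 1, "b": 2, "c": 3},
    4: {"b": 1, "c": 1, "d": 2},
}
PROFIT = {"a": 5, "b": 4, "c": 2, "d": 3}


def utility(itemset):
    """Utility of the itemset summed over the transactions that contain it."""
    items = set(itemset)
    return sum(
        sum(t[i] * PROFIT[i] for i in items)
        for t in DB.values()
        if items.issubset(t)
    )


def database_utility():
    """Sum of all transaction utilities."""
    return sum(q * PROFIT[i] for t in DB.values() for i, q in t.items())


def high_utility_itemsets(minutil_ratio):
    """Exhaustively test every non-empty itemset against minutil * UT(db)."""
    threshold = minutil_ratio * database_utility()
    found = {}
    for k in range(1, len(PROFIT) + 1):
        for itemset in combinations(sorted(PROFIT), k):
            u = utility(itemset)
            if u >= threshold:
                found[itemset] = u
    return found


print(database_utility())          # 61
print(high_utility_itemsets(0.4))  # {('a', 'b'): 27, ('b', 'c'): 30}
```

With minutil = 10% (threshold 6.1), 13 of the 15 possible itemsets qualify, because every itemset that occurs in at least one transaction already has a utility of 8 or more.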



Problem Definition:

For a given transactional database TDB, an external utility table and a user-specified utility threshold min_util, the aim of HUIM is to discover the itemsets whose utility is greater than or equal to min_util. HUIM does not support the pruning strategies of FIM. To improve the performance of HUIM, the Weighted Transaction Utility is used. The Weighted Transaction Utility of an itemset X is denoted WTU(X, TDB) and defined as the sum of the utilities of the transactions that contain X. In other words, $WTU(X, TDB) = \sum_{X \subseteq T \wedge T \in TDB} UT(T, TDB)$. WTU is introduced to remove unnecessary itemsets. An itemset X is a High Weighted Transaction Utility itemset, denoted HWTU(X), if WTU(X, TDB) is not less than minutil, that is, $WTU(X, TDB) \geq minutil \times UT(TDB)$. WTU serves as an upper bound on the utility of an itemset: the utility of an itemset never exceeds its WTU. For example, for itemset {a, b} in TDB and minutil = 10%, WTU({a, b}, TDB) = 20 + 19 = 39 ≥ 6.1, so it is said to be high. If the WTU of an itemset is less than minutil, then none of its supersets is high.
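The two properties of WTU used for pruning, that it bounds the utility from above and that it cannot grow when items are added, can be verified on the sample database. The sketch below uses illustrative names (`wtu`, `utility`) that are assumptions of this example.

```python
from itertools import combinations

# Table 1 (quantities) and Table 2 (unit profits) of the running example.
DB = {
    1: {"a": 2, "b": 1, "d": 2},
    2: {"b": 2, "c": 1},
    3: {"a": 1, "b": 2, "c": 3},
    4: {"b": 1, "c": 1, "d": 2},
}
PROFIT = {"a": 5, "b": 4, "c": 2, "d": 3}


def transaction_utility(t):
    return sum(q * PROFIT[i] for i, q in t.items())


def utility(itemset):
    items = set(itemset)
    return sum(
        sum(t[i] * PROFIT[i] for i in items)
        for t in DB.values()
        if items.issubset(t)
    )


def wtu(itemset):
    """WTU(X): sum of the utilities of the transactions that contain X."""
    items = set(itemset)
    return sum(transaction_utility(t) for t in DB.values() if items.issubset(t))


all_itemsets = [c for k in range(1, 5) for c in combinations(sorted(PROFIT), k)]

print(wtu("ab"))                                         # 20 + 19 = 39
print(all(utility(x) <= wtu(x) for x in all_itemsets))   # True
print(wtu("cd"), [wtu(s) for s in ("acd", "bcd", "abcd")])  # 12 [0, 12, 0]
```

With a threshold of 24.4 (40% of 61), {c, d} has WTU 12 and can be discarded together with all of its supersets without computing their exact utilities.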

3. HUIM Algorithms

Some of the techniques in HUIM are listed and classified based on their

methodology.

i. Apriori based approaches

ii. Tree based approaches

iii. Projection based approaches

iv. Hybrid approaches

Apriori based Approaches

The basic idea of HUIM is inspired by the algorithm proposed by Chan et al. [2]. Yao et al. [16] designed a utility model that considers purchase quantities and profits to find high utility itemsets. The model is named Mining with Expected Utility (MEU). Because utility does not satisfy the downward closure property, the MEU model leads to huge memory consumption. To address this issue, Liu et al. 2005 [12] proposed the Two-Phase algorithm. In the first phase, it discovers all the candidate high utility itemsets level by level using a candidate generation technique. In the second phase, it scans the database to calculate the exact utility of the candidates and outputs the high utility itemsets. One of the main features of the Two-Phase algorithm is the Transaction Weighted Downward Closure (TWDC) property, which is introduced to reduce the search space by using the transaction-weighted utility as an upper bound on the utility of an itemset. TWDC helps eliminate some of the unnecessary itemsets. However, a huge number of candidate itemsets still has to be maintained. The reason is that the bound is calculated from all the items that appear together with the itemset in its transactions, which makes it a loose upper bound. As a result, the algorithm may suffer from slow execution and high memory usage. Apriori-based algorithms and their limitations are presented in Table 4.



Table 4: Apriori based Approaches of HUIM

Year | Algorithm | Approach | Limitations

2003 | Top-k objective directed data mining [1] | Top-K objective-directed data mining, which focuses on mining the top-K high utility closed patterns that directly support a given business objective. | Since it has no mathematical model, it is not guaranteed to find all high utility itemsets. It is too specific.

2004 | MEU [16] | Mining with Expected Utility (MEU): utility model with an upper bound. | Does not support the downward closure property, which leads to huge space requirements.

2006 | UMining, UMining-H [15] | UMining is based on Apriori with pruning strategy 1. UMining-H is based on Apriori with pruning strategy 2. | Since they are Apriori-based approaches without downward closure, they may scan the database several times with too many candidate itemsets.

2005 | Two-Phase Algorithm [12] | Two-Phase finds HUIs in two phases, where all the candidates are explored in one phase and checked in the other phase. | Although TWDC is used to reduce the search space, it is a level-wise approach and may suffer from a combinatorial explosion of the search space.
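The two phases described for the Two-Phase algorithm can be sketched as follows. This is a simplified reading of the description above, run on the sample database of Table 1; the function name `two_phase`, the parameter `minutil_ratio` and the Apriori-style join are assumptions of this sketch rather than details taken from [12].

```python
from itertools import combinations

# Table 1 (quantities) and Table 2 (unit profits) of the running example.
DB = {
    1: {"a": 2, "b": 1, "d": 2},
    2: {"b": 2, "c": 1},
    3: {"a": 1, "b": 2, "c": 3},
    4: {"b": 1, "c": 1, "d": 2},
}
PROFIT = {"a": 5, "b": 4, "c": 2, "d": 3}


def two_phase(db, profit, minutil_ratio):
    # Transaction utilities and the absolute threshold minutil * UT(db).
    tu = {tid: sum(q * profit[i] for i, q in t.items()) for tid, t in db.items()}
    threshold = minutil_ratio * sum(tu.values())

    def twu(itemset):
        # Sum of the utilities of the transactions that contain the itemset.
        return sum(tu[tid] for tid, t in db.items() if itemset.issubset(t))

    # Phase 1: level-wise candidate generation, pruned with the TWU bound.
    level = [frozenset([i]) for i in sorted(profit)]
    level = [x for x in level if twu(x) >= threshold]
    candidates = []
    while level:
        candidates.extend(level)
        kept = set(level)
        next_level = set()
        for x, y in combinations(level, 2):
            union = x | y
            # Join two k-itemsets into a (k+1)-itemset whose k-subsets all survived.
            if len(union) == len(x) + 1 and all(union - {i} in kept for i in union):
                if twu(union) >= threshold:
                    next_level.add(union)
        level = list(next_level)

    # Phase 2: compute the exact utility of every candidate and filter.
    result = {}
    for x in candidates:
        u = sum(sum(t[i] * profit[i] for i in x) for t in db.values() if x.issubset(t))
        if u >= threshold:
            result[tuple(sorted(x))] = u
    return result, len(candidates)


huis, n_candidates = two_phase(DB, PROFIT, 0.4)
print(huis, n_candidates)
```

For minutil = 40% this returns {a, b} with utility 27 and {b, c} with utility 30 after examining 7 candidates instead of all 15 itemsets. Five of the seven candidates are still discarded in phase 2, which illustrates how loose the bound is.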

Tree based Approaches

To address the issues of level-wise, candidate-generation-based approaches to HUIM, tree-based and pattern-growth approaches were introduced. The first such algorithm, CTU-Mine, was designed by Erwin et al. 2007 [3]. It follows the pattern-growth concept to avoid candidate generation. The tree-based IHUP algorithm was proposed to overcome the issues of CTU-Mine and to derive high utility itemsets efficiently without candidate generation. It is also able to perform incremental mining of HUIs. Lin et al. 2011 designed the HUP-tree [10] to represent the entire database in a compressed tree format, which helps improve the efficiency of the HUIM process. Tseng et al. 2010 [14] introduced a more compressed tree, the UP-tree, that can represent all the itemsets in a tree structure, and proposed the UP-Growth algorithm to mine high utility itemsets using the UP-tree. To overcome the difficulties of UP-Growth, minimal node utility values were introduced in each path of the UP-tree; the resulting algorithm is named UP-Growth+. UP-Growth+ outperforms all the other approaches in terms of both space and execution time. However, all the algorithms mentioned suffer from too many candidates, because they rely on a loose upper bound. Tree-based algorithms and their limitations are presented in Table 5.

Table 5: Tree based Approaches of HUIM

Year | Algorithm | Approach | Limitations

2007 | CTU-Mine [3] | CTU-Mine is based on the CP-Tree, which is used to store itemsets in a hierarchical manner. It uses the TWU pruning strategy to avoid unnecessary computation. | Since it is based on the CP-tree, it is efficient on dense databases.

2010 | UP-Growth [14] | DLN, DLU and DGU pruning strategies are proposed to discover HUIs. It constructs the UP-tree to keep itemsets in a tree. | Since it is a tree-based approach, it may suffer from too many recursive calls. Efficient on dense databases.

2011 | IHUP [10] | Tree-based approach for HUIM. It constructs the HUP-tree to maintain itemsets in a tree. | Since it is a tree-based approach, it may suffer from too many recursive calls.
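The idea of compressing the database into a tree can be illustrated with a plain prefix tree. The sketch below is loosely modeled on the tree-based approaches above and does not reproduce the exact node fields or pruning strategies of the UP-tree or HUP-tree. The assumptions made here are: items whose transaction-weighted utility is below an absolute threshold `minutil` are dropped, the remaining items are inserted in descending order of that bound, and each node accumulates the utility of the path prefix that ends at it.

```python
# Table 1 (quantities) and Table 2 (unit profits) of the running example.
DB = {
    1: {"a": 2, "b": 1, "d": 2},
    2: {"b": 2, "c": 1},
    3: {"a": 1, "b": 2, "c": 3},
    4: {"b": 1, "c": 1, "d": 2},
}
PROFIT = {"a": 5, "b": 4, "c": 2, "d": 3}


class Node:
    def __init__(self, item):
        self.item = item
        self.count = 0     # number of transactions passing through this node
        self.utility = 0   # accumulated utility of the path prefix ending here
        self.children = {}


def build_tree(db, profit, minutil):
    tu = {tid: sum(q * profit[i] for i, q in t.items()) for tid, t in db.items()}
    twu = {}
    for tid, t in db.items():
        for item in t:
            twu[item] = twu.get(item, 0) + tu[tid]
    # Keep promising items only, ordered by descending bound (ties by name).
    order = sorted((i for i in twu if twu[i] >= minutil), key=lambda i: (-twu[i], i))
    root = Node(None)
    for t in db.values():
        node, prefix_utility = root, 0
        for item in order:
            if item not in t:
                continue
            prefix_utility += t[item] * profit[item]
            node = node.children.setdefault(item, Node(item))
            node.count += 1
            node.utility += prefix_utility
    return root


def dump(node, path=()):
    """Flatten the tree into (path, count, utility) rows for inspection."""
    rows = []
    for item, child in sorted(node.children.items()):
        rows.append((path + (item,), child.count, child.utility))
        rows.extend(dump(child, path + (item,)))
    return rows


root = build_tree(DB, PROFIT, minutil=35)
for row in dump(root):
    print(row)
# (('b',), 4, 24)
# (('b', 'a'), 1, 14)
# (('b', 'c'), 3, 30)
# (('b', 'c', 'a'), 1, 19)
```

With `minutil=35`, item d (bound 32) is dropped and the four transactions share a single branch starting at b, so the tree has four nodes instead of ten item occurrences.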



Projection based Pattern-growth Approaches

Tree-based approaches have to traverse the tree recursively to derive all the high utility itemsets. To overcome this limitation, a projection-based approach was used by Hong and Tseng 2003 [7]. It adopts a novel pruning strategy and a novel indexing mechanism to reduce memory usage and speed up execution. IHUP [3] and Lan et al. 2012 [9] proposed pattern-growth-based approaches to discover HUIs faster. For a better understanding of the behavior of HUIM, the PBAU (Projection-Based Average Utility) approach was introduced to mine high average utility itemsets. It uses a measure called average utility. Projection-based pattern-growth algorithms for HUIM and their limitations are presented in Table 6.

Table 6: Projection based Approaches of HUIM

Year | Algorithm | Approach | Limitations

2003 | CTU-Mine [3] | CTU-Mine is based on the CP-Tree, which is used to store itemsets in a hierarchical manner. It uses the TWU pruning strategy to avoid unnecessary computation. | Since it is based on the CP-tree, it is efficient on dense databases.

2005 | Two-Phase [14] | Two-Phase finds HUIs in two phases, where all the candidates are explored in one phase and checked in the other phase. | Although TWDC is used to reduce the search space, it is a level-wise approach and may suffer from a combinatorial explosion of the search space.

2011 | IHUP [10] | Combination of tree and pattern-growth approaches for HUIM. It constructs the HUP-tree to maintain itemsets in a tree. | Since it is a tree-based approach, it may suffer from too many recursive calls.

2012 | PBAU [5] | A projection-based approach is used to derive high average utility itemsets. It constructs projected databases recursively to derive HUIs. | Since it is a projection-based approach, it may suffer from too many projected databases.
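Two building blocks of the projection-based approaches can be sketched on the sample database of Table 1. The paper does not define average utility; the sketch assumes the usual definition, the utility of an itemset divided by the number of items in it. The `project` helper shows a standard pattern-growth projection under a lexicographic item order, which is also an assumption of this example and not necessarily the indexing scheme used by PBAU.

```python
# Table 1 (quantities) and Table 2 (unit profits) of the running example.
DB = {
    1: {"a": 2, "b": 1, "d": 2},
    2: {"b": 2, "c": 1},
    3: {"a": 1, "b": 2, "c": 3},
    4: {"b": 1, "c": 1, "d": 2},
}
PROFIT = {"a": 5, "b": 4, "c": 2, "d": 3}


def average_utility(itemset):
    """Utility of the itemset in the database divided by its number of items."""
    items = set(itemset)
    total = sum(
        sum(t[i] * PROFIT[i] for i in items)
        for t in DB.values()
        if items.issubset(t)
    )
    return total / len(items)


def project(db, item):
    """Transactions containing `item`, restricted to the items that follow it."""
    return {
        tid: {i: q for i, q in t.items() if i > item}
        for tid, t in db.items()
        if item in t
    }


print(average_utility("ab"))  # 27 / 2 = 13.5
print(project(DB, "b"))
# {1: {'d': 2}, 2: {'c': 1}, 3: {'c': 3}, 4: {'c': 1, 'd': 2}}
```

Extensions of a prefix are then searched only inside its projected database, which is why a deep search may create many projected databases.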

Hybrid Approaches

The approaches above derive HUIs either in k passes or in two passes. In real-time applications, the running algorithm needs to keep up with the application data. This has motivated researchers to propose single-phase algorithms. One of these algorithms is HUI-Miner [11]. It uses a utility-list structure to keep the item occurrence information and the remaining utility information, which reduces memory usage and speeds up execution. HUI-Miner outperforms UP-Growth and UP-Growth+. However, it performs too many join operations to generate high utility itemsets. To overcome this limitation, Fournier-Viger et al. [5] proposed the FHM algorithm. It uses the estimated utility co-occurrence structure (EUCS) to avoid join operations. The HUP-Miner [8] algorithm was also proposed to reduce the join operations on utility-lists. It includes the ideas of database partitioning and LA-Prune to stop the utility calculation of unpromising itemsets early. As a further reduction, the d2HUP algorithm [11] was designed to avoid unpromising itemsets using a hybrid approach. It also uses a hyper-structure to further reduce memory. To discover HUIs more efficiently, the EFIM algorithm was proposed [17]. Two novel measures, revised sub-tree utility and local utility, are introduced to reduce the search space more efficiently. To compute upper bounds in linear space and time, a new array-based counting mechanism is introduced. To reduce the cost of database scans, the projection method HDP and the merging method HTM (High Utility merging) are used in EFIM. Thus, it outperforms other approaches by reducing execution time and space to linear.
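The utility-list and the EUCS can be illustrated on the sample database of Table 1. The sketch below makes several simplifying assumptions: items are processed in lexicographic order, only single items and pairs are handled, and the names `utility_list`, `join`, `extension_bound` and `build_eucs` are chosen for this example rather than taken from [11] or [5].

```python
from itertools import combinations

# Table 1 (quantities) and Table 2 (unit profits) of the running example.
DB = {
    1: {"a": 2, "b": 1, "d": 2},
    2: {"b": 2, "c": 1},
    3: {"a": 1, "b": 2, "c": 3},
    4: {"b": 1, "c": 1, "d": 2},
}
PROFIT = {"a": 5, "b": 4, "c": 2, "d": 3}


def utility_list(item):
    """One (tid, iutil, rutil) entry per transaction that contains the item."""
    entries = []
    for tid, t in DB.items():
        if item in t:
            iutil = t[item] * PROFIT[item]
            # Remaining utility: items that come after `item` in the assumed order.
            rutil = sum(q * PROFIT[i] for i, q in t.items() if i > item)
            entries.append((tid, iutil, rutil))
    return entries


def join(ul_x, ul_y):
    """Utility-list of {x, y} for single items x < y, without rescanning DB."""
    by_tid = {tid: (iutil, rutil) for tid, iutil, rutil in ul_y}
    return [
        (tid, iutil + by_tid[tid][0], by_tid[tid][1])
        for tid, iutil, _ in ul_x
        if tid in by_tid
    ]


def exact_utility(ul):
    return sum(iutil for _, iutil, _ in ul)


def extension_bound(ul):
    """If this sum is below the threshold, no extension of the itemset is high."""
    return sum(iutil + rutil for _, iutil, rutil in ul)


def build_eucs():
    """Pair -> sum of the utilities of the transactions containing both items."""
    eucs = {}
    for t in DB.values():
        tu = sum(q * PROFIT[i] for i, q in t.items())
        for pair in combinations(sorted(t), 2):
            eucs[pair] = eucs.get(pair, 0) + tu
    return eucs


ul_ab = join(utility_list("a"), utility_list("b"))
print(ul_ab)                               # [(1, 14, 6), (3, 13, 6)]
print(exact_utility(ul_ab))                # 27
print(extension_bound(utility_list("c")))  # 16
print(build_eucs()[("c", "d")])            # 12
```

With a threshold of 24.4, the bound of 16 for item c rules out every extension of c, and the EUCS entry of 12 for the pair (c, d) shows that the join for {c, d} can be skipped without building its utility-list.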

4. Experimental Results

The HUIM algorithms mentioned above are tested on the standard datasets presented in Table 7, with the following configuration: a computer system with 1 GB RAM and the Java API.

Table 7: Characteristics of Datasets

Dataset | No. of Items | No. of Transactions | Avg. Length | Max. Length | Type

Chess | 75 | 3196 | 37 | 37 | Dense

Foodmart | 1559 | 4141 | 4.4 | 14 | Sparse

To show the performance of HUIM, the FHM, Two-Phase and HUI-Miner algorithms are selected from among those discussed above. Execution time and memory usage are considered as the performance parameters, and the results are presented in Figure 1.

Figure 1: (a) Foodmart: MinUtil vs Execution time (b) Chess: MinUtil vs Execution time (c) Foodmart: MinUtil vs Candidates (d) Foodmart: MinUtil vs Memory usage (e) Chess: MinUtil vs Candidates (f) Chess: MinUtil vs Memory usage

5. Conclusion

Traditional Frequent Itemset Mining algorithms are designed to extract frequent itemsets from transactional databases. FIM does not find patterns that are profitable but infrequent. High Utility Itemset Mining was introduced to derive profitable itemsets regardless of their frequency. Methods for deriving such patterns are discussed extensively in this paper, including the terminology, the various methods with their merits and demerits, and variations of the HUIM problem statement. In the experiments, both sparse and dense datasets were used to evaluate the performance of the HUIM algorithms FHM, Two-Phase and HUI-Miner. The results show their performance in terms of execution time, memory usage and number of candidates. Although the techniques suffer from a huge search space and long execution times, they attract researchers because of the variety of applications that need them. This survey presents various approaches and applications of HUIM.

References

[1] Ahmed C.F., Tanbeer S.K., Jeong B.S., Le Y.K., Efficient tree structures for high utility pattern mining in incremental databases, IEEE Transactions on Knowledge and Data Engineering 21(12) (2009), 1708–1721.

[2] Chan R., Yang Q., Shen Y.D., Mining high utility itemsets, Proceedings of the IEEE International Conference on Data Mining, Melbourne, FL (2003), 19–26.

[3] Fournier-Viger P., Lin J.C.W., Gueniche T., Barhate P., Efficient incremental high utility itemset mining, Proceedings of the ASE BigData & Social Informatics, Kaohsiung, Taiwan (2015).

[4] Fournier-Viger P., Lin, J.C.W., Vo B., Chi T.T., Zhang J., Le H.B., A survey of itemset mining, WIREs: Data Mining and Knowledge Discovery 7(4) (2017).

[5] Fournier-Viger P., Wu C.W., Zida S., Tseng V.S., FHM: Faster high-utility itemset mining using estimated utility co-occurrence pruning, Proceedings of the International symposium on methodologies for intelligent systems, Roskilde, Denmark, (2014), 83–92.

[6] Han J., Pei J., Yin Y., Mao, R., Mining frequent patterns without candidate generation: A frequent-pattern tree approach, Data Mining and Knowledge Discovery 8(1) (2004), 53–87.

[7] Hong T.P., Lin K.Y., Chien B.C., Mining fuzzy multiple-level association rules from quantitative data, Applied Intelligence 18(1) (2003), 79–90.

[8] Krishnamoorthy S., Pruning strategies for mining high utility itemsets, Expert Systems with Applications 42(5) (2015), 2371–2381.

[9] Lan G.C., Hong T.P., Huang J.P., Tseng V.S., On-shelf utility mining with negative item values, Expert Systems with Applications 41(7) (2014), 3450–3459.

[10] Lin C.W., Hong T.P., Lu W.H., An effective tree structure for mining high utility itemsets, Expert Systems with Applications 38(6) (2011), 7419–7424.



[11] Liu M., Qu J., Mining high utility itemsets without candidate generation, Proceedings of the ACM International Conference on Information and Knowledge Management (2012), 55–64.

[12] Mai T., Vo B., Nguyen L.T.T., A lattice-based approach for mining high utility association rules, Information Sciences 399 (2017), 81–97.

[13] Pei J., Han J., Constrained frequent pattern mining: A pattern-growth view, ACM SIGKDD Explorations Newsletter 4(1) (2002), 31–39.

[14] Tseng V.S., Wu C.W., Shie B.E., Yu P.S., UP-Growth: An efficient algorithm for high utility itemset mining, Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2010), 253–262.

[15] Yao H., Hamilton H.J., Mining itemset utilities from transaction databases, Data & Knowledge Engineering 59(3) (2006), 603–626.

[16] Yao H., Hamilton H.J., Butz C.J., A foundational approach to mining itemset utilities from databases, Proceedings of the SIAM International Conference on Data Mining (2004), 211–225.

[17] Yun U., Ryang H., Lee G., Fujita H., An efficient algorithm for mining high utility patterns from incremental databases with one database scan, Knowledge-Based Systems 124 (2017), 188–206.

