Comprehensive Study on High Utility Itemsets Mining with Various Approaches and its Applications
M. Suneetha (1) and M.V.P. Chandra Sekhara Rao (2)
(1) Research Scholar, Acharya Nagarjuna University; Assistant Professor, Department of IT, GMRIT, Rajam, Srikakulam, Andhra Pradesh, India.
(2) Professor, Department of CSE, RVR & JC College of Engineering, Guntur, AP, India.
Abstract
Discovering high-utility itemsets in transaction databases is a popular data mining task. High utility itemset mining (HUIM) addresses the limitations of frequent itemset mining by introducing measures of interestingness that reflect the significance of an itemset beyond its frequency of occurrence. The HUIM problem uses the internal and external utilities of items to discover interesting patterns in a given transactional database. This paper surveys HUIM algorithms, including Apriori-based, tree-based, projection-based and hybrid approaches. Experimental evaluations on both dense and sparse datasets show the performance of the FHM, Two-Phase and HUI-Miner algorithms in terms of execution time, memory usage and number of candidates. Finally, the characteristics and limitations of the approaches are highlighted.
Key Words: minutil, itemsets, high utility itemset.
International Journal of Pure and Applied Mathematics, Volume 119 No. 15 2018, 7-17. ISSN: 1314-3395 (on-line version). url: http://www.acadpubl.eu/hub/ Special Issue
1. Introduction
In real-world applications, data mining (DM) techniques are used to discover hidden and previously unknown knowledge from databases in order to support strategic decisions. A basic example is the supermarket, where DM techniques identify pairs of products that are frequently purchased together so that they can be promoted. Two fundamental tasks are Frequent Itemset Mining (FIM) and Association Rule Mining (ARM) [4]. FIM finds itemsets whose number of occurrences is not less than a given threshold, and thereby helps to reveal hidden associations among the items in a database. The basic FIM algorithms use a candidate generation strategy to enumerate possible frequent itemsets and scan the database to determine which of them are frequent. Many extensions have been proposed for the mining process. One of them is the pattern-growth approach [6], which takes two database scans to discover frequent itemsets without generating candidate itemsets. To address the issue of patterns of low interest, constraint-based pattern mining algorithms [13] were introduced. Although traditional FIM algorithms are popular, they can only reveal knowledge in binary databases, where an item is either present in a transaction or not. An alternative is Quantitative Association Rule Mining (QARM), for which extensions have also been proposed [7]. FIM, ARM and QARM do not consider whether the itemsets are profitable. Frequent Pattern Mining (FPM) algorithms successfully extract patterns from transactional databases, but they have limitations: they assume that each item occurs with a quantity of one, whereas real applications involve more than one attribute per item as well as the order among the items. FPM techniques use a support-based framework to derive frequent patterns.
However, the value of a pattern to the organization is important, and FPM does not capture it. For example, FIM-based algorithms may miss highly profitable products such as diamonds or cars, which are sold rarely but yield a high profit. As another example, a sales analyst doing retail research who needs to find out which itemsets earn the maximum sales revenue for the stores will define the utility of an itemset as the monetary profit that the store earns by selling each unit of that itemset.
To address the limitations of FIM, utility is introduced into FPM. Mining patterns whose utility (for example, profit) is high is called High Utility Pattern Mining [16].
2. High Utility Itemset Mining (HUIM)
High Utility Itemset Mining takes a transaction database as input, in which each item in a transaction is associated with a quantity, its internal utility. Each item is also associated with a quality or unit profit, its external utility. The utility can be measured in terms of cost, quality, profit, or other expressions of user preferences. The utility of an item in a transaction is the product of its quantity and its profit.
The utility of an itemset is the sum of the utilities of its items over the transactions of the database that contain it. The main task of High Utility Itemset Mining (HUIM) is to discover the itemsets whose utility is not less than a user-specified utility threshold. To illustrate HUIM, consider the retail store example in Table 1. The database contains four transactions. The first column gives the transaction id, and the second column lists the purchased items from {a, b, c, d}, each paired with the quantity purchased. Table 2 shows the profit value of each item. For example, the first column of Table 2 shows that item a is associated with a profit of 5.
Table 1: Sample Transaction Database
TID Items with Quantity
1 (a,2)(b,1)(d,2)
2 (b,2)(c,1)
3 (a,1)(b,2)(c,3)
4 (b,1)(c,1)(d,2)
Table 2: External Utility
Item a b c d
Profit 5 4 2 3
Utility Framework
This section starts with the problem statement and then defines the terminology, techniques and algorithms that are used for deriving high utility itemsets.
Problem Statement:
Let I = {i1, i2, …, in} be a set of distinct items. An itemset is a nonempty set of one or more items, denoted X = {x1, x2, …, xk}, where xj ∈ I for all j = 1, 2, …, k. The size of an itemset, denoted |X|, is the number of items in it. For simplicity, the brackets are omitted when an itemset contains a single item, and the items of an itemset are listed in lexicographical order.
Each item i in a transaction of the database is associated with a quantity, called its internal utility. Each item is also associated with a quality, or profit per unit, called its external utility. A k-itemset is an itemset with k items. The utility of an item i is the product of its internal and external utilities. The utility of an itemset X is the sum of the utility values of the items of X. An itemset X is a high utility itemset if the utility of X is not less than minutil. The main task of High Utility Itemset Mining is to find all high utility itemsets. The definitions and notation for the basic terminology are presented as follows.
The Internal Utility IU(i, T) is the quantity associated with item i in a transaction T. For example, item a in T1 of Table 1 is associated with quantity 2, hence IU(a, T1) = 2. The External Utility EU(i) is the value associated with item i in the utility table (Table 2). For example, item a in Table 2 is associated with a profit of 5, hence EU(a) = 5.
The utility of an item, UI(i, T), is the product of the internal and external utility values of item i in transaction T. For example, for item b in T2, UI(b, T2) = IU(b, T2) × EU(b) = 2 × 4 = 8. The utility of an itemset X in a transaction T, denoted UT(X, T), is the sum of the utility values of all the items of X in T. For example, for itemset <ab> in T1, UT(<ab>, T1) = UI(a, T1) + UI(b, T1) = 10 + 4 = 14.
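The item and itemset utility definitions above can be checked with a few lines of Python. This is a minimal sketch, assuming the database of Table 1 and the profits of Table 2 are held in plain dictionaries; the names DB, PROFIT, item_utility and itemset_utility_in_transaction are illustrative and do not come from the paper.

```python
# Table 1: transaction id -> {item: purchased quantity (internal utility)}
DB = {
    1: {"a": 2, "b": 1, "d": 2},
    2: {"b": 2, "c": 1},
    3: {"a": 1, "b": 2, "c": 3},
    4: {"b": 1, "c": 1, "d": 2},
}
# Table 2: item -> unit profit (external utility)
PROFIT = {"a": 5, "b": 4, "c": 2, "d": 3}


def item_utility(item, tid):
    """UI(i, T) = IU(i, T) * EU(i); 0 when the item is absent from T."""
    return DB[tid].get(item, 0) * PROFIT[item]


def itemset_utility_in_transaction(itemset, tid):
    """UT(X, T): sum of UI(i, T) over i in X, or 0 when X is not contained in T."""
    if not set(itemset).issubset(DB[tid]):
        return 0
    return sum(item_utility(i, tid) for i in itemset)


print(item_utility("b", 2))                           # 2 * 4 = 8
print(itemset_utility_in_transaction({"a", "b"}, 1))  # 10 + 4 = 14
```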
The utility of X in the database, denoted UT(X, db), is the sum of the utility values of the itemset over the transactions that contain it: $UT(X, db) = \sum_{X \subseteq T \wedge T \in db} UT(X, T)$. For example, the utility of a in db is UT({a}, db) = UT(a, T1) + UT(a, T2) + UT(a, T3) + UT(a, T4) = 2×5 + 0 + 1×5 + 0 = 15. HUIM is defined as the process of deriving the itemsets whose utility is not less than the minutil given by the user. Unlike the support measure used in FIM, utility satisfies neither the downward closure property nor the anti-monotonic property: a superset of a low utility itemset can be a high utility itemset. Hence HUIM has to explore a larger search space in order to keep track of all possible itemsets.
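The following sketch computes UT(X, db) on the sample database and illustrates why utility is not anti-monotonic: extending an itemset can raise or lower its utility. It assumes the data of Tables 1 and 2 stored as dictionaries; the function name utility is a hypothetical helper, not part of any cited algorithm.

```python
# Table 1 (quantities) and Table 2 (unit profits) of the running example
DB = {
    1: {"a": 2, "b": 1, "d": 2},
    2: {"b": 2, "c": 1},
    3: {"a": 1, "b": 2, "c": 3},
    4: {"b": 1, "c": 1, "d": 2},
}
PROFIT = {"a": 5, "b": 4, "c": 2, "d": 3}


def utility(itemset, db=DB, profit=PROFIT):
    """UT(X, db): sum of UT(X, T) over the transactions T that contain X."""
    items = set(itemset)
    return sum(
        sum(t[i] * profit[i] for i in items)
        for t in db.values()
        if items.issubset(t)
    )


print(utility({"a"}))       # 15
print(utility({"a", "b"}))  # 27, larger than the utility of its subset {a}
print(utility({"d"}))       # 12
print(utility({"c", "d"}))  # 8, smaller than the utility of its subset {d}
```

Because the utility of a superset can go either way, an itemset cannot be discarded only because one of its subsets has a low utility.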
To reduce the memory required, the utility of a transaction is used as an upper bound on the utility of the itemsets it contains in database db. The utility of a transaction T, denoted UT(T), is the sum of the utilities of all the items present in T. In other words, $UT(T) = \sum_{j=1}^{m} UT(i_j, T)$, where i_1, …, i_m are the items of T. For example, UT(T1) = UI(a, T1) + UI(b, T1) + UI(d, T1) = 2×5 + 1×4 + 2×3 = 20. The revised transaction database with transaction utilities is presented in Table 3.
Table 3: Sample Transaction Database
TID Items with Quantity Transaction Utility
1 (a,2)(b,1)(d,2) 20
2 (b,2)(c,1) 10
3 (a,1)(b,2)(c,3) 19
4 (b,1)(c,1)(d,2) 12
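The Transaction Utility column of Table 3 can be reproduced with the short sketch below. It assumes the dictionary encoding of Tables 1 and 2 used for illustration; transaction_utility is a hypothetical helper name.

```python
# Table 1 (quantities) and Table 2 (unit profits) of the running example
DB = {
    1: {"a": 2, "b": 1, "d": 2},
    2: {"b": 2, "c": 1},
    3: {"a": 1, "b": 2, "c": 3},
    4: {"b": 1, "c": 1, "d": 2},
}
PROFIT = {"a": 5, "b": 4, "c": 2, "d": 3}


def transaction_utility(transaction, profit):
    """UT(T): sum of quantity * unit profit over all items of the transaction."""
    return sum(quantity * profit[item] for item, quantity in transaction.items())


TU = {tid: transaction_utility(t, PROFIT) for tid, t in DB.items()}
for tid, value in TU.items():
    print(tid, value)
```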
The utility of the database db, denoted UT(db), is the sum of the utilities of the transactions in the database. In other words, $UT(db) = \sum_{T \in db} UT(T, db)$. For example, UT(db) = UT(T1, db) + UT(T2, db) + UT(T3, db) + UT(T4, db) = 20 + 10 + 19 + 12 = 61. An itemset X is called a high utility itemset iff it satisfies the condition $UT(X, db) \geq minutil \times UT(db)$. In other words, an itemset is high if its utility in the database is not less than the minutil threshold.
For example, for minutil = 10%, the threshold is 0.10 × 61 = 6.1, and UT({a}, db) = UT(a, T1) + UT(a, T3) = 10 + 5 = 15 ≥ 6.1, hence {a} is high.
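To make the high utility condition concrete, the sketch below enumerates every nonempty itemset of the sample database and keeps those whose utility reaches minutil × UT(db). It is an exhaustive baseline written only for illustration: the function name high_utility_itemsets is an assumption, and the enumeration is exponential in the number of items, so it is usable only on toy data.

```python
from itertools import combinations

# Table 1 (quantities) and Table 2 (unit profits) of the running example
DB = {
    1: {"a": 2, "b": 1, "d": 2},
    2: {"b": 2, "c": 1},
    3: {"a": 1, "b": 2, "c": 3},
    4: {"b": 1, "c": 1, "d": 2},
}
PROFIT = {"a": 5, "b": 4, "c": 2, "d": 3}


def high_utility_itemsets(db, profit, min_ratio):
    """Return {itemset tuple: utility} for itemsets with utility >= min_ratio * UT(db)."""
    total = sum(q * profit[i] for t in db.values() for i, q in t.items())  # UT(db)
    threshold = min_ratio * total
    items = sorted({i for t in db.values() for i in t})
    result = {}
    for size in range(1, len(items) + 1):
        for itemset in combinations(items, size):
            # UT(X, db): add UT(X, T) for every transaction containing X
            u = sum(
                sum(t[i] * profit[i] for i in itemset)
                for t in db.values()
                if set(itemset).issubset(t)
            )
            if u >= threshold:
                result[itemset] = u
    return result


print(len(high_utility_itemsets(DB, PROFIT, 0.10)))  # 13 itemsets reach 6.1
print(high_utility_itemsets(DB, PROFIT, 0.30))       # itemsets reaching 18.3
```

With a 10% threshold almost every itemset qualifies on this small database; a higher ratio such as 30% leaves only six itemsets.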
Problem Definition:
Given a transactional database TDB, an external utility table and a user-specified utility threshold min_util, the aim of HUIM is to discover the itemsets whose utility is greater than or equal to min_util. HUIM does not support the pruning strategies of FIM directly. To improve the performance of HUIM, the Weighted Transaction Utility is used. The Weighted Transaction Utility of an itemset X, denoted WTU(X, TDB), is the sum of the utilities of the transactions that contain X. In other words, $WTU(X, TDB) = \sum_{X \subseteq T \wedge T \in TDB} UT(T, TDB)$. WTU is introduced to remove unnecessary itemsets. An itemset X is a High Weighted Transaction Utility itemset, denoted HWTU(X), if WTU(X, TDB) is not less than the minutil threshold, that is, if $WTU(X, TDB) \geq minutil \times UT(TDB)$. WTU is an upper bound on the utility of an itemset: the utility of X never exceeds WTU(X, TDB). For example, for itemset {a, b} in TDB and minutil = 10%, WTU({a, b}, TDB) = 20 + 19 = 39 ≥ 6.1, so {a, b} is a high weighted transaction utility itemset. If the WTU of an itemset is less than the minutil threshold, then none of its supersets is high.
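The two properties of WTU used for pruning, namely that it bounds the utility from above and that it never grows when an itemset is extended, can be verified on the sample database. This is a minimal sketch under the dictionary encoding assumed for illustration; wtu and utility are hypothetical helper names.

```python
from itertools import combinations

# Table 1 (quantities) and Table 2 (unit profits) of the running example
DB = {
    1: {"a": 2, "b": 1, "d": 2},
    2: {"b": 2, "c": 1},
    3: {"a": 1, "b": 2, "c": 3},
    4: {"b": 1, "c": 1, "d": 2},
}
PROFIT = {"a": 5, "b": 4, "c": 2, "d": 3}


def wtu(itemset, db=DB, profit=PROFIT):
    """WTU(X, TDB): sum of the utilities of the transactions that contain X."""
    items = set(itemset)
    return sum(
        sum(q * profit[i] for i, q in t.items())
        for t in db.values()
        if items.issubset(t)
    )


def utility(itemset, db=DB, profit=PROFIT):
    """UT(X, TDB): exact utility of X over the transactions that contain X."""
    items = set(itemset)
    return sum(
        sum(t[i] * profit[i] for i in items)
        for t in db.values()
        if items.issubset(t)
    )


ALL_ITEMSETS = [
    frozenset(c) for k in range(1, 5) for c in combinations("abcd", k)
]
# Upper bound: the utility of X never exceeds WTU(X).
BOUND_HOLDS = all(utility(x) <= wtu(x) for x in ALL_ITEMSETS)
# Downward closure: extending X can only keep or lower its WTU.
CLOSURE_HOLDS = all(
    wtu(y) <= wtu(x) for x in ALL_ITEMSETS for y in ALL_ITEMSETS if x < y
)

print(wtu({"a", "b"}))             # 20 + 19 = 39
print(wtu({"c", "d"}))             # only T4 contains {c, d}: 12
print(BOUND_HOLDS, CLOSURE_HOLDS)  # True True
```

For a threshold of 18.3 (30% of 61), for instance, WTU({c, d}) = 12 already rules out {c, d} and all of its supersets without computing their exact utilities.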
3. HUIM Algorithms
Some of the techniques in HUIM are listed and classified based on their
methodology.
i. Apriori based approaches
ii. Tree based approaches
iii. Projection based approaches
iv. Hybrid approaches
Apriori based Approaches
The basic idea of HUIM is inspired by the algorithm proposed by Chan et al. [2]. Yao et al. [16] designed a utility model that considers purchase quantities and profits to find high utility itemsets. The model is named Mining with Expected Utility (MEU). Because utility does not satisfy the downward closure property, the MEU model leads to huge memory consumption. To address this issue, Liu et al. (2005) [12] proposed the Two-Phase algorithm. In the first phase, it discovers all the candidate high utility itemsets level by level using a candidate generation technique. In the second phase, it scans the database to calculate the exact utility of the candidates and outputs the high utility itemsets. One of the main features of the Two-Phase algorithm is the Transaction Weighted Downward Closure (TWDC) property, which is introduced to reduce the search space by using the transaction weighted utility as an upper bound on the utility of an itemset. TWDC helps to eliminate some of the unnecessary itemsets. However, a huge number of candidate itemsets still has to be maintained. The reason is that the bound is calculated from all the items that appear in the transactions together with the itemset, which makes it a loose upper bound. As a result, the algorithm may suffer from slow execution and high memory usage. The Apriori-based algorithms and their limitations are presented in Table 4.
Table 4: Apriori based Approaches of HUIM
Year | Algorithm | Approach | Limitations
2003 | Top-K objective-directed data mining [1] | Top-K objective-directed data mining, which focuses on mining the top-K high utility closed patterns that directly support a given business objective. | Since it does not have a mathematical model, it is not guaranteed to find all high utility itemsets. It is too specific.
2004 | MEU [16] | Mining with Expected Utility (MEU); utility model; upper bound. | Does not support the downward closure property, which leads to huge space requirements.
2006 | UMining, UMining-H [15] | UMining is based on Apriori with pruning strategy 1. UMining-H is based on Apriori with pruning strategy 2. | Since they are Apriori-based approaches without downward closure, they may scan the database several times with too many candidate itemsets.
2005 | Two-Phase Algorithm [12] | Two-Phase performs HUIM in two phases, where all the candidates are explored in one phase and checked in another phase. | Although TWDC is used to reduce the search space, it is a level-wise approach and may suffer from a combinatorial explosion of the search space.
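The two phases described above for the Two-Phase algorithm can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the function name two_phase is hypothetical, the weighted transaction utility is recomputed by rescanning an in-memory dictionary instead of being counted once per database pass, and the data are those of Tables 1 and 2.

```python
from itertools import combinations

# Table 1 (quantities) and Table 2 (unit profits) of the running example
DB = {
    1: {"a": 2, "b": 1, "d": 2},
    2: {"b": 2, "c": 1},
    3: {"a": 1, "b": 2, "c": 3},
    4: {"b": 1, "c": 1, "d": 2},
}
PROFIT = {"a": 5, "b": 4, "c": 2, "d": 3}


def two_phase(db, profit, min_ratio):
    """Return (phase-1 candidates, {high utility itemset: utility})."""
    tu = {tid: sum(q * profit[i] for i, q in t.items()) for tid, t in db.items()}
    threshold = min_ratio * sum(tu.values())

    def wtu(itemset):
        # Sum of transaction utilities over the transactions containing the itemset
        return sum(tu[tid] for tid, t in db.items() if itemset.issubset(t))

    # Phase 1: level-wise search keeping itemsets whose WTU reaches the threshold.
    items = sorted({i for t in db.values() for i in t})
    level = [frozenset([i]) for i in items if wtu(frozenset([i])) >= threshold]
    candidates = []
    while level:
        candidates.extend(level)
        next_level = set()
        for x, y in combinations(level, 2):
            union = x | y
            # Join two k-itemsets that differ in exactly one item
            if len(union) == len(x) + 1 and wtu(union) >= threshold:
                next_level.add(union)
        level = list(next_level)

    # Phase 2: compute the exact utility of every candidate and filter.
    high = {}
    for x in candidates:
        u = sum(
            sum(t[i] * profit[i] for i in x)
            for t in db.values()
            if x.issubset(t)
        )
        if u >= threshold:
            high[x] = u
    return candidates, high


CANDIDATES, HIGH = two_phase(DB, PROFIT, 0.30)
print(len(CANDIDATES), len(HIGH))  # 11 candidates, 6 high utility itemsets
```

On this example, phase 1 keeps 11 of the 15 possible itemsets as candidates although only 6 are actually high, which illustrates how loose the bound is.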
Tree based Approaches
To address the issues of level-wise, candidate generation based approaches to HUIM, tree-based and pattern-growth approaches were introduced. The first algorithm, CTU-Mine, was designed by Erwin et al. (2007) [3]. It follows the pattern-growth concept to avoid candidate generation. The tree-based IHUP algorithm was proposed to overcome the issues of CTU-Mine and to derive high utility itemsets efficiently without candidate generation. It also supports incremental mining of high utility itemsets. Lin et al. (2011) designed the HUP-tree [10] to represent the entire database in a compressed tree format, which helps to improve the efficiency of the HUIM process. Tseng et al. (2010) [14] introduced a more compressed tree, the UP-tree, which can represent all the itemsets in a single structure, and proposed the UP-Growth algorithm to mine high utility itemsets using the UP-tree. To overcome the difficulties of UP-Growth, minimal node utility values were introduced in each path of the UP-tree; the resulting algorithm is named UP-Growth+. UP-Growth+ outperforms all the other approaches above in terms of both space and execution time. However, all the mentioned algorithms suffer from too many candidates, because they rely on a loose upper bound. The tree-based algorithms and their limitations are presented in Table 5.
Table 5: Tree based Approaches of HUIM
Year | Algorithm | Approach | Limitations
2007 | CTU-Mine [3] | CTU-Mine is based on the CP-Tree, which is used to store itemsets in a hierarchical manner. It uses the TWU pruning strategy to avoid unnecessary computation. | Since it is based on the CP-tree, it is efficient on dense databases.
2010 | UP-Growth [14] | The DLN, DLU and DGU pruning strategies are proposed to discover HUIs. It constructs the UP-tree to keep itemsets in a tree. | Since it is a tree-based approach, it may suffer from too many recursive calls. Efficient on dense databases.
2011 | IHUP [10] | Tree-based approach for HUIM. It constructs the HUP-tree to maintain itemsets in a tree. | Since it is a tree-based approach, it may suffer from too many recursive calls.
Projection based Pattern-growth Approaches
Tree-based approaches have to traverse the tree recursively to derive all the high utility itemsets. To overcome this limitation, a projection-based approach was used by Hong and Tseng (2003) [7]. It adopts a novel pruning strategy and a novel indexing mechanism to reduce the memory usage and speed up the execution. IHUP [3] and Lan et al. (2012) [9] proposed pattern-growth based approaches to discover HUIs faster. To better understand the behavior of HUIM, the PBAU (Projection-Based Average Utility) approach was introduced to mine high average utility itemsets. It uses a measure called average utility. The projection-based pattern-growth algorithms for HUIM and their limitations are presented in Table 6.
Table 6: Projection based Approaches of HUIM
Year | Algorithm | Approach | Limitations
2003 | CTU-Mine [3] | CTU-Mine is based on the CP-Tree, which is used to store itemsets in a hierarchical manner. It uses the TWU pruning strategy to avoid unnecessary computation. | Since it is based on the CP-tree, it is efficient on dense databases.
2005 | Two-Phase [14] | Two-Phase performs HUIM in two phases, where all the candidates are explored in one phase and checked in another phase. | Although TWDC is used to reduce the search space, it is a level-wise approach and may suffer from a combinatorial explosion of the search space.
2011 | IHUP [10] | Combination of tree and pattern-growth approaches for HUIM. It constructs the HUP-tree to maintain itemsets in a tree. | Since it is a tree-based approach, it may suffer from too many recursive calls.
2012 | PBAU [5] | A projection-based approach is used to derive high average utility itemsets. It constructs projected databases recursively to derive HUIs. | Since it is a projection-based approach, it may suffer from too many projected databases.
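The general idea of growing patterns over recursively projected databases can be sketched as below. This is a generic illustration and not PBAU or any other cited algorithm: the function name mine_projected, the lexicographic processing order and the particular upper bound (prefix utility plus the utility of the remaining items of each projected transaction) are assumptions chosen to keep the example short.

```python
# Table 1 (quantities) and Table 2 (unit profits) of the running example
DB = {
    1: {"a": 2, "b": 1, "d": 2},
    2: {"b": 2, "c": 1},
    3: {"a": 1, "b": 2, "c": 3},
    4: {"b": 1, "c": 1, "d": 2},
}
PROFIT = {"a": 5, "b": 4, "c": 2, "d": 3}


def mine_projected(db, profit, min_util):
    """Return {itemset: utility} for itemsets whose utility is >= min_util."""
    order = sorted(profit)  # processing order of the items (an assumption)
    # A projected transaction is (utility of the prefix, [(item, utility), ...]).
    initial = [
        (0, [(i, t[i] * profit[i]) for i in order if i in t]) for t in db.values()
    ]
    results = {}

    def grow(prefix, pdb):
        exact, bound = {}, {}
        for prefix_utility, items in pdb:
            rest = sum(u for _, u in items)
            for item, u in items:
                # Exact utility of prefix + item in this transaction
                exact[item] = exact.get(item, 0) + prefix_utility + u
                # Bound: prefix + item + every item that may still be appended
                bound[item] = bound.get(item, 0) + prefix_utility + rest
                rest -= u
        for item in sorted(exact):
            new_prefix = prefix + (item,)
            if exact[item] >= min_util:
                results[frozenset(new_prefix)] = exact[item]
            if bound[item] >= min_util:
                # Project: keep transactions containing item, cut at its position
                projected = []
                for prefix_utility, items in pdb:
                    names = [name for name, _ in items]
                    if item in names:
                        k = names.index(item)
                        projected.append((prefix_utility + items[k][1], items[k + 1:]))
                grow(new_prefix, projected)

    grow((), initial)
    return results


RESULT = mine_projected(DB, PROFIT, 0.30 * 61)
print(len(RESULT))  # 6 high utility itemsets for a 30% threshold
```

Each recursive call builds one projected database, which is the source of the limitation noted in Table 6 for projection-based methods.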
Hybrid Approaches
The above approaches derive HUIs either in k passes or in two passes. In real-time applications, the running algorithm has to keep up with the application data. This motivated researchers to propose single-phase algorithms. One of these algorithms is HUI-Miner [11]. It uses a utility-list structure to keep the occurrence information and the remaining utility information of each itemset, which reduces memory usage and speeds up execution. HUI-Miner outperforms UP-Growth and UP-Growth+. However, it needs too many join operations to generate high utility itemsets. To overcome this limitation, Fournier-Viger et al. [5] proposed the FHM algorithm, which uses an Estimated Utility Co-occurrence Structure (EUCS) to avoid join operations. The HUP-Miner [8] algorithm was also proposed to reduce the join operations on utility lists. It includes the ideas of database partitioning and LA-Prune to stop the utility calculation of unpromising itemsets early. As a further reduction, the d2HUP algorithm [11] was designed to avoid unpromising itemsets using a hybrid approach. It also uses a hyper-structure to further reduce memory usage. To discover HUIs more efficiently, the EFIM algorithm was proposed [17]. Two novel measures, revised sub-tree utility and local utility, are introduced to reduce the search space more effectively. To compute the upper bounds in linear space and time, a new array-based counting mechanism is introduced. To reduce the cost of database scans, the projection method HDP and HTM (High Utility merging) are used in EFIM. Thus, it outperforms the other approaches by reducing execution time and space to linear.
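The utility-list idea used by HUI-Miner can be illustrated with the sketch below. It is a simplified reading of the technique under stated assumptions, not the published implementation: each utility list is a dictionary mapping a transaction id to a pair (itemset utility, remaining utility), items are ordered by ascending weighted transaction utility, and the names hui_miner, join and search are hypothetical.

```python
# Table 1 (quantities) and Table 2 (unit profits) of the running example
DB = {
    1: {"a": 2, "b": 1, "d": 2},
    2: {"b": 2, "c": 1},
    3: {"a": 1, "b": 2, "c": 3},
    4: {"b": 1, "c": 1, "d": 2},
}
PROFIT = {"a": 5, "b": 4, "c": 2, "d": 3}


def hui_miner(db, profit, min_util):
    """Return {itemset: utility} using utility lists {tid: (iutil, rutil)}."""
    tu = {tid: sum(q * profit[i] for i, q in t.items()) for tid, t in db.items()}
    twu = {}
    for tid, t in db.items():
        for i in t:
            twu[i] = twu.get(i, 0) + tu[tid]
    # Promising items in ascending order of weighted transaction utility
    order = sorted((i for i in twu if twu[i] >= min_util), key=lambda i: (twu[i], i))
    rank = {item: r for r, item in enumerate(order)}

    # Initial utility lists of single items
    ulists = {i: {} for i in order}
    for tid, t in db.items():
        kept = sorted((i for i in t if i in rank), key=rank.get)
        remaining = sum(t[i] * profit[i] for i in kept)
        for i in kept:
            u = t[i] * profit[i]
            remaining -= u  # utility of the items after i in this transaction
            ulists[i][tid] = (u, remaining)

    results = {}

    def join(ul_x, ul_y, ul_prefix):
        # Utility list of P+x+y from the lists of P+x and P+y (P may be empty)
        out = {}
        for tid, (iu_x, _) in ul_x.items():
            if tid in ul_y:
                iu_y, ru_y = ul_y[tid]
                iu_p = ul_prefix[tid][0] if ul_prefix is not None else 0
                out[tid] = (iu_x + iu_y - iu_p, ru_y)
        return out

    def search(ul_prefix, extensions):
        for k, (x, ul_x) in enumerate(extensions):
            sum_iu = sum(iu for iu, _ in ul_x.values())
            sum_ru = sum(ru for _, ru in ul_x.values())
            if sum_iu >= min_util:
                results[frozenset(x)] = sum_iu
            # Extend x only if its utility plus remaining utility can reach min_util
            if sum_iu + sum_ru >= min_util:
                deeper = [
                    (x + (y[-1],), join(ul_x, ul_y, ul_prefix))
                    for y, ul_y in extensions[k + 1:]
                ]
                search(ul_x, deeper)

    search(None, [((i,), ulists[i]) for i in order])
    return results


RESULT = hui_miner(DB, PROFIT, 0.30 * 61)
print(len(RESULT))  # 6 high utility itemsets for a 30% threshold
```

The exact utility of every explored itemset comes directly from its utility list, so no separate verification phase is needed; the cost moves to the join operations, which is the limitation that FHM and HUP-Miner aim to reduce.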
4. Experimental Results
The HUIM algorithms mentioned above were tested on the standard datasets presented in Table 7, with the following configuration: a computer system with 1 GB of RAM and the Java API.
Table 7: Characteristics of Datasets
Dataset | No. of Items | No. of Transactions | Avg. length | Max. length | Type
Chess | 75 | 3196 | 37 | 37 | Dense
Foodmart | 1559 | 4141 | 4.4 | 14 | Sparse
To show the performance of HUIM, the FHM, Two-Phase and HUI-Miner algorithms are selected from among those discussed above. Execution time, memory usage and number of candidates are used as performance parameters, and the results are presented in Figure 1.
Figure 1: (a) Foodmart: MinUtil vs. execution time (b) Chess: MinUtil vs. execution time (c) Foodmart: MinUtil vs. candidates (d) Foodmart: MinUtil vs. memory usage (e) Chess: MinUtil vs. candidates (f) Chess: MinUtil vs. memory usage
5. Conclusion
Traditional Frequent Itemset Mining algorithms are designed to extract frequent itemsets from transactional databases. FIM does not find patterns that are profitable but infrequent. High Utility Itemset Mining was introduced to derive profitable itemsets irrespective of their frequency. Methods for deriving such patterns are discussed extensively in this paper, including the terminology, the various methods with their merits and demerits, and variations of the HUIM problem statement. In the experiments, both sparse and dense datasets were used to evaluate the performance of the FHM, Two-Phase and HUI-Miner algorithms in terms of execution time, memory usage and number of candidates. Although the techniques suffer from a huge search space and long execution times, they attract researchers because of their variety of applications. This survey presents various approaches to HUIM and their applications.
References
[1] Ahmed C.F., Tanbeer S.K., Jeong B.S., Lee Y.K., Efficient tree structures for high utility pattern mining in incremental databases, IEEE Transactions on Knowledge and Data Engineering 21(12) (2009), 1708–1721.
[2] Chan R., Yang Q., Shen Y.D., Mining high utility itemsets, Proceedings of the IEEE International Conference on Data Mining, Melbourne, FL (2003), 19–26.
[3] Fournier-Viger P., Lin J.C.W., Gueniche T., Barhate P., Efficient incremental high utility itemset mining, Proceedings of the ASE BigData & Social Informatics, Kaohsiung, Taiwan (2015).
[4] Fournier-Viger P., Lin J.C.W., Vo B., Chi T.T., Zhang J., Le H.B., A survey of itemset mining, WIREs: Data Mining and Knowledge Discovery 7(4) (2017).
[5] Fournier-Viger P., Wu C.W., Zida S., Tseng V.S., FHM: Faster high-utility itemset mining using estimated utility co-occurrence pruning, Proceedings of the International symposium on methodologies for intelligent systems, Roskilde, Denmark, (2014), 83–92.
[6] Han J., Pei J., Yin Y., Mao R., Mining frequent patterns without candidate generation: A frequent-pattern tree approach, Data Mining and Knowledge Discovery 8(1) (2004), 53–87.
[7] Hong T.P., Lin K.Y., Chien B.C., Mining fuzzy multiple-level association rules from quantitative data, Applied Intelligence 18(1) (2003), 79–90.
[8] Krishnamoorthy S., Pruning strategies for mining high utility itemsets, Expert Systems with Applications 42(5) (2015), 2371–2381.
[9] Lan G.C., Hong T.P., Huang J.P., Tseng V.S., On-shelf utility mining with negative item values, Expert Systems with Applications 41(7) (2014), 3450–3459.
[10] Lin C.W., Hong T.P., Lu W.H., An effective tree structure for mining high utility itemsets, Expert Systems with Applications 38(6) (2011), 7419–7424.
[11] Liu M., Qu J., Mining high utility itemsets without candidate generation, Proceedings of the ACM International Conference on Information and Knowledge Management (2012), 55–64.
[12] Mai T., Vo B., Nguyen L.T.T., A lattice-based approach for mining high utility association rules, Information Sciences 399 (2017), 81–97.
[13] Pei J., Han J., Constrained frequent pattern mining: A pattern-growth view, ACM SIGKDD Explorations Newsletter 4(1) (2002), 31–39.
[14] Tseng V.S., Wu C.W., Shie B.E., Yu P.S., UP-Growth: An efficient algorithm for high utility itemset mining, Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2010), 253–262.
[15] Yao H., Hamilton H.J., Mining itemset utilities from transaction databases, Data & Knowledge Engineering 59(3) (2006), 603–626.
[16] Yao H., Hamilton H.J., Butz C.J., A foundational approach to mining itemset utilities from databases, Proceedings of the SIAM International Conference on Data Mining (2004), 211–225.
[17] Yun U., Ryang H., Lee G., Fujita H., An efficient algorithm for mining high utility patterns from incremental databases with one database scan, Knowledge-Based Systems 124 (2017), 188–206.