


Expert Systems with Applications 40 (2013) 1256–1264


A new method for mining Frequent Weighted Itemsets based on WIT-trees

Bay Vo a, Frans Coenen b, Bac Le c,*

a Department of Computer Science, Information Technology College, Ho Chi Minh, Viet Nam
b Department of Computer Science, University of Liverpool, UK
c Department of Computer Science, University of Science, Ho Chi Minh, Viet Nam


Keywords: Data mining, Frequent Weighted Itemset, Frequent weighted support, WIT-trees, WIT-FWI

0957-4174/$ - see front matter © 2012 Elsevier Ltd. All rights reserved.
http://dx.doi.org/10.1016/j.eswa.2012.08.065

* Corresponding author.
E-mail addresses: [email protected] (B. Vo), Coenen@liverpool.ac.uk (F. Coenen), lhbac@fit.hcmus.edu.vn (B. Le).

The mining of frequent itemsets plays an important role in the mining of association rules. Frequent itemsets are typically mined from binary databases, although in practice each item in a transaction may have a different significance. Mining Frequent Weighted Itemsets (FWI) from weighted item transaction databases addresses this issue. This paper therefore proposes algorithms for the fast mining of FWI from weighted item transaction databases. Firstly, an algorithm for directly mining FWI using WIT-trees is presented. After that, some theorems are developed concerning the fast mining of FWI. Based on these theorems, an advanced algorithm for mining FWI is proposed. Finally, a Diffset strategy for the efficient computation of the weighted support of itemsets is described, and an algorithm for mining FWI using Diffsets is presented. A complete evaluation of the proposed algorithms is also presented.


1. Introduction

Association Rule Mining (ARM) is an important element within the domain of Knowledge Discovery in Data (KDD) (Agrawal & Srikant, 1994; Vo, Hong, & Le, 2012). ARM is used to identify relationships amongst items in transaction databases. Given a set of items I = {i1, i2, ..., in}, a transaction is defined as a subset of I. The input to an ARM algorithm is a dataset D comprising a set of transactions. Given an itemset X ⊆ I, the support of X in D, denoted σ(X), is the number of transactions in D which contain X. An itemset is described as being frequent if its support is larger than or equal to a user-supplied minimum support threshold (minSup). A "classical" Association Rule (AR) is an expression of the form X → Y (sup, conf), where X, Y ⊆ I and X ∩ Y = ∅. The support of this rule is sup = σ(XY) and the confidence is conf = σ(XY)/σ(X). Given a specific minSup and a minimum confidence threshold (minConf), we want to mine all association rules whose support and confidence exceed minSup and minConf respectively.
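As a concrete illustration of these definitions, the following sketch (the toy data and helper names are ours, not the paper's) computes support and confidence for a small binary database:

```python
# Illustrative sketch of binary support and confidence; toy data, not from the paper.
D = [{'bread', 'milk'}, {'bread'}, {'milk', 'eggs'}, {'bread', 'milk', 'eggs'}]

def support(itemset):
    # sigma(X): number of transactions containing X
    return sum(1 for t in D if itemset <= t)

def confidence(X, Y):
    # conf(X -> Y) = sigma(XY) / sigma(X)
    return support(X | Y) / support(X)

print(support({'bread', 'milk'}))       # 2
print(confidence({'bread'}, {'milk'}))  # 2/3
```

An itemset such as {bread, milk} is frequent here whenever minSup ≤ 2.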

However, "classical" ARM does not take into consideration the relative benefit or significance of items. With respect to some applications, we are interested in the relative benefit (weighted value) associated with each item. For example, the sale of a loaf of bread may incur a profit of 20 cents while a bottle of milk might realize a profit of 40 cents. It is thus desirable to identify new methods for applying ARM techniques to this kind of data so that such relative benefits are taken into account.



Ramkumar, Ranka, and Tsur (1998) (see also Cai, Fu, Cheng, & Kwong, 1998) proposed a model for describing the concept of Weighted Association Rules (WAR) and presented an Apriori-based algorithm for mining Frequent Weighted Itemsets (FWI). Since then many Weighted Association Rule Mining (WARM) techniques have been proposed (see for example Wang, Yang, and Yu (2000) and Tao, Murtagh, and Farid (2003)).

The purpose of this paper is to develop algorithms for the fast mining of FWI. Firstly, some theorems and a corollary are proposed. Based on these theorems, and using WIT-trees (Le, Nguyen, Cao, & Vo, 2009; Le, Nguyen, & Vo, 2011), we present an algorithm for the fast mining of FWI. After that, we apply the Diffset strategy (Zaki & Gouda, 2003; Zaki & Hsiao, 2005) and extend the originally proposed FWI mining algorithm.

The rest of this paper is organized as follows. Section 2 presents some related work concerning the mining of FWI and WAR. Section 3 presents a proposed modification of the WIT-tree data structure (Le et al., 2009, 2011) for compressing the database into a tree structure. Algorithms for mining FWI using WIT-trees are discussed in Section 4. Some experimental results are presented in Section 5, and some conclusions in Section 6.

2. Related work

This section presents some related work. The section commences with a formal definition of weighted transaction databases. The Galois connection, used later in this paper to prove a number of theorems, is then reviewed in Section 2.2. Next, in Section 2.3, some definitions related to weighted association rules are presented.


Table 3. Transaction weights for transactions in Table 1.

Transaction   tw
1             0.45
2             0.2
3             0.45
4             0.3
5             0.42
6             0.43
Sum           2.25


2.1. Weighted items transaction databases

A weighted transaction database (D) is defined as follows: D comprises a set of transactions T = {t1, t2, ..., tm}, a set of items I = {i1, i2, ..., in} and a set of positive weights W = {w1, w2, ..., wn} corresponding to the items in I.

For example, consider the data presented in Tables 1 and 2. Table 1 presents a dataset comprising six transactions T = {t1, ..., t6} and five items I = {A, B, C, D, E}. The weights of these items are presented in Table 2: W = {0.6, 0.1, 0.3, 0.9, 0.2}.

2.2. Galois connection

Let δ ⊆ I × T be a binary relation, where I is a set of items and T is a set of transactions contained in a database D. Let P(S) (the power set of S) include all subsets of S. Two mappings between P(I) and P(T), defined as follows, are called a Galois connection (Zaki, 2004).

Let X ⊆ I and Y ⊆ T. We have:

I. t: P(I) → P(T), t(X) = {y ∈ T | ∀x ∈ X, xδy}
II. i: P(T) → P(I), i(Y) = {x ∈ I | ∀y ∈ Y, xδy}

The mapping t(X) gives the set of transactions in the database which contain X, and the mapping i(Y) gives the itemset that is contained in all the transactions of Y.

Given X, X1, X2 ∈ P(I) and Y, Y1, Y2 ∈ P(T), the Galois connection satisfies the following properties (Zaki, 2004):

(i) X1 ⊆ X2 ⇒ t(X1) ⊇ t(X2)
(ii) Y1 ⊆ Y2 ⇒ i(Y1) ⊇ i(Y2)
(iii) X ⊆ i(t(X)) and Y ⊆ t(i(Y))
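The two mappings and property (i) can be illustrated with a short sketch over the database of Table 1 (the function names t and i follow the definitions above; the code itself is ours):

```python
# Sketch of the Galois mappings t(.) and i(.) for the database of Table 1.
DB = {1: {'A','B','D','E'}, 2: {'B','C','E'}, 3: {'A','B','D','E'},
      4: {'A','B','C','E'}, 5: {'A','B','C','D','E'}, 6: {'B','C','D'}}

def t(X):
    # t(X): tids of the transactions containing every item of X
    return {tid for tid, items in DB.items() if X <= items}

def i(Y):
    # i(Y): items contained in every transaction of Y (Y assumed non-empty here)
    return set.intersection(*(DB[tid] for tid in Y))

# Property (i): {B} ⊆ {B, D} implies t({B}) ⊇ t({B, D})
print(t({'B'}))                    # {1, 2, 3, 4, 5, 6}
print(t({'B', 'D'}))               # {1, 3, 5, 6}
print(t({'B'}) >= t({'B', 'D'}))   # True
```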

2.3. Mining frequent weighted itemsets

Definition 2.1. The transaction weight (tw) of a transaction t_k is defined as follows:

tw(t_k) = (Σ_{i_j ∈ t_k} w_j) / |t_k|    (2.1)

Definition 2.2. The weighted support of an itemset X is defined as follows:

ws(X) = (Σ_{t_k ∈ t(X)} tw(t_k)) / (Σ_{t_k ∈ T} tw(t_k))    (2.2)

where T is the set of transactions in the database.

Table 1. The transaction database.

Transaction   Bought items
1             A, B, D, E
2             B, C, E
3             A, B, D, E
4             A, B, C, E
5             A, B, C, D, E
6             B, C, D

Table 2. Item weights.

Item   Weight
A      0.6
B      0.1
C      0.3
D      0.9
E      0.2

Example 2.1. Considering Tables 1 and 2 and Definition 2.1, we can compute tw(t_1) as follows:

tw(t_1) = (0.6 + 0.1 + 0.9 + 0.2) / 4 = 0.45

Table 3 shows the tw values of all transactions in Table 1.

From Tables 1 and 3 and Definition 2.2, we can compute ws(BD) as follows. Because BD appears in transactions {1, 3, 5, 6}:

ws(BD) = (0.45 + 0.45 + 0.42 + 0.43) / 2.25 ≈ 0.78

The mining of FWI requires the identification of all itemsets whose weighted support satisfies a user-specified minimum weighted support threshold (minws), i.e. FWI = {X ⊆ I | ws(X) ≥ minws}.
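Definitions 2.1 and 2.2 can be checked against the running example with a short sketch (the helper names are illustrative; the data is that of Tables 1 and 2):

```python
# Sketch of Eqs. (2.1) and (2.2) on the database of Tables 1 and 2.
DB = {1: {'A','B','D','E'}, 2: {'B','C','E'}, 3: {'A','B','D','E'},
      4: {'A','B','C','E'}, 5: {'A','B','C','D','E'}, 6: {'B','C','D'}}
W = {'A': 0.6, 'B': 0.1, 'C': 0.3, 'D': 0.9, 'E': 0.2}

def tw(t):
    # Eq. (2.1): mean weight of the items in transaction t
    items = DB[t]
    return sum(W[i] for i in items) / len(items)

TOTAL = sum(tw(t) for t in DB)   # ~2.25, the "Sum" row of Table 3

def ws(itemset):
    # Eq. (2.2): tw summed over transactions containing the itemset,
    # normalised by the tw sum over all transactions
    tids = [t for t, items in DB.items() if itemset <= items]
    return sum(tw(t) for t in tids) / TOTAL

print(round(tw(1), 2))            # 0.45, as in Example 2.1
print(round(ws({'B', 'D'}), 2))   # 0.78, as computed above
```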

Theorem 2.1. The weighted support metric described above satisfies the downward closure property, i.e., if X ⊆ Y then ws(X) ≥ ws(Y).

Proof. Because X ⊆ Y, property (i) of the Galois connection gives t(X) ⊇ t(Y). Therefore

Σ_{t_k ∈ t(X)} tw(t_k) ≥ Σ_{t_k ∈ t(Y)} tw(t_k)

and, dividing both sides by Σ_{t_k ∈ T} tw(t_k),

(Σ_{t_k ∈ t(X)} tw(t_k)) / (Σ_{t_k ∈ T} tw(t_k)) ≥ (Σ_{t_k ∈ t(Y)} tw(t_k)) / (Σ_{t_k ∈ T} tw(t_k))

⇒ ws(X) ≥ ws(Y). □

To mine WAR, we must first mine all FWI that satisfy the minimum weighted support threshold. The mining of FWI is the most computationally expensive element of WAR mining. Ramkumar et al. (1998) proposed an Apriori-based algorithm for mining FWI. This approach requires many scans of the whole database to determine the weighted support of itemsets. Some other studies have used this approach for generating WAR (Tao et al., 2003; Wang et al., 2000).

3. WIT-tree data structure

We previously proposed the WIT-tree (Weighted Itemset-Tidset tree) data structure, an expansion of the IT-tree proposed in Zaki and Hsiao (2005), to support the mining of high utility itemsets. The WIT-tree data structure provides a representation of the input data (so that we need scan the database only once), comprising itemset TID lists, that supports the fast computation of weighted support values. Each node in a WIT-tree includes three fields:

I. X: an itemset.
II. t(X): the set of transactions that contain X.
III. ws: the weighted support of X.
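A minimal sketch of such a node follows; the class and field names are assumptions for illustration (the paper specifies only the three abstract fields), and a children list is added to hold the equivalence-class structure:

```python
# Illustrative WIT-tree node with the three fields listed above.
from dataclasses import dataclass, field
from typing import FrozenSet, List, Tuple

@dataclass
class WITNode:
    itemset: Tuple[str, ...]               # X
    tids: FrozenSet[int]                   # t(X): transactions containing X
    ws: float                              # weighted support of X
    children: List['WITNode'] = field(default_factory=list)

root = WITNode((), frozenset(), 1.0)       # root: the empty prefix
root.children.append(WITNode(('A',), frozenset({1, 3, 4, 5}), 0.72))
print(root.children[0].ws)                 # 0.72
```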



A node is denoted by a tuple of the form <X, t(X), ws>. The value of ws is computed by summing the tw values of the transactions whose tids belong to t(X), and then dividing this sum by the sum of all tw values; the computation of ws is thus founded on the Tidset. Links connect nodes at the kth level (called X) with nodes at the (k + 1)th level (called Y).

Definition 3.1 (Zaki & Hsiao, 2005). The equivalence class.

Let I be a set of items and X ⊆ I. Define p(X, k) = X[1:k] as the k-length prefix of X, and define a prefix-based equivalence relation θ_k on itemsets as follows: ∀X, Y ⊆ I, X ≡_{θk} Y ⇔ p(X, k) = p(Y, k).

The set of all itemsets having the same prefix X is called an equivalence class, and is denoted [X].

Example 3.1. Considering Tables 1 and 3 above, the associated WIT-tree for mining frequent weighted itemsets is presented in Fig. 1.

The root node of the WIT-tree contains all 1-itemset nodes. All nodes in level 1 belong to the same equivalence class, with prefix {} (or [∅]). Each node in level 1 becomes a new equivalence class using its item as the prefix. Each node within a class joins with all the nodes following it to create a new equivalence class, and this process is applied recursively to create new equivalence classes at higher levels. For example, considering Fig. 1, nodes {A}, {B}, {C}, {D}, {E} belong to the equivalence class [∅]. Node {A} joins with all nodes following it ({B}, {C}, {D}, {E}) to create a new equivalence class [A] = {{AB}, {AC}, {AD}, {AE}}. [AB] in turn becomes a new equivalence class by joining with all nodes following it ({AC}, {AD}, {AE}); and so on.

Inspection of Fig. 1 indicates that all itemsets satisfy the downward closure property. Thus, we can prune an equivalence class in the WIT-tree if its ws value does not satisfy minws. For example, suppose that minws = 0.4; because ws(ABC) = 0.32 < minws, we can prune the equivalence class with the prefix ABC, i.e., all child nodes of ABC can be pruned.

Fig. 1. Search tree using WIT-tree.

4. Mining frequent weighted itemsets

In this section, we propose algorithms for mining FWI from weighted transaction databases. Firstly, an algorithm for directly mining FWI from WIT-trees is presented; it uses a minws threshold and the downward closure property to prune nodes that are not frequent. Some theorems are then derived and, based on these theorems, an improved algorithm is proposed. Finally, the algorithm is further developed by adopting a Diffset strategy to allow for the fast computation of the weighted support of itemsets in a memory-efficient manner.

4.1. WIT-FWI algorithm

In this sub-section, an algorithm for mining FWI using WIT-trees is presented. It is founded on the downward closure property, which is used to prune nodes that are not frequent.

In the function WIT-FWI (Fig. 2), Lr contains all single items whose weighted supports satisfy the minimum weighted support threshold (line 1). The nodes in Lr are stored in increasing order of their weighted support (line 2). After that, the set FWI is set to null (line 3). Finally, the function FWI-EXTEND is called to mine all FWI (line 4).

Consider the function FWI-EXTEND. This function considers each node li in Lr together with all the nodes following it to create a set of nodes Li (lines 5 and 7). Li is created as follows: firstly, let X = li.itemset ∪ lj.itemset and compute Y = t(X) = t(li) ∩ t(lj) (line 8); if ws(X) (computed from t(X) using Eq. (2.2), line 9) satisfies minws (line 10), a new node <X, Y, ws(X)> is added to Li (line 11). After Li is created, the function FWI-EXTEND is called recursively with input Li (line 13) if the number of nodes in Li is greater than 1. The function COMPUTE-WS(Y) uses Eq. (2.2) to compute the ws of itemset X based on the values in Table 3, with Y = t(X) (line 12).
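The procedure just described can be sketched in Python as follows. This is an illustrative reconstruction from the textual description of Fig. 2, not the authors' code; the helper names are assumed, and frequent 1-itemsets are collected up front rather than inside the recursion:

```python
# Sketch of WIT-FWI (Fig. 2) on the example database of Tables 1 and 2.
DB = {1: {'A','B','D','E'}, 2: {'B','C','E'}, 3: {'A','B','D','E'},
      4: {'A','B','C','E'}, 5: {'A','B','C','D','E'}, 6: {'B','C','D'}}
W = {'A': 0.6, 'B': 0.1, 'C': 0.3, 'D': 0.9, 'E': 0.2}
TW = {t: sum(W[i] for i in items) / len(items) for t, items in DB.items()}
TOTAL = sum(TW.values())

def ws_of_tids(tids):
    # COMPUTE-WS: Eq. (2.2) evaluated from a Tidset
    return sum(TW[t] for t in tids) / TOTAL

def fwi_extend(Lr, fwi, minws):
    # lines 5-13: join each node li with every node following it
    for idx, (X_i, t_i, _) in enumerate(Lr):
        Li = []
        for X_j, t_j, _ in Lr[idx + 1:]:
            X = X_i + X_j[-1:]            # union of itemsets sharing a prefix
            Y = t_i & t_j                 # line 8: t(X) = t(li) ∩ t(lj)
            w = ws_of_tids(Y)             # line 9
            if w >= minws:                # line 10
                Li.append((X, Y, w))      # line 11: add <X, Y, ws(X)>
                fwi.append(X)
        if len(Li) > 1:                   # line 13: recurse on the new class
            fwi_extend(Li, fwi, minws)

def wit_fwi(minws):
    Lr = []
    for i in sorted(W):                   # line 1: frequent single items
        tids = frozenset(t for t, its in DB.items() if i in its)
        w = ws_of_tids(tids)
        if w >= minws:
            Lr.append(((i,), tids, w))
    Lr.sort(key=lambda n: n[2])           # line 2: increasing order of ws
    fwi = [n[0] for n in Lr]              # collect FWI, starting with 1-itemsets
    fwi_extend(Lr, fwi, minws)            # line 4
    return fwi

print(len(wit_fwi(0.4)))                  # 19
```

With minws = 0.4, the sketch produces the 19 FWI listed in Example 4.1 below.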

Example 4.1. Using the data presented in Tables 1 and 3, and withreference to Fig. 2, an illustration of the WIT-FWI algorithm withminws = 0.4 is as follows:



Fig. 2. WIT-FWI algorithm for mining frequent weighted itemsets.


We commence by computing the ws values of the single items. We have ws(A) = 0.72, ws(B) = 1.0, ws(C) = 0.6, ws(D) = 0.78, ws(E) = 0.81. All ws values of single items satisfy minws ⇒ Lr = {<A,1345,0.72>, <B,123456,1.0>, <C,2456,0.6>, <D,1356,0.78>, <E,12345,0.81>}. After sorting them in increasing order of their weighted supports, we have Lr = {<C,2456,0.6>, <A,1345,0.72>, <D,1356,0.78>, <E,12345,0.81>, <B,123456,1.0>}.

Consider node <A,1345,0.72>: first of all, A is added to FWI = {C, CE, CEB, CB} ⇒ FWI = {C, CE, CEB, CB, A}.

- A joins D: we have a new itemset AD with t(AD) = 135 and ws(AD) = 0.59 ≥ minws, so we add AD into LA ⇒ LA = {<AD,135,0.59>}.

- A joins E: we have a new itemset AE with t(AE) = 1345 and ws(AE) = 0.72 ≥ minws ⇒ add AE into LA ⇒ LA = {<AD,135,0.59>, <AE,1345,0.72>}.

- A joins B: we have a new itemset AB with t(AB) = 1345 and ws(AB) = 0.72 ≥ minws ⇒ add AB into LA ⇒ LA = {<AD,135,0.59>, <AE,1345,0.72>, <AB,1345,0.72>}.

After creating the set LA, and because the number of nodes in LA is larger than 1, the function is called recursively to create all child nodes of LA.

- Consider node <AD,135,0.59>:

Add AD to FWI ⇒ FWI = {C, CE, CEB, CB, A, AD}.
AD joins AE: we have a new itemset ADE with t(ADE) = 135 and ws(ADE) = 0.59, so add ADE to [AD] ⇒ [AD] = {ADE}.
AD joins AB: we have a new itemset ADB with t(ADB) = 135 and ws(ADB) = 0.59, so add ADB to [AD] ⇒ [AD] = {ADE, ADB}.

- Consider node <ADE,135,0.59>: add ADE to FWI ⇒ FWI = {C, CE, CEB, CB, A, AD, ADE}.
- ADE joins ADB to give a new itemset ADEB with t(ADEB) = 135 and ws(ADEB) = 0.59 ⇒ [ADE] = {ADEB}. Because the equivalence class [ADE] has only one element, no further equivalence class is created from it.

The process is similar for nodes <C,2456,0.6>, <D,1356,0.78>, <E,12345,0.81> and <B,123456,1.0>.

Finally, the set of all FWI that satisfy minws = 0.4 is FWI = {C, CE, CEB, CB, A, AD, ADE, ADEB, ADB, AE, AEB, AB, D, DE, DEB, DB, E, EB, B}, as shown in Fig. 3.

4.2. An improved algorithm

From Fig. 2, we can see that for some nodes we need not compute the weighted support because it can be obtained from the parent nodes. For example, nodes AE, AB and AEB have the same weighted support as node A; nodes ADE, ADB and ADEB have the same weighted support as node AD; and so on.

Theorem 4.1. Given two itemsets X and Y, if t(X) = t(Y) then ws(X) = ws(Y).

Proof. Because t(X) = t(Y), we have Σ_{t_k ∈ t(X)} tw(t_k) = Σ_{t_k ∈ t(Y)} tw(t_k). Dividing both sides by Σ_{t_k ∈ T} tw(t_k) gives

(Σ_{t_k ∈ t(X)} tw(t_k)) / (Σ_{t_k ∈ T} tw(t_k)) = (Σ_{t_k ∈ t(Y)} tw(t_k)) / (Σ_{t_k ∈ T} tw(t_k))

or ws(X) = ws(Y). □

Corollary 4.1. If X ⊆ Y and |t(X)| = |t(Y)| then ws(X) = ws(Y).

Proof. If X ⊆ Y, we have t(X) ⊇ t(Y) (according to property (i) of the Galois connection). Because, in addition, |t(X)| = |t(Y)|, it follows that t(X) = t(Y) ⇒ ws(X) = ws(Y), according to Theorem 4.1. □

Based on Corollary 4.1, we developed an algorithm for mining FWI by introducing some modifications to the algorithm presented in Section 4.1. When we join two nodes li, lj of Lr to create a new node li ∪ lj, if |t(li)| = |t(li ∪ lj)| then ws(li ∪ lj) = ws(li), so we need not compute the ws value of li ∪ lj. Similarly, if |t(lj)| = |t(li ∪ lj)| then ws(li ∪ lj) = ws(lj), and again we need not compute the ws value of li ∪ lj.

The algorithm in Fig. 2 is modified as follows: line 9 of Fig. 2 is replaced by three lines (lines 9 to 11 in Fig. 4).


Fig. 3. WIT-tree for mining FWI from the databases in Tables 1 and 3 with minws = 0.4.


In line 9, we have li.itemset ⊂ X; according to Corollary 4.1, if |t(li)| = |Y| then ws(X) = ws(li). Similarly, if |t(lj)| = |Y| then ws(X) = ws(lj) (line 11).

Example 4.2. Consider the WIT-tree presented in Fig. 3. When we join node <A,1345,0.72> with node <B,123456,1.0> to create the new node <AB,1345,?>, because |t(A)| = |1345| = |t(AB)| ⇒ ws(AB) = ws(A) = 0.72. Similarly, |t(C)| = |2456| = |t(BC)| ⇒ ws(BC) = ws(C) = 0.6. On the other hand, because |t(A)| ≠ |t(AD)| and |t(D)| ≠ |t(AD)|, we must compute ws(AD) (based on t(AD)). According to Corollary 4.1, we need not compute the ws values of 11 itemsets, namely {AE, AB, CB, DB, EB, ADE, ADB, AEB, CEB, DEB, ADEB} (see Fig. 3 for more details).
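The modified join step can be sketched as follows (the function name and signature are assumptions for illustration):

```python
# Sketch of the Corollary 4.1 shortcut inside the join step (Fig. 4, lines 9-11).
def join_ws(t_i, ws_i, t_j, ws_j, compute_ws):
    """Join two nodes of one equivalence class; return (t(X), ws(X))."""
    Y = t_i & t_j                  # t(X) = t(li) ∩ t(lj)
    if len(Y) == len(t_i):         # |t(li)| = |t(X)| ⇒ ws(X) = ws(li)
        return Y, ws_i
    if len(Y) == len(t_j):         # |t(lj)| = |t(X)| ⇒ ws(X) = ws(lj)
        return Y, ws_j
    return Y, compute_ws(Y)        # otherwise fall back to Eq. (2.2)

# A joins B from Example 4.2: t(A) = t(AB), so ws(AB) = ws(A) with no computation
tids, w = join_ws(frozenset({1, 3, 4, 5}), 0.72,
                  frozenset({1, 2, 3, 4, 5, 6}), 1.0,
                  lambda Y: None)  # never called on this branch
print(w)                           # 0.72
```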

Fig. 4. The modification of the WIT-FWI algorithm for mining frequent weighted itemsets.

4.3. Diffset for computing ws values fast and saving memory

Zaki and Gouda (2003) proposed the Diffset strategy for the fast computation of itemset supports and to save the memory used to store Tidsets. We recognize that it can also be used for the fast computation of the ws values of itemsets. Diffset computes the difference set between two Tidsets in the same equivalence class. In a dense database, the size of the Diffset is smaller than that of the Tidset (Zaki & Gouda, 2003; Zaki & Hsiao, 2005). Therefore, using Diffsets consumes less storage and allows for the fast computation of weighted support values.

Let d(PXY) be the difference set between the Tidsets of PX and PY. We have (Zaki & Gouda, 2003):

d(PXY) = t(PX) \ t(PY)    (4.1)




where PX and PY are in the equivalence class [P].

Assume that we have d(PX) and d(PY) and need to obtain d(PXY). According to the results in Zaki and Gouda (2003), we can obtain it easily by computing the difference set between d(PY) and d(PX):

d(PXY) = d(PY) \ d(PX)    (4.2)

Based on Eqs. (4.1) and (4.2), we can compute the ws value of PXY using d(PXY) as follows:

ws(PXY) = ws(PX) − (Σ_{t ∈ d(PXY)} tw(t)) / (Σ_{t ∈ T} tw(t))    (4.3)

Proof. We have t(PXY) = t(PX) ∩ t(PY) = t(PX) \ [t(PX) \ t(PY)] = t(PX) \ d(PXY). Therefore

ws(PXY) = (Σ_{t_k ∈ t(PXY)} tw(t_k)) / (Σ_{t_k ∈ T} tw(t_k))
        = (Σ_{t_k ∈ t(PX)} tw(t_k) − Σ_{t_k ∈ d(PXY)} tw(t_k)) / (Σ_{t_k ∈ T} tw(t_k))
        = ws(PX) − (Σ_{t_k ∈ d(PXY)} tw(t_k)) / (Σ_{t_k ∈ T} tw(t_k)). □

Based on Eqs. (4.1)–(4.3), we can use the Diffset strategy instead of Tidsets to compute the ws values of itemsets in the process of mining FWI.

Theorem 4.2. If d(PXY) = ∅ then ws(PXY) = ws(PX).

Proof. Because d(PXY) = ∅, the sum Σ_{t ∈ d(PXY)} tw(t) is 0, so by Eq. (4.3) ws(PXY) = ws(PX) − 0 = ws(PX). □

To save the memory used for storing Diffsets and the time spent computing them, we sort the itemsets in the same equivalence class in increasing order of their ws.

4.3.1. WIT-FWI-DIFF: an algorithm based on Diffset

The WIT-FWI-DIFF algorithm presented in Fig. 5 differs from the WIT-FWI-MODIFY algorithm presented in Fig. 4 in that it uses Diffsets to compute the ws values. Because the first level stores Tidsets, if Lr belongs to the first level (line 9), meaning that li and lj belong to the first level, we use Eq. (4.1) to compute Y = d(X) = d(li ∪ lj) = t(li) \ t(lj) (line 11). From level 2 onwards, Diffsets are stored, so we use Eq. (4.2) to compute Y = d(X) = d(li ∪ lj) = d(lj) \ d(li) (line 13). Line 14 uses Theorem 4.2 to compute ws(X) quickly.

Fig. 5. WIT-FWI-DIFF algorithm for mining frequent weighted itemsets.
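A short sketch of the Diffset-based ws computation (the names are illustrative; the tw values follow Table 3):

```python
# Sketch of Eqs. (4.1) and (4.3) on the running example.
TW = {1: 0.45, 2: 0.2, 3: 0.45, 4: 0.3, 5: 0.42, 6: 0.43}
TOTAL = sum(TW.values())                       # 2.25 (Table 3)

def ws_from_diffset(ws_px, d_pxy):
    # Eq. (4.3): ws(PXY) = ws(PX) - sum of tw over d(PXY), normalised
    return ws_px - sum(TW[t] for t in d_pxy) / TOTAL

t_A, t_C, t_D = {1, 3, 4, 5}, {2, 4, 5, 6}, {1, 3, 5, 6}

d_AC = t_A - t_C                               # Eq. (4.1): {1, 3}
print(round(ws_from_diffset(0.72, d_AC), 2))   # 0.32, below minws = 0.4

d_AD = t_A - t_D                               # {4}
print(round(ws_from_diffset(0.72, d_AD), 2))   # 0.59, at least minws
```

When the Diffset is empty, the subtrahend is 0 and ws is inherited unchanged, which is exactly Theorem 4.2.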

4.3.2. An example of Diffset

Using the example data presented in Tables 1 and 3 and the algorithm in Fig. 5, we illustrate the WIT-FWI-DIFF algorithm with minws = 0.4 as follows. Level 1 of the WIT-tree contains the single items, their tids and their ws, sorted in increasing order of |tids|; the purpose of this is to compute Diffsets faster. For example, consider nodes B and D: if they are not sorted, we must compute d(BD) = t(B) \ t(D) = 123456 \ 1356 = 24; otherwise, d(DB) = t(D) \ t(B) = 1356 \ 123456 = ∅.

Consider node <A,1345,0.72>:

A joins C: d(AC) = t(A) \ t(C) = 1345 \ 2456 = 13 ⇒
ws(AC) = ws(A) − (Σ_{t ∈ d(AC)} tw(t)) / (Σ_{t ∈ T} tw(t)) = 0.72 − (0.45 + 0.45)/2.25 = 0.32 < minws.

A joins D: d(AD) = t(A) \ t(D) = 1345 \ 1356 = 4 ⇒
ws(AD) = ws(A) − (Σ_{t ∈ d(AD)} tw(t)) / (Σ_{t ∈ T} tw(t)) = 0.72 − 0.3/2.25 ≈ 0.59 ≥ minws.


Table 5. Numbers of FWI from databases.

Database    minws (%)   #FWI
BMS-POS     10          12
            8           21
            6           32
            4           85
Chess       85          2624
            80          8088
            75          20,298
            70          47,181
Mushroom    35          1257
            30          2937
            25          5751
            20          53,853
Connect     96          1015
            94          4131
            92          11,315
            90          28,991
Accidents   95          15
            85          65
            75          289
            65          1035

Fig. 6. Results of the algorithm WIT-FWI-DIFF from the databases in Tables 1 and 3 with minws = 0.4.



Fig. 7. Run time for the four algorithms when mining FWI in the BMS-POS database.


A joins E: d(AE) = t(A) \ t(E) = 1345 \ 12345 = ∅ ⇒ ws(AE) = ws(A) = 0.72.

A joins B: d(AB) = t(A) \ t(B) = 1345 \ 123456 = ∅ ⇒ ws(AB) = ws(A) = 0.72.

5. Experimental results

All experiments described below were performed on a Centrino Core 2 Duo machine (2 × 2.53 GHz) with 4 GB of RAM, running Windows 7, using C# 2008. The experimental datasets were downloaded from http://fimi.cs.helsinki.fi/data/. Some statistical information regarding these datasets is given in Table 4. We modified these datasets by creating, for each database, one table storing the weighted values of items (values in the range 1 to 10).

The results presented in Table 5 show that the number of FWI for BMS-POS is small, for Accidents it may be described as medium, and for Chess, Mushroom and Connect it is large. It should be noted that the number of FWI found in the Connect database changes rapidly: when we change minws from 96% down to 90%, the number of FWI changes from 1015 up to 28,991.

Experiments were also conducted to compare the processing time of our three proposed algorithms (WIT-FWI, WIT-FWI-MODIFY, WIT-FWI-DIFF) with an Apriori-based algorithm (Ramkumar et al., 1998) (Apriori). Figs. 7–11 show the recorded run times for each of the selected test datasets.

The experimental results presented in Figs. 7–11 demonstrate that our proposed algorithms are more efficient than the Apriori-based algorithm when mining FWI. When the number of FWI is small, the running time of the algorithms using WIT-trees is only slightly lower than that of the Apriori-based algorithm. For example, consider the BMS-POS database with minws = 4%: the mining time of Apriori is

Fig. 8. Run time for the four algorithms when mining FWI in the Chess database.

Table 4. Experimental databases.

Database (DB)   #Trans    #Items   Remark
BMS-POS         515,597   1656     Modified
Connect         67,557    130      Modified
Accidents       340,183   468      Modified
Chess           3196      76       Modified
Mushroom        8124      120      Modified

14.74 (s), of WIT-FWI is 13.78 (s), of WIT-FWI-MODIFY is 13.49 (s), and of WIT-FWI-DIFF is 12.27 (s). The run time of WIT-FWI-DIFF relative to Apriori is 12.27/14.74 × 100% = 83.24%. However, when we compare the operation of these algorithms using the Chess and Mushroom databases (which



Fig. 11. Run time for the four algorithms when mining FWI in the Accidents database.


Fig. 10. Run time for the four algorithms when mining FWI in the Connect database.


Fig. 9. Run time for the four algorithms when mining FWI in the Mushroom database.


contain a large number of FWI), the WIT-tree-based algorithms were found to be far more efficient than the Apriori-based algorithm. For example, consider the Chess database with minws = 70%: the mining time of Apriori is 111.82 (s), of WIT-FWI is 33.21 (s), of WIT-FWI-MODIFY is 27.85 (s), and of WIT-FWI-DIFF is 0.7 (s). The run time of WIT-FWI-DIFF relative to Apriori is 0.7/111.82 × 100% = 0.63%.

It should also be noted that the WIT-FWI-MODIFY algorithm is not as efficient as WIT-FWI when the number of FWI in the input database is small (for example, in the case of the Accidents database).

6. Conclusions and future work

This paper has presented a method for mining frequent weighted itemsets from weighted item transaction databases, and a number of efficient algorithms have been proposed. From the reported evaluation, the mining (run) time to identify FWI using the proposed WIT-tree-based algorithms is significantly less than the time required using alternatives such as Apriori-based FWI mining algorithms. This is because, using the proposed WIT-tree data structure, the algorithms only scan the database once. The evaluation also indicated that use of the Diffset strategy allows for further efficiency gains.

In this paper, we have concentrated only on the mining of FWI (using the proposed WIT-tree data structure). In recent years some methods for the fast mining of association rules have been discussed (Vo & Le, 2009, 2011a, 2011b). In future, we will study how to apply these methods to efficiently mine weighted association rules from the FWI discovered using our method. Besides, we will apply our method to the mining of weighted utility association rules (Khan, Muyeba, & Coenen, 2008). The mining of association rules in incremental databases has also been considered in recent years (Hong, Lin, & Wu, 2008, 2009; Hong & Wang, 2010; Hong, Wu, & Wang, 2009; Lin et al., 2009). The intention is thus to also consider the concept of mining weighted association rules from such incremental databases.

Acknowledgment

This work was supported by Vietnam's National Foundation for Science and Technology Development (NAFOSTED) under Grant Number 102.01-2012.47.

References

Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules. In VLDB'94 (pp. 487–499).

Cai, C. H., Fu, A. W., Cheng, C. H., & Kwong, W. W. (1998). Mining association rules with weighted items. In Proceedings of the International Database Engineering and Applications Symposium (IDEAS 98) (pp. 68–77).

Hong, T. P., Lin, C. W., & Wu, Y. L. (2008). Incrementally fast updated frequent pattern trees. Expert Systems with Applications, 34, 2424–2435.

Hong, T. P., Lin, C. W., & Wu, Y. L. (2009b). Maintenance of fast updated frequent pattern trees for record deletion. Computational Statistics and Data Analysis, 53, 2485–2499.

Hong, T. P., & Wang, C. J. (2010). An efficient and effective association-rule maintenance algorithm for record modification. Expert Systems with Applications, 37, 618–626.

Hong, T. P., Wu, Y. Y., & Wang, S. L. (2009a). An effective mining approach for up-to-date patterns. Expert Systems with Applications, 36, 9747–9752.

http://fimi.cs.helsinki.fi/data/ (downloaded April 2005).

Khan, M. S., Muyeba, M., & Coenen, F. (2008). A weighted utility framework for mining association rules. In Proceedings of the Second UKSim European Symposium on Computer Modeling and Simulation (pp. 87–92).

Le, B., Nguyen, H., Cao, T. A., & Vo, B. (2009). A novel algorithm for mining high utility itemsets. In The First Asian Conference on Intelligent Information and Database Systems (pp. 13–16). IEEE.

Le, B., Nguyen, H., & Vo, B. (2011). An efficient strategy for mining high utility itemsets. International Journal of Intelligent Information and Database Systems, 5(2), 164–176.

Lin, C. W., Hong, T. P., & Lu, W. H. (2009). The Pre-FUFP algorithm for incremental mining. Expert Systems with Applications, 36(5), 9498–9505.

Ramkumar, G. D., Ranka, S., & Tsur, S. (1998). Weighted association rules: Model and algorithm. In SIGKDD'98 (pp. 661–666).

Tao, F., Murtagh, F., & Farid, M. (2003). Weighted association rule mining using weighted support and significance framework. In SIGKDD'03 (pp. 661–666).

Vo, B., Hong, T. P., & Le, B. (2012). DBV-Miner: A dynamic bit-vector approach for fast mining frequent closed itemsets. Expert Systems with Applications, 39(8), 7196–7206.



Vo, B., & Le, B. (2009). Mining traditional association rules using frequent itemsets lattice. In The 39th International Conference on Computers & Industrial Engineering (pp. 1401–1406). Troyes, France: IEEE.

Vo, B., & Le, B. (2011a). Mining minimal non-redundant association rules using frequent itemsets lattice. Journal of Intelligent Systems Technology and Applications, 10(1), 92–106.

Vo, B., & Le, B. (2011b). Interestingness measures for association rules: Combination between lattice and hash tables. Expert Systems with Applications, 38(9), 11630–11640.

Wang, W., Yang, J., & Yu, P. S. (2000). Efficient mining of weighted association rules. In SIGKDD 2000 (pp. 270–274).

Zaki, M. J. (2004). Mining non-redundant association rules. Data Mining and Knowledge Discovery, 9(3), 223–248.

Zaki, M. J., & Gouda, K. (2003). Fast vertical mining using diffsets. In SIGKDD'03 (pp. 326–335).

Zaki, M. J., & Hsiao, C. J. (2005). Efficient algorithms for mining closed itemsets and their lattice structure. IEEE Transactions on Knowledge and Data Engineering, 17(4), 462–478.