
Appl Intell, DOI 10.1007/s10489-014-0551-z

An effective approach for maintenance of pre-large-based frequent-itemset lattice in incremental mining

Bay Vo · Tuong Le · Tzung-Pei Hong · Bac Le

© Springer Science+Business Media New York 2014

Abstract Incremental mining has attracted the attention of many researchers due to its usefulness in online applications. Many algorithms have thus been proposed for incrementally mining frequent itemsets. Maintaining a frequent-itemset lattice (FIL) is difficult for databases with large numbers of frequent itemsets, especially huge databases, due to the storage of links of nodes in the lattice. However, generating association rules from a FIL has been shown to be more effective than traditional methods such as directly generating rules from frequent itemsets or frequent closed itemsets. Therefore, when the number of frequent itemsets is not huge (i.e., they can be stored in the lattice without excessive memory overhead), the lattice-based approach outperforms approaches which mine association rules from frequent itemsets/frequent closed itemsets. However, incremental algorithms for building FILs have not yet been proposed. This paper proposes an effective approach for the maintenance of a FIL based on the pre-large concept in incremental mining. The building process of a FIL is first improved using two proposed theorems regarding the paternity relation between two nodes in the lattice. An effective approach for maintaining a FIL with dynamically inserted data is then proposed based on the pre-large and the diffset concepts. The experimental results show that the proposed approach outperforms the batch approach for building a FIL in terms of execution time.

B. Vo (✉) · T. Le
Division of Data Science, Ton Duc Thang University, Ho Chi Minh, Vietnam

B. Vo · T. Le
Faculty of Information Technology, Ton Duc Thang University, Ho Chi Minh, Vietnam

T.-P. Hong
Department of Computer Science and Information Engineering, National University of Kaohsiung, Kaohsiung, Taiwan, Republic of China

T.-P. Hong
Department of Computer Science and Engineering, National Sun Yat-sen University, Kaohsiung, Taiwan, Republic of China

B. Le
Department of Computer Science, University of Science, Ho Chi Minh, Vietnam

Keywords Data mining · Frequent itemsets · Lattice · Incremental mining · Pre-large concept

1 Introduction

Association rule mining is an important problem in data mining [1, 2, 7–9, 16, 19–22, 26, 28–30]. For mining association rules, existing algorithms are often divided into two distinct phases: (i) mining frequent (closed) itemsets and (ii) generating the association rules from these frequent (closed) itemsets. Phase (ii) is easily implemented but requires a lot of processing time.

Recently, an approach for quickly generating association rules using a frequent-(closed)-itemset lattice (FIL/FCIL) has been proposed [23–25]. The process of building an FIL/FCIL requires more time and memory (due to storage of links of nodes in the lattice) than those of the traditional approach, which directly mines frequent (closed) itemsets. However, generating association rules from a FIL/FCIL is more efficient than doing so from frequent (closed) itemsets [25] or from frequent itemsets using hash tables [23, 24]. For example, the present authors previously showed [24] that for the Mushroom database with minSup = 20 %, the number of frequent itemsets is 53,583 and the number of association rules is 19,191,656 (with minConf = 0 %). The runtime for mining frequent itemsets and association rules from these itemsets is 80.83 s whereas that for FIL building and mining association rules from the FIL is 14.13 s. Similarly, for the Chess database with minSup = 65 %, the number of frequent itemsets is 111,239 and the number of association rules is 26,238,988 (with minConf = 0 %). The runtime for mining frequent itemsets and association rules from these itemsets is 326.12 s whereas that for FIL building and mining the association rules from the FIL is 35.66 s. Therefore, mining association rules based on a FIL/FCIL is more efficient than the traditional approach when both phases of mining are considered. However, for huge databases with large numbers of association rules, the lattice-based approach for mining association rules is not the best solution. In that case, ranking and selecting useful rules is necessary [3, 6].

In practical applications, databases are typically incremental, meaning that transactions may be inserted. Therefore, mining association rules from incremental databases has attracted research interest. The first algorithm for mining association rules in incremental databases was Fast-UPdate (FUP) [5], an Apriori-based algorithm, which generates the candidates and repeatedly scans the database. Some efficient methods have since been developed, such as methods based on FP-tree [11, 12, 14], IT-tree [17], FIL/FCIL [15] and other methods [27]. Incremental mining of sequential patterns [4, 13] and of high utility patterns [18] has also been developed.

Hong et al. [10] proposed the pre-large concept (see Section 2.4) for reducing the need for rescanning the original database. With this concept, the original database does not need to be rescanned if the number of new transactions is equal to or less than a safety threshold. The maintenance cost is thus reduced with the pre-large concept. Although many algorithms have been developed to maintain the association rules in incremental databases, the maintenance of a FIL in incremental mining has received little attention. Compared to the maintenance of frequent itemsets, that of a FIL is more complex, because the algorithm has to consider the relations among the nodes in the lattice and update them. The present study proposes an effective approach for the maintenance of a FIL. The main contributions are:

1. A structure for quickly building a FIL is proposed.

2. Tidset-based maintenance of a pre-large FIL (TMPFIL) algorithm is developed.

3. Diffset-based maintenance of a PFIL (DMPFIL) algorithm is developed.

The rest of the paper is organized as follows. Section 2 presents the basic concepts. The TMPFIL and DMPFIL algorithms are proposed in Sections 3 and 4, respectively. Section 5 presents the results of experiments comparing the execution time of DMPFIL with those of TMPFIL and the FIL building algorithms to show the effectiveness of the proposed algorithm. Finally, the last section summarizes the results and offers some future research topics.

2 Basic concepts

2.1 Frequent-itemset lattice

Consider a database D with n transactions, with each transaction including a set of items belonging to I, where I is the set of all items in D. An example transaction database D1 is presented in Table 1. The support of an itemset X, denoted by σ(X), where X ⊆ I, is the number of transactions in D which contain all the items in X. An itemset X is called a frequent itemset if σ(X) ≥ minSup × n, where minSup is a given threshold. An itemset with k items is called a k-itemset.
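The support definition above can be checked directly on the example database of Table 1. The following is a minimal sketch, not the authors' implementation; the names `D1`, `support`, and `is_frequent` are illustrative assumptions.

```python
# Example database D1 from Table 1: transaction ID -> set of items.
D1 = {
    1: {"A", "C", "T", "W", "Z"},
    2: {"C", "D", "W"},
    3: {"A", "C", "T", "W"},
    4: {"A", "C", "D", "W"},
    5: {"A", "C", "D", "T", "W"},
    6: {"C", "D", "T"},
}


def support(itemset, db):
    """Number of transactions in db that contain every item of itemset."""
    items = set(itemset)
    return sum(1 for transaction in db.values() if items <= transaction)


def is_frequent(itemset, db, min_sup):
    """Check sigma(X) >= minSup * n, with min_sup given as a fraction in [0, 1]."""
    return support(itemset, db) >= min_sup * len(db)
```

With minSup = 50 % the threshold is 3 transactions, so {A, T} (support 3) is frequent while {A, D} (support 2) is not.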

Vo and Le [23] proposed an algorithm for building a FIL in which each node in the lattice has the form 〈X, Tidset, Children〉, where X is a k-itemset, Tidset is the set of IDs associated with the transactions containing X, and Children = {Y | Y ∈ (k+1)-itemsets and X ⊂ Y}.

2.2 Tidset-based frequent-itemset lattice building algorithm

Table 1 Example transaction database D1

Transaction   Items
1             A, C, T, W, Z
2             C, D, W
3             A, C, T, W
4             A, C, D, W
5             A, C, D, T, W
6             C, D, T

Fig. 1 FIL for D1 with minSup = 50 % obtained using the algorithm in [23, 24]

When a node XA in a FIL is created, the algorithm proposed in [23, 24] has to find all the nodes which are the children of XA to update the lattice. This process first visits all children of X (Y ∈ X.Children). With each Y, the process visits all children of Y (YB ∈ Y.Children). With each YB, if XA ⊂ YB, then the process adds YB to the children of XA (YB ∈ XA.Children). For example, consider the lattice in Fig. 1. When the algorithm creates the node TC, it has to consider all the child nodes associated with T, which are AT and TW. Then, the algorithm has to consider all the child nodes associated with AT and TW, which are {ATW, ATC} and {ATW}. However, the process of considering all child nodes of TW does not find any node that is the child node of TC. The node ATW is a duplicate, and thus the process of considering all child nodes associated with TW is meaningless. To overcome this weakness, this study proposes the following structure for a FIL:

Definition 1 Let X be a k-itemset. The child nodes based on the equivalence class feature associated with X are:

$$X.\mathit{ChildrenEC} = \{XA \mid \forall A \in I,\ A \notin X\} \tag{1}$$

Definition 2 Let X be a k-itemset. The child nodes based on the lattice feature associated with X are:

$$X.\mathit{ChildrenL} = \{Y \mid Y \text{ is a } (k+1)\text{-itemset},\ Y \notin X.\mathit{ChildrenEC},\ \text{and } X \subset Y\} \tag{2}$$

Definition 3 Each node in the new lattice structure is a tuple:

$$\langle \mathit{Itemset},\ \mathit{Tidset},\ \mathit{ChildrenEC},\ \mathit{ChildrenL} \rangle \tag{3}$$

where:

– Itemset is a k-itemset.
– Tidset is the set of IDs associated with the transactions containing Itemset.
– ChildrenEC contains the child nodes based on the equivalence class feature associated with Itemset.
– ChildrenL contains the child nodes based on the lattice feature associated with Itemset.
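The node structure of Definition 3 can be sketched as a small data class. This is an illustration only: it assumes that an equivalence-class child is one whose itemset extends the parent's itemset as a prefix in the item order of the running example, and that any other (k+1)-superset is a lattice child. The names `ORDER`, `Node`, `ordered`, `link_kind`, and `attach` are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass, field

# Item order assumed from the running example (ascending frequency in D1).
ORDER = ["A", "D", "T", "W", "C"]


def ordered(items):
    """Return the items as a tuple sorted by ORDER."""
    return tuple(sorted(items, key=ORDER.index))


@dataclass
class Node:
    itemset: tuple                                    # items kept in ORDER
    tidset: frozenset                                 # IDs of supporting transactions
    children_ec: list = field(default_factory=list)   # equivalence-class links
    children_l: list = field(default_factory=list)    # lattice links


def link_kind(parent, child):
    """Return "EC" if parent is a prefix of the (k+1)-itemset child,
    "L" if parent is any other k-subset of child, and None otherwise."""
    if len(child) != len(parent) + 1 or not set(parent) <= set(child):
        return None
    return "EC" if child[:len(parent)] == parent else "L"


def attach(parent, child):
    """Store child in the matching child list of parent and report the kind."""
    kind = link_kind(parent.itemset, child.itemset)
    if kind == "EC":
        parent.children_ec.append(child)
    elif kind == "L":
        parent.children_l.append(child)
    return kind
```

Under these assumptions, TW is attached to T as an equivalence-class child, whereas AT is attached to T as a lattice child because it belongs to the equivalence class of A.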

Theorem 1 Let XA be a node of a k-itemset, and let XB ∈ X.ChildrenEC be any node such that A is before B in the order of frequent 1-itemsets (sorted in ascending order of frequency). Then ∄Y ∈ XB.ChildrenEC ∪ XB.ChildrenL such that Y ∈ XA.ChildrenL.

Proof Assume that there exists a node of a (k+1)-itemset Y in the lattice such that Y ∈ XB.ChildrenEC ∪ XB.ChildrenL and Y ∈ XA.ChildrenL. Then:

XA ⊂ Y because Y ∈ XA.ChildrenL (a)

XB ⊂ Y because Y ∈ XB.ChildrenEC ∪ XB.ChildrenL (b)

From (a) and (b), Y can only be XAB. However, the algorithm uses the depth-first-search strategy; therefore, XAB has not yet been created. Therefore, Theorem 1 is proven.

Theorem 2 Let XA be a node of a k-itemset and let Z be any node in X.ChildrenL. Then ∄Y ∈ Z.ChildrenL such that Y ∈ XA.ChildrenL.

Proof Assume that there exists a node of a (k+1)-itemset Y in the lattice such that Y ∈ Z.ChildrenL and Y ∈ XA.ChildrenL. Then:

XA + {B} = Y, where B is any item, because Y ∈ XA.ChildrenL (c)

X + {C} = Z, where C is any item, because Z ∈ X.ChildrenL (d)

Z + {D} = Y, where D is any item, because Y ∈ Z.ChildrenL (e)

From (d) and (e),

X + {C, D} = Y. (f)

From (c) and (f), there are two cases:

1. C = A and D = B. Then, by (d), Z = X + {C} = X + {A}, so Z is the same as XA. Therefore, this case is impossible.

2. C = B and D = A. Then, by (f), Y = X + {A, B}. Therefore, Y = XAB. However, the algorithm uses the depth-first-search strategy; therefore, XAB has not yet been created. Therefore, this case is also impossible.

From cases 1 and 2, ∄Y ∈ Z.ChildrenL such that Y ∈ XA.ChildrenL. Theorem 2 is proven.

To understand the application of Theorems 1 and 2, the process of updating a lattice when the node XA, of a k-itemset, is created is described below. The algorithm in [23, 24] has to consider all nodes Y in the four cases shown in Table 2.

However, Theorems 1 and 2 show that for cases 1, 2, and 4 (Table 2), there is no need to search the current lattice for nodes of k-itemset Y that are child nodes of XA based on the lattice feature. Therefore, the tidset-based FIL (TFIL) building algorithm can easily find the nodes which belong to XA.ChildrenL. The process first visits all children based on the lattice feature of X (Y ∈ X.ChildrenL). With each Y, this process visits all children based on the equivalence class feature of Y (YB ∈ Y.ChildrenEC). With each YB, if XA ⊂ YB, the process adds YB to the children based on the lattice feature of XA (YB ∈ XA.ChildrenL). Separating Children into ChildrenEC and ChildrenL makes the TFIL algorithm better than the algorithm in [23, 24] because it eliminates a large number of candidates in cases 1, 2, and 4 in Table 2. The TFIL algorithm is presented in Fig. 2.

Table 2 Four cases considered when updating the lattice in [23, 24]

Case   Nodes of (k-1)-itemsets   Nodes of k-itemsets
1      Z ∈ X.ChildrenEC          Y ∈ Z.ChildrenEC
2      Z ∈ X.ChildrenEC          Y ∈ Z.ChildrenL
3      Z ∈ X.ChildrenL           Y ∈ Z.ChildrenEC
4      Z ∈ X.ChildrenL           Y ∈ Z.ChildrenL

The TFIL algorithm is applied to the example database in Table 1 with minSup = 50 % to illustrate its use. First, TFIL finds all the frequent 1-itemsets and sorts them in ascending order of frequency. The result of this step is I1 = {A, D, T, W, C}. The algorithm uses the depth-first-search strategy to generate the candidates associated with each equivalence class. The frequent (k-1)-itemsets in turn are combined with the remaining (k-1)-itemsets in this equivalence class to create the k-itemset candidates. The frequent itemsets from these candidates are used to create the frequent (k+1)-itemsets. When each node X in the lattice is created, the algorithm calls the procedure Update Lattice to update the child nodes based on the lattice feature associated with X which were created in the previous steps. For instance, the first frequent 1-itemset in I1, A, is combined with the remaining frequent 1-itemsets {D, T, W, C} to create the candidates {AD, AT, AW, AC}. However, AD is excluded because σ(AD) = 2 < minSup × n = 50 % × 6 = 3. The frequent 2-itemsets associated with equivalence class A are thus {AT, AW, AC}. Then, the algorithm combines the first frequent 2-itemset in this list, AT, with the remaining frequent 2-itemsets in this list {AW, AC} to create the frequent 3-itemsets {ATW, ATC}. ATW is combined with ATC to create the frequent 4-itemset {ATWC}; the remaining 2-itemsets of the class are processed in the same way, finishing the processing of equivalence class A. The algorithm similarly processes the remaining equivalence classes {D, T, W, C}. The results for the example database are shown in Fig. 3. Note that the dashed and solid lines represent ChildrenL and ChildrenEC, respectively.
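The depth-first traversal of equivalence classes in this walkthrough can be sketched as follows. This is a simplified, Eclat-style sketch under stated assumptions: it mines the frequent itemsets and their tidsets only, and it does not reproduce the Update Lattice procedure or the exact pseudocode of Fig. 2. The function name `mine` and the tie-breaking by item name are assumptions.

```python
D1 = {
    1: {"A", "C", "T", "W", "Z"},
    2: {"C", "D", "W"},
    3: {"A", "C", "T", "W"},
    4: {"A", "C", "D", "W"},
    5: {"A", "C", "D", "T", "W"},
    6: {"C", "D", "T"},
}


def mine(db, min_sup):
    """Return {itemset tuple: tidset} for all frequent itemsets of db,
    inserted in depth-first creation order."""
    min_count = min_sup * len(db)
    item_tids = {}
    for tid, items in db.items():
        for item in items:
            item_tids.setdefault(item, set()).add(tid)
    # Frequent 1-itemsets in ascending order of support (ties broken by name).
    level1 = sorted(
        (((item,), tids) for item, tids in item_tids.items()
         if len(tids) >= min_count),
        key=lambda pair: (len(pair[1]), pair[0]),
    )
    result = {}

    def extend(eq_class):
        # eq_class: list of (itemset, tidset) pairs sharing the same prefix.
        for i, (xa, t_xa) in enumerate(eq_class):
            result[xa] = t_xa
            children = []
            for xb, t_xb in eq_class[i + 1:]:
                t_xab = t_xa & t_xb              # t(XAB) = t(XA) ∩ t(XB)
                if len(t_xab) >= min_count:
                    children.append((xa + xb[-1:], t_xab))
            extend(children)                     # depth first

    extend(level1)
    return result


lattice_nodes = mine(D1, 0.5)
```

On D1 with minSup = 50 % this sketch visits A, AT, ATW, ATWC, ATC, AW, and so on, and never creates AD because its tidset {4, 5} is below the threshold of 3 transactions.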

2.3 Diffset-based FIL building algorithm

The diffset definition was first presented in [28], which proposed a method for quickly determining the support associated with a k-itemset based on the support of the (k-1)-itemset. Let the tidsets associated with XA, XB, and XAB be t(XA), t(XB), and t(XAB), respectively. Then:

$$t(XAB) = t(XA) \cap t(XB) \tag{4}$$

Let d(XAB) = t(XA) \ t(XB) = d(XB) \ d(XA) [28] be the TIDs that exist in t(XA) but not in t(XB). d(XAB) is called the diffset associated with XAB. The support associated with XAB is determined as:

$$\sigma(XAB) = \sigma(XA) - |d(XAB)| \tag{5}$$
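As a quick check of Eqs. (4) and (5), the following sketch computes diffsets for a few itemsets of the example database D1. The tidsets are read off Table 1; the variable names are illustrative. The infrequent itemset AD is used only to verify the identity d(XAB) = d(XB) \ d(XA) on a non-empty diffset.

```python
# Tidsets of single items in D1 (Table 1).
t = {
    "A": {1, 3, 4, 5},
    "D": {2, 4, 5, 6},
    "T": {1, 3, 5, 6},
    "W": {1, 2, 3, 4, 5},
}

# Level 2: d(AB) = t(A) \ t(B), sigma(AB) = sigma(A) - |d(AB)|.
d_ad = t["A"] - t["D"]                 # TIDs containing A but not D
d_at = t["A"] - t["T"]
sup_ad = len(t["A"]) - len(d_ad)
sup_at = len(t["A"]) - len(d_at)

# Level 3, two equivalent routes to d(ADT):
t_ad = t["A"] & t["D"]                 # Eq. (4)
t_at = t["A"] & t["T"]
d_adt_from_tidsets = t_ad - t_at       # t(XA) \ t(XB)
d_adt_from_diffsets = d_at - d_ad      # d(XB) \ d(XA)
sup_adt = sup_ad - len(d_adt_from_diffsets)   # Eq. (5)
```

Both routes give the same diffset {4} for ADT, and the resulting support of 1 matches the single transaction (ID 5) that contains A, D, and T.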

The diffset concept is used in the diffset-based FIL (DFIL) building algorithm, which is shown in Fig. 4. The function FIL and the procedure Update Lattice are not shown here because they are the same as those in the TFIL algorithm (in Fig. 2).

The lattice built using the DFIL algorithm for the example database with minSup = 50 % is shown in Fig. 5. The lattice obtained using the TFIL algorithm (L1) (see Fig. 3) and that obtained using the DFIL algorithm (L2) (see Fig. 5) are basically the same. However, the total size of the diffsets associated with all the frequent itemsets in L2 is less than the total size of the tidsets associated with all the frequent itemsets in L1. Therefore, using the diffset concept helps the DFIL algorithm reduce memory usage and execution time.

2.4 Pre-large concept in incremental mining

The pre-large concept was proposed in [10]. It is based on a safety threshold f to reduce the need for rescanning the original database when maintaining association rules. The safety number f of inserted transactions is derived as follows:

$$f = \left\lfloor \frac{(S_U - S_L) \times |D|}{1 - S_U} \right\rfloor \tag{6}$$

where SU is the upper threshold, SL is the lower threshold, and |D| is the number of transactions in the original database D. When the number of new transactions is equal to or less than f, the algorithm does not need to rescan the original database.

Fig. 2 TFIL algorithm

When two thresholds are used, each itemset has three cases: frequent, pre-large, and infrequent. This divides itemsets in the original and new databases into nine cases [10], as presented in Table 3.
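The three cases can be expressed as a small classification function. This sketch assumes that an itemset is frequent when its relative support reaches SU, pre-large when it reaches SL but not SU, and infrequent otherwise; the function name `classify` and the use of exact fractions are choices made for this illustration.

```python
from fractions import Fraction


def classify(count, n, s_upper, s_lower):
    """Classify an itemset from its support count in a database of n transactions.

    Thresholds are passed as strings (for example "0.65") or Fractions so that
    the comparisons are exact.
    """
    ratio = Fraction(count, n)
    if ratio >= Fraction(s_upper):
        return "frequent"
    if ratio >= Fraction(s_lower):
        return "pre-large"
    return "infrequent"
```

For the six transactions of Table 1 with SU = 65 % and SL = 50 %, an itemset with support 4 is frequent, one with support 3 is pre-large, and one with support 2 is infrequent.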

Example 1 Consider the database in Table 1 with |D1| = 6. Assume that SU = 70 % and SL = 50 %. Then, if the number of new transactions is equal to or less than

$$f = \left\lfloor \frac{(0.7 - 0.5) \times 6}{1 - 0.7} \right\rfloor = 4,$$

the algorithm does not need to rescan the original database to determine the support of infrequent itemsets which are mined from the original database.
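Equation (6) is simple to implement, but the floor makes it sensitive to rounding: in this example the exact quotient is 4, and evaluating the same expression with binary floating point can fall just below 4. The sketch below therefore uses exact rational arithmetic; the function name `safety_threshold` is an assumption of this illustration.

```python
from fractions import Fraction
from math import floor


def safety_threshold(s_upper, s_lower, n):
    """f = floor((S_U - S_L) * |D| / (1 - S_U)), computed exactly.

    s_upper and s_lower are strings such as "0.7" (or Fractions);
    n is the number of transactions in the original database.
    """
    su, sl = Fraction(s_upper), Fraction(s_lower)
    return floor((su - sl) * n / (1 - su))
```

For example, `safety_threshold("0.7", "0.5", 6)` reproduces the value f = 4 of Example 1.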

3 Tidset-based maintenance of PFIL algorithm

The tidset-based maintenance of a PFIL (TMPFIL) algorithm only uses the tidset concept for two sub-algorithms: TFIL (see Section 2.2) and a tidset-based approach for maintaining a PFIL.

Fig. 3 FIL for D1 with minSup = 50 % obtained using TFIL algorithm

3.1 TMPFIL algorithm

Consider the following two comments on the pre-large FIL:

1. When the paternity relations based on the lattice feature are ignored, the lattice becomes a frequent-itemset tree that contains all the itemsets in the lattice. Therefore, the algorithm traverses only the paternity relations based on the equivalence class feature to visit all the nodes of frequent itemsets.

2. When the algorithm finds the direct child node of XA and XB in the lattice where A is before B in the order of frequent 1-itemsets sorted in ascending order of frequency, the algorithm only needs to find XAB from ChildrenEC of XA and from ChildrenL of XB because XAB can only be in ChildrenEC of XA (by Definition 1) and ChildrenL of XB (by Definition 2). Note that the direct child node of XA and XB is only XAB if XAB's support satisfies the threshold.

Fig. 4 DFIL algorithm

Fig. 5 FIL for database D1 with minSup = 50 % obtained using DFIL algorithm

Table 3 Nine cases of itemsets

Case   Original – new             Result
1      frequent – frequent        frequent
2      frequent – pre-large       frequent/pre-large, determined from existing itemsets
3      frequent – infrequent      frequent/pre-large/infrequent, determined from existing itemsets
4      pre-large – frequent       pre-large/frequent, determined from existing itemsets
5      pre-large – pre-large      pre-large
6      pre-large – infrequent     pre-large/infrequent, determined from existing itemsets
7      infrequent – frequent      pre-large/infrequent, rescan the original if the number of inserted transactions is larger than f
8      infrequent – pre-large     infrequent/pre-large, rescan the original if the number of inserted transactions is larger than f
9      infrequent – infrequent    infrequent

Based on these comments, the TMPFIL algorithm is proposed. For each increment, the process of this algorithm is as follows:

1. If the original database is empty, the algorithm uses the lower threshold SL to build a PFIL and recalculates the safety threshold f for incremental database D′.

2. If the number of transactions in incremental database D′ is larger than f, the algorithm uses SL to build a PFIL and recalculates the safety threshold f for D + D′.

3. If the number of transactions in incremental database D′ is equal to or less than f, the algorithm updates the PFIL without scanning the database.

4. The original database is updated as D = D + D′.
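The four steps listed above amount to a small piece of control logic around two lattice operations. The sketch below captures only that control logic; the class name `PreLargeMaintainer` and the `rebuild` and `update` callbacks are hypothetical placeholders for the batch construction and the in-place update of the PFIL, and they are not the procedures of Fig. 6. Following the worked example of Section 3.2, the sketch assumes that f is decreased by the number of inserted transactions after an update.

```python
from fractions import Fraction
from math import floor


class PreLargeMaintainer:
    """Tracks the database and the safety threshold f across increments."""

    def __init__(self, s_upper, s_lower, rebuild, update):
        self.su, self.sl = Fraction(s_upper), Fraction(s_lower)
        self.rebuild = rebuild    # hypothetical: build the PFIL from scratch
        self.update = update      # hypothetical: update the PFIL in place
        self.db = []              # original database D
        self.f = 0                # remaining safety threshold

    def insert(self, new_transactions):
        new = list(new_transactions)
        if not self.db or len(new) > self.f:
            # Steps 1 and 2: build the PFIL with S_L on D + D' and reset f.
            self.db.extend(new)
            self.rebuild(self.db, self.sl)
            self.f = floor((self.su - self.sl) * len(self.db) / (1 - self.su))
            action = "rebuild"
        else:
            # Step 3: update the PFIL from D' only and consume part of f.
            self.update(new)
            self.f -= len(new)
            self.db.extend(new)   # step 4: D = D + D'
            action = "update"
        return action, self.f
```

With SU = 65 % and SL = 50 %, inserting six, then two, then one transaction yields a rebuild (f = 2), an update (f = 0), and another rebuild (f = 3).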

The TMPFIL algorithm is presented in Fig. 6.

3.2 Illustration of TMPFIL algorithm

In this section, an example is given to illustrate the process of TMPFIL in three rounds of updating the database with the initial database D = ∅, SL = 50 %, and SU = 65 %.

First round (D1 with the six transactions in Table 1) Because the condition in line 1 is true, the algorithm performs lines 2 and 3. The result of this increment is the PFIL shown in Fig. 3. Then, the algorithm updates

$$f = \left\lfloor \frac{(S_U - S_L) \times |D|}{1 - S_U} \right\rfloor = \left\lfloor \frac{(0.65 - 0.5) \times 6}{1 - 0.65} \right\rfloor = 2$$

and D = ∅ + D1.

Second round (D2 with two transactions in Table 4) Because |D2| = 2 = f, the algorithm performs lines 8, 9, 10, and 11 in the procedure INCREMENTAL-FIL.

1. The algorithm clears all the tidset information associated with all the nodes in the PFIL (see Fig. 7).

2. The algorithm inserts the tidset information (only for transactions 7 and 8) associated with the frequent 1-itemsets in the PFIL and then marks the updated nodes (Fig. 8).

3. The algorithm calls the procedure UPDATE-PFIL, which is called recursively in depth-first search, to update the tidset information of all nodes in the PFIL. The result of this step is shown in Fig. 9.

4. The algorithm updates the safety threshold f = f − |D2| = 2 − 2 = 0.

5. The algorithm updates the database D = D1 + D2.

Figure 9 shows that only a small number of nodes in the PFIL have their tidset information updated and are used to update the lattice.
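The effect of the second round on supports can be reproduced without touching the original six transactions. The sketch below is a simplification: it intersects the 1-item tidsets of the inserted transactions directly for each itemset, whereas UPDATE-PFIL is described as propagating them recursively through the lattice. The old supports are those of D1, only a subset of the PFIL nodes is listed, and all names are illustrative.

```python
# Supports of some PFIL nodes after the first round (computed from Table 1).
old_support = {"A": 4, "T": 4, "W": 5, "C": 6,
               "AT": 3, "TW": 3, "TC": 4, "ATW": 3}

# Inserted transactions D2 (Table 4).
D2 = {7: {"A", "T", "W", "Z"}, 8: {"C", "T", "W", "Z"}}

# Tidsets of single items restricted to the inserted transactions.
item_tids = {}
for tid, items in D2.items():
    for item in items:
        item_tids.setdefault(item, set()).add(tid)


def new_tidset(itemset):
    """TIDs of inserted transactions containing every item of itemset."""
    return set.intersection(*(item_tids.get(i, set()) for i in itemset))


# New support = old support + contribution of the increment alone.
new_support = {x: s + len(new_tidset(x)) for x, s in old_support.items()}
```

For instance, TW gains both inserted transactions (support 3 becomes 5), while AT and ATW gain only transaction 7.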

Third round (D3 with one transaction in Table 5) Because |D| > 0 and |D3| = 1 > f = 0, the algorithm performs lines 5 and 6 in the procedure INCREMENTAL-FIL.

1. The algorithm calls the function FIL to create the PFIL with D containing the nine transactions in Tables 1, 4 and 5. The result of this step is shown in Fig. 10.

2. The algorithm calculates the safety threshold

$$f = \left\lfloor \frac{(S_U - S_L) \times |D|}{1 - S_U} \right\rfloor = \left\lfloor \frac{(0.65 - 0.5) \times 9}{1 - 0.65} \right\rfloor = 3.$$

Therefore, for the fourth increment, if |D4| ≤ 3, the algorithm updates the PFIL without scanning the database.

4 Diffset-based maintenance of PFIL algorithm

In this section, the diffset-based maintenance of a PFIL (DMPFIL) algorithm is proposed (see Fig. 11). It uses only the diffset concept for its two sub-algorithms: DFIL (see Section 2.3) and a diffset-based approach for maintaining a PFIL. Note that only the procedure UPDATE-PFIL is given since the other procedures are the same as above (Fig. 6).

To use the diffset concept, each node in the lattice has a temporal support (Tsupport) field. The Tsupport associated with a node of a frequent 1-itemset is the number of TIDs in its tidset (only for inserted transactions). The Tsupport associated with a node of frequent k-itemset XAB (k > 1) is determined using Eq. (5), which can be rewritten as follows:

$$XAB.\mathit{Tsupport} = XA.\mathit{Tsupport} - |XAB.\mathit{Tidset}|$$

where A is before B in the order of frequent 1-itemsets and |XAB.Tidset| is the number of elements in the diffset of XAB.
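The Tsupport recurrence can be traced on the two transactions inserted in the second increment (Table 4). This sketch is illustrative rather than the UPDATE-PFIL procedure of Fig. 11: the helper names are hypothetical, and it assumes d(AB) = t(A) \ t(B) for 2-itemsets and d(XAB) = d(XB) \ d(XA) for longer itemsets, as in Section 2.3.

```python
# Tidsets of frequent 1-itemsets restricted to the inserted transactions 7 and 8.
t_new = {"A": {7}, "T": {7, 8}, "W": {7, 8}, "C": {8}}

tsupport = {x: len(tids) for x, tids in t_new.items()}   # 1-itemsets: |tidset|
diff = {}


def add_pair(a, b):
    """2-itemset ab with a before b: d(ab) = t(a) \\ t(b)."""
    diff[a + b] = t_new[a] - t_new[b]
    tsupport[a + b] = tsupport[a] - len(diff[a + b])


def add_child(xa, xb):
    """k-itemset (k > 2) from siblings xa and xb: d(xab) = d(xb) \\ d(xa)."""
    xab = xa + xb[-1]
    diff[xab] = diff[xb] - diff[xa]
    tsupport[xab] = tsupport[xa] - len(diff[xab])


add_pair("A", "T")
add_pair("A", "W")
add_pair("A", "C")
add_child("AT", "AW")   # ATW
add_child("AT", "AC")   # ATC
```

Only transaction 7 contains A, so AT, AW, and ATW each receive a Tsupport of 1, while AC and ATC receive 0 because no inserted transaction contains both A and C.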

To illustrate the DMPFIL algorithm, the PFIL obtained at step 3 of the second increment in Section 3.2 is shown in Fig. 12.

The PFIL in Fig. 12 is clearly better than that in Fig. 9 in terms of memory usage, which also reduces the execution time. Hence, DMPFIL is more effective than TMPFIL.

Fig. 6 TMPFIL algorithm

Table 4 Incremental database in the second increment (D2)

Transaction   Items
7             A, T, W, Z
8             C, T, W, Z

5 Experimental results

All experiments presented in this section were performed on a laptop with an Intel i3-3110M 2.4-GHz CPU and 4 GB of RAM running Windows 8. All the programs were coded in C# (version 4.5.50709). The experiments were conducted using the following UCI databases (available at http://fimi.cs.helsinki.fi/data/): Accidents, Chess, Mushroom, Pumsb star, and Retail. A statistical summary of these databases is shown in Table 6.

Fig. 7 PFIL of step 1 in the second increment

Fig. 8 PFIL of step 2 in the second increment

Fig. 9 PFIL of step 3 in the second increment

Table 5 Incremental database in the third increment (D3)

Transaction   Items
9             A, D, T, W, Z

Fig. 10 PFIL of step 1 in the third increment

Fig. 11 DMPFIL algorithm

Fig. 12 PFIL of step 3 in the second increment database obtained using DMPFIL algorithm

Fig. 13 Execution time of TMPFIL, DMPFIL (SU = 80 % and SL = 78 %) and TFIL, DFIL (SU = 80 %) for Accidents. [Plots not reproduced: execution time (seconds) versus number of transactions (333,379 to 339,499), and total execution time (seconds) per algorithm.]

Fig. 14 Execution time of TMPFIL, DMPFIL (SU = 95 % and SL = 93 %) and TFIL, DFIL (SU = 95 %) for Connect. [Plots not reproduced: execution time (seconds) versus number of transactions (66,205 to 67,420), and total execution time (seconds) per algorithm.]

Table 6 Statistical summary of experimental databases

Database      # of trans   # of items
Accidents     340,183      468
Chess         3,196        76
Mushroom      8,124        120
Pumsb star    49,046       7,117
Retail        88,162       16,470

In this section, the total execution times of the DMPFIL, TMPFIL, TFIL, and DFIL algorithms are compared. Note that TFIL and DFIL used the upper threshold SU to build the FIL.

5.1 Execution time comparison

Figure 13 compares the execution times of four algorithms, namely TFIL, DFIL, TMPFIL, and DMPFIL, for the Accidents database. The results show that the time for PFIL maintenance is smaller than that for building a FIL in batch mode. For 10 runs, the total execution times for TFIL, DFIL, TMPFIL, and DMPFIL were 14.563, 4.982, 0.064, and 0.144 s, respectively. In this case, updating the lattice using diffset is slightly slower than using tidset because only 680 transactions were inserted in each increment, which is a very small number of transactions compared to those in the original database (Table 6).

For the Connect database (Fig. 14), the execution time for updating the lattice using diffset is less than that using tidset. The total times for the four algorithms are 43.009, 1.759, 0.521, and 0.114 s, respectively.

Similar results were obtained for Mushroom and Pumsb star (Figs. 15 and 16, respectively). Updating the lattice when transactions are inserted is often faster than rebuilding it in batch mode. Using diffset is often more efficient than using tidset.

Figure 17 shows a case for which the proposed approach performs poorly. When SL is very small, our approach may be slower than the batch approach. In this very sparse database (Retail), using diffset is also sometimes slower than using tidset.

5.2 Execution time comparison between tidset-basedand diffset-based approaches

The lower support thresholds were changed to comparethe execution times for updating the lattice using tidset-based and diffset-based approaches (Figs. 18, 19, 20, 21,and 22). SU was fixed and SL was changed according to

Fig. 15 Execution time ofTMPFIL, DMPFIL (SU = 30 %and SL = 28 %) and TFIL, DFIL(SU = 30 %) for Mushroom

0

0.05

0.1

0.15

0.2

0.25

0.3

0.35

7961 7977 7962 7978 7963 7979 7964 7980 7965 7981

DFIL

TMPFIL

DMPFIL

TFIL

Exec

u�on

�m

e (s

econ

ds)

Number of transac�ons

0

0.5

1

1.5

2

2.5

3

3.5

TFIL DFIL TMPFIL DMPFIL

Exec

u�on

�m

e (s

econ

ds)

Algorithm

B. Vo et al.

Fig. 16 Execution time ofTMPFIL, DMPFIL (SU = 60 %and SL = 58 %) and TFIL, DFIL(SU = 60 %) for Pumsb star

0

0.05

0.1

0.15

0.2

0.25

0.3

48065 48163 48066 48164 48067 48165 48068 48166 48069 48167

TFIL

DFIL

TMPFIL

DMPFIL

Exec

u�on

�m

e (s

econ

ds)

Number of transac�ons

0

0.5

1

1.5

2

2.5

3

TFIL DFIL TMPFIL DMPFIL

Exec

u�on

�m

e (s

econ

ds)

Algorithm

Fig. 17 Execution time ofTMPFIL, DMPFIL (SU = 0.5 %and SL = 0.495 %) and TFIL,DFIL (SU = 0.5 %) for Retail

0

0.2

0.4

0.6

0.8

1

1.2

86398 86574 86750 86926 87102 87278 87454 87630 87806 87982

TFIL

DFIL

TMPFIL

DMPFILExec

u�on

�m

e (s

econ

ds)

Number of transac�ons

0

2

4

6

8

10

12

TFIL DFIL TMPFIL DMPFIL

Exec

u�on

�m

e (s

econ

ds)

Algorithm

Maintenance of pre-large-based frequent-itemset lattice

Fig. 18 Total execution time ofTMPFIL and DMPFIL withSU = 70 % and SL = SU − �s

for Accidents

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

333379 334059 334739 335419 336099 336779 337459 338139 338819 339499

TMPFIL(0.5)

DMPFIL(0.5)

TMPFIL(1)

DMPFIL(1)

TMPFIL(1.5)

DMPFIL(1.5)

TMPFIL(2)

DMPFIL(2)Ex

ecu�

on �

me

(sec

onds

)

Number of transac�ons

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

TMPFIL(0.5) DMPFIL(0.5) TMPFIL(1) DMPFIL(1) TMPFIL(1.5) DMPFIL(1.5) TMPFIL(2) DMPFIL(2)

Exec

u�on

�m

e (s

econ

ds)

Algorithm

Fig. 19 Total execution time of TMPFIL and DMPFIL with SU = 90 % and SL = SU − Δs for Connect. Panels: execution time (seconds) versus number of transactions (66,205 to 67,420); execution time (seconds) per algorithm. Series: TMPFIL and DMPFIL for Δs = 0.5, 1, 1.5, and 2 %.


Fig. 20 Total execution time of TMPFIL and DMPFIL with SU = 20 % and SL = SU − Δs for Mushroom. Panels: execution time (seconds) versus number of transactions; execution time (seconds) per algorithm. Series: TMPFIL and DMPFIL for Δs = 0.5, 1, 1.5, and 2 %.

Fig. 21 Total execution time of TMPFIL and DMPFIL with SU = 45 % and SL = SU − Δs for Pumsb star. Panels: execution time (seconds) versus number of transactions; execution time (seconds) per algorithm. Series: TMPFIL and DMPFIL for Δs = 0.5, 1, 1.5, and 2 %.


Fig. 22 Total execution time of TMPFIL and DMPFIL with SU = 0.5 % and SL = SU − Δs for Retail. Panels: execution time (seconds) versus number of transactions (86,398 to 87,982); execution time (seconds) per algorithm. Series: TMPFIL and DMPFIL for Δs = 0.005, 0.01, 0.015, and 0.02 %.

SL = SU − Δs. For example, with the Accidents database, SU was fixed at 70 % and Δs was 0.5 %, 1 %, 1.5 %, and 2 %, giving SL values of 69.5 %, 69 %, 68.5 %, and 68 %, respectively. Figure 18 shows that when Δs = 0.5 %, the total time is highest because the process must rebuild the PFIL at the 9th increment. When Δs = 1 %, the total time is lowest because the PFIL does not need to be rebuilt and the number of nodes that need to be considered in the lattice is smaller than for Δs = 1.5 % and 2 %.
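The arithmetic behind this example can be sketched as follows. The sketch assumes the safety threshold of the pre-large concept takes the form f = ⌊(SU − SL) · d / (1 − SU)⌋, where d is the size of the original database, and that the lattice is rebuilt once the cumulative number of inserted transactions exceeds f. The original size of 333,379 transactions and the increment of 680 transactions are read from the axis of Fig. 18 and are assumptions, as are all function names; the gap between the two thresholds is written `delta_s`.

```python
from fractions import Fraction
from math import floor


def lower_threshold(su, delta_s):
    """Lower support threshold SL = SU - delta_s (both given as fractions of 1)."""
    return su - delta_s


def safety_threshold(su, sl, d):
    """Assumed pre-large safety threshold: floor((SU - SL) * d / (1 - SU))."""
    return floor((su - sl) * d / (1 - su))


def first_rebuild(su, sl, d, increment, n_increments):
    """Return the 1-based increment at which the cumulative number of inserted
    transactions first exceeds the safety threshold, or None if it never does."""
    f = safety_threshold(su, sl, d)
    inserted = 0
    for k in range(1, n_increments + 1):
        inserted += increment
        if inserted > f:
            return k
    return None


# Hypothetical setup: values read from Fig. 18, not stated in the text.
SU = Fraction("0.70")
ORIGINAL_SIZE = 333_379
INCREMENT = 680

for pct in ("0.5", "1", "1.5", "2"):
    delta_s = Fraction(pct) / 100
    sl = lower_threshold(SU, delta_s)
    f = safety_threshold(SU, sl, ORIGINAL_SIZE)
    k = first_rebuild(SU, sl, ORIGINAL_SIZE, INCREMENT, 10)
    print(f"delta_s = {pct} %: SL = {float(sl * 100)} %, f = {f}, first rebuild at increment {k}")
```

Exact fractions are used instead of floats so that the floor is not affected by rounding. Under these assumptions the threshold for a gap of 0.5 % is 5,556 transactions, which the ninth increment (6,120 inserted transactions) is the first to exceed, in line with the rebuild reported for Accidents; for a gap of 1 % the threshold is not reached within ten increments.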

Consider the Connect database (Fig. 19). The best result was obtained with Δs = 0.5 % because in this case the algorithm does not rebuild the lattice. The total execution times for updating the lattice with TMPFIL and DMPFIL are 2.394 and 0.763 s with Δs = 0.5 %, 2.841 and 0.955 s with Δs = 1 %, 3.353 and 1.250 s with Δs = 1.5 %, and 3.959 and 1.461 s with Δs = 2 %, respectively.
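To make the comparison on Connect easier to read, the reported totals can be turned into speedup ratios. The following is a minimal sketch that uses only the eight times quoted above; the dictionary layout and the function name are illustrative assumptions.

```python
# Total update times in seconds for Connect, as reported in the text,
# keyed by the threshold gap in percent: (TMPFIL, DMPFIL).
connect_times = {
    0.5: (2.394, 0.763),
    1.0: (2.841, 0.955),
    1.5: (3.353, 1.250),
    2.0: (3.959, 1.461),
}


def speedup(tmpfil_s, dmpfil_s):
    """Ratio of TMPFIL time to DMPFIL time (values above 1 favour DMPFIL)."""
    return tmpfil_s / dmpfil_s


speedups = {gap: speedup(*times) for gap, times in connect_times.items()}
for gap, ratio in speedups.items():
    print(f"gap = {gap} %: DMPFIL is {ratio:.2f}x faster than TMPFIL")
```

On these figures DMPFIL is roughly 2.7 to 3.1 times faster than TMPFIL, and both algorithms are fastest at the smallest gap.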

This section compares the execution time when the number of inserted transactions is changed. Figures 23, 24, 25 and 26 show that DMPFIL is faster than TMPFIL in all cases. However, Fig. 27 shows a case for which the proposed approach performs poorly. In this very sparse database (Retail), DMPFIL is sometimes slower than TMPFIL.

Fig. 23 Execution time of TMPFIL and DMPFIL (SU = 70 % and SL = 68 %) for Accidents with Row start = 300,000 and Row end = 340,183

Fig. 24 Execution time of TMPFIL and DMPFIL (SU = 90 % and SL = 88 %) for Connect with Row start = 64,000 and Row end = 67,557

Fig. 25 Execution time of TMPFIL and DMPFIL (SU = 20 % and SL = 18 %) for Mushroom with Row start = 8,000 and Row end = 8,124

Fig. 26 Execution time of TMPFIL and DMPFIL (SU = 45 % and SL = 43 %) for Pumsb star with Row start = 45,000 and Row end = 49,046

Fig. 27 Execution time of TMPFIL and DMPFIL (SU = 0.5 % and SL = 0.495 %) for Retail with Row start = 85,000 and Row end = 88,162

6 Conclusions and future work

This paper proposed an effective algorithm for the maintenance of a FIL. The proposed approach has two phases: (i) building the FIL and (ii) maintaining the PFIL. For the first phase, two theorems were proposed and the diffset concept was used for quickly building a FIL. For the second phase, the DMPFIL algorithm uses the pre-large and diffset concepts for the maintenance of the PFIL. The experimental results show that DMPFIL is more effective than TFIL, DFIL, and TMPFIL in terms of execution time.

A disadvantage of the proposed method is that it must rebuild the FIL from the whole database (original and inserted transactions) when the number of inserted transactions exceeds the safety threshold. In addition, the pre-large concept requires maintaining a small number of infrequent itemsets in the lattice. In future work, we will study how to store more information to avoid rescanning the original database and to avoid storing infrequent itemsets in the lattice, so that the FIL can be maintained quickly. Using a FCIL for generating association rules is very effective; therefore, a method for updating the FCIL when the database changes will be developed. In addition, how to update a FIL for huge and very dense databases will be studied.

References

1. Agrawal R, Imielinski T, Swami AN (1993) Mining associationrules between sets of items in large databases. In: SIGMOD’93.pp 207–216

2. Agrawal R, Srikant R (1994) Fast algorithms for mining associa-tion rules. In: VLDB’94. pp 487–499

3. Cai R, Tung AKH, Zhang Z, Hao Z (2011) What is unequal amongthe equals? Ranking equivalent rules from gene expression data.IEEE Trans Knowl Data Eng 23(11):1735–1747

4. Chen H, Yan X, Han J (2004) IncSpan: Incremental miningof sequential patterns in large databases. In: KDD’04. pp 527–532

5. Cheung DW, Han J, Ng VT, Wong CY (1996) Maintenance ofdiscovered association rules in large databases: an incrementalupdating approach. In: ICDE’96. pp 106–114

6. Cong G, Tan K-L, Tung AKH, Xu X (2005) Mining Top-k cov-ering rule groups for gene expression data. In: SIGMOD’05. pp670–681

7. Dong J, Han M (2007) BitTableFI: an efficient mining frequentitemsets algorithm. Knowl Based Syst 20(4):329–335

8. Grahne G, Zhu J (2005) Fast algorithms for frequent itemsetmining using FP-trees. IEEE Trans Knowl Data Eng 17:1347–1362

9. Han J, Pei J, Yin Y (2000) Mining frequent patterns withoutcandidate generation. In: SIGMOD’00. pp 1–12

10. Hong TP, Wang CY, Tao YH (2001) A new incremental data min-ing algorithm using pre-large itemsets. Int Data Anal 5(2):111–129

11. Hong TP, Lin CW, Wu YL (2008) Incrementally fast updatedfrequent pattern trees. Expert Syst Appl 34(4):2424–2435

12. Hong TP, Lin CW, Wu YL (2009) Maintenance of fast updatedfrequent pattern trees for record deletion. Comput Stat Data Anal53(6):2485–2499

13. Hong TP, Wang CY, Tseng SS (2011) An incremental min-ing algorithm for maintaining sequential patterns using pre-largesequences. Expert Syst Appl 38(5):7051–7058

14. Koh JL, Shied SF (2004) An efficient approach for maintain-ing association rules based on adjusting FP-tree structures. In:DASFAA’04. pp 417–424

15. La PT, Le B, Vo B (2014) Incrementally building frequent closeditemset lattice. Expert Syst Appl 41(5):2703–2712

16. Le T, Vo B (2014) MEI: an efficient algorithm for mining erasableitemsets. Eng Appl Artif Int 27:155–166

17. Le TP, Hong TP, Vo B, Le B (2012) An efficient incrementalmining approach based on IT-tree. In: RIVF’12. pp 57–61

18. Lin CW, Hong TP, Lan GC, Wong JW, Lin WY (2014) Incremen-tally mining high utility patterns based on pre-large concept. ApplInt 40(2):343–357

19. Pasquier N, Bastide Y, Taouil R, Lakhal L (1999) Efficient min-ing of association rules using closed Itemset lattices. Inform Syst24(1):25–46

20. Shie BE, Philip SY, Tseng VS (2013) Mining interesting userbehavior patterns in mobile commerce environments. Appl Int38(3):418–435

21. Song W, Yang B, Xu Z (2008) Index-BitTableFI: an improvedalgorithm for mining frequent itemsets. Knowl-Based Syst21:507–513

22. Song W, Liu Y, Li J (2014) Mining high utility itemsetsby dynamically pruning the tree structure. Appl Int 40(1):29–43

23. Vo B, Le B (2009) Mining traditional association rules usingfrequent itemsets lattice. In: CIE’09. pp 1401–1406

24. Vo B, Le B (2011) Interestingness measures for association rules:Combination between lattice and hash tables. Expert Syst Appl38(9):11630–11640

25. Vo B, Hong TP, Le B (2013) A lattice-based approach for miningmost generalization association rules. Knowl-Based Syst 45:20–30

26. Yen SJ, Lee YS (2013) Mining non-redundant time-gap sequentialpatterns. Appl Int 39(4):727–738

27. Yen SJ, Lee YS, Wang CK (2013) An efficient algorithmfor incrementally mining frequent closed itemsets. Appl Int.doi:10.1007/s10489-013-0487-8 (in press)

28. Zaki MJ, Gouda K (2003) Fast vertical mining using diffsets. InKDD’03. pp 326–335

29. Zaki MJ, Hsiao CJ (2005) Efficient algorithms for mining closeditemsets and their lattice structure. IEEE Trans Knowl Data Eng17(4):462–478

30. Zaki MJ, Parthasarathy S, Ogihara M, Li W (1997) New algo-rithms for fast discovery of association rules. In: KDD’97. pp 283–286