[ieee industrial engineering (cie39) - troyes, france (2009.07.6-2009.07.9)] 2009 international...



Fast Algorithm for Mining Minimal Generators of Frequent Closed Itemsets and Their Applications

Bay Vo¹, Bac Le²

¹Faculty of Information Technology, Ho Chi Minh City University of Technology, Vietnam ([email protected])

²Faculty of Information Technology, University of Science, Ho Chi Minh City, Vietnam ([email protected])

ABSTRACT

The number of frequent closed itemsets (FCIs) is usually smaller than the number of frequent itemsets. However, minimal generators (mGs) must be found in order to mine association rules from FCIs. Approaches that find mGs by generating candidates become slow when the number of frequent closed itemsets is large. In this paper, we present MG-CHARM, an efficient algorithm for finding all mGs of frequent closed itemsets. Based on the properties of mGs given in section 2.4, we develop an algorithm that does not generate candidates: it mines the mGs of frequent closed itemsets directly while mining the FCIs. Thus, the additional time for finding the mGs of frequent closed itemsets is insignificant. Experiments show that the run time of MG-CHARM is lower than the time needed to find mGs after finding all closed itemsets with CHARM, especially when the frequent closed itemsets are long. We also present applications of mGs for mining (minimal) non-redundant association rules.

Keywords: MG-CHARM, frequent closed itemset, minimal generators, non-redundant association rules.

978-1-4244-4136-5/09/$25.00 ©2009 IEEE

1. INTRODUCTION

At present, almost all algorithms for mining minimal generators (mGs) of FCIs are based on the Apriori algorithm [3,5]. The first method, proposed in [3], extends Apriori to find mGs: it first finds the candidates that are mGs, then computes their closures to obtain the frequent closed itemsets. The second method was presented by Zaki [5]: it first finds all frequent closed itemsets using the CHARM algorithm [4], then uses a level-wise method to find all mGs that correspond to each closed itemset. Both methods become slow when the number of frequent itemsets is large, because the number of candidates to consider is large.

In this paper, we propose a fast algorithm for mining mGs based on CHARM. It overcomes the disadvantage of the two methods above by using CHARM to generate the FCIs and finding their mGs at the same time, following the properties in section 2.4. We also present applications of mGs in mining non-redundant association rules.

2. FREQUENT ITEMSET AND MINIMAL GENERATORS

2.1. Definitions

2.1.1. Transaction database

Let I = {i1, i2, ..., in} be the set of all items and T = {t1, t2, ..., tm} be the set of all transactions in the transaction database D. The given database is a binary relation δ ⊆ I × T. If an item i ∈ I occurs in a transaction t ∈ T, we write it as (i, t) ∈ δ, or alternately as i δ t.

Example: consider the database in [4], shown in Table 1.

Transaction ID   Content
1                A, C, T, W
2                C, D, W
3                A, C, T, W
4                A, C, D, W
5                A, C, D, T, W
6                C, D, T

Vertical format:

Item   Transactions
A      1, 3, 4, 5
C      1, 2, 3, 4, 5, 6
D      2, 4, 5, 6
T      1, 3, 5, 6
W      1, 2, 3, 4, 5

Table 1. Example database (horizontal format and the equivalent vertical format)

The second transaction can be represented as {Cδ2, Dδ2, Wδ2}.

2.1.2. Support

Given a transaction database D and an itemset X ⊆ I, the support of X in D, denoted σ(X), is the number of transactions in D that contain X.

2.1.3. Frequent itemset

X ⊆ I is called a frequent itemset if σ(X) ≥ minSup (where minSup is a value specified by the user).

2.1.4. Galois connection [5]

Let the binary relation δ ⊆ I × T contain the database to be mined. Let X ⊆ I and Y ⊆ T, and let P(S) denote the power set of S. We define two mappings between P(I) and P(T), called a Galois connection, as follows:

i) t: P(I) → P(T), t(X) = {y ∈ T | ∀x ∈ X, x δ y}.
ii) i: P(T) → P(I), i(Y) = {x ∈ I | ∀y ∈ Y, x δ y}.

Figure 1 shows these mappings: t(X) is the set of all transactions containing the itemset X (called its tidset), and i(Y) is the set of items that occur in every transaction in Y. We write an itemset X together with its tidset t(X) as X×t(X) and call it an IT-pair; similarly, a tidset Y together with i(Y) is written as i(Y)×Y.

Example: t(ACW) = t(A) ∩ t(C) ∩ t(W) = 1345 ∩ 123456 ∩ 12345 = 1345
i(245) = i(2) ∩ i(4) ∩ i(5) = CDW ∩ ACDW ∩ ACDTW = CDW
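The two mappings and the support count can be checked directly on the example database. The following is a minimal Python sketch, assuming the database of Table 1 is stored as a dictionary from transaction ID to a string of items; the names `DB`, `ITEMS`, `t`, `i` and `support` are chosen here for illustration and do not come from the paper.

```python
# Example database of Table 1: transaction ID -> items (illustrative encoding).
DB = {1: "ACTW", 2: "CDW", 3: "ACTW", 4: "ACDW", 5: "ACDTW", 6: "CDT"}
ITEMS = set().union(*DB.values())


def t(itemset):
    """Tidset of an itemset: IDs of all transactions containing every item."""
    return {tid for tid, items in DB.items() if set(itemset) <= set(items)}


def i(tids):
    """Itemset of a tidset: items occurring in every listed transaction."""
    return {item for item in ITEMS if all(item in DB[tid] for tid in tids)}


def support(itemset):
    """Support of an itemset: the size of its tidset."""
    return len(t(itemset))


print(sorted(t("ACW")))      # [1, 3, 4, 5]
print(sorted(i({2, 4, 5})))  # ['C', 'D', 'W']
print(support("ACW"))        # 4
```

Scanning every transaction for each call is only meant to mirror the definitions; a vertical-format miner would intersect stored tidsets instead.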

Figure 1. Galois connection [5]

2.1.5. Closure operator [4]

Let X ⊆ I and let c: P(I) → P(I) be the mapping defined by c(X) = i(t(X)). The mapping c is called a closure operator.

Example: Consider the database in Table 1.
c(AW) = i(t(AW)) = i(1345) = ACW.
c(ACW) = i(t(ACW)) = i(1345) = ACW.

2.1.6. Frequent closed itemset

X ⊆ I is called closed if and only if c(X) = X. X is called a frequent closed itemset if σ(X) ≥ minSup and it is closed.

Example: Consider the database in Table 1.
Since c(AW) = i(t(AW)) = i(1345) = ACW, AW is not closed.
Since c(ACW) = i(t(ACW)) = i(1345) = ACW, ACW is closed.

2.2. Properties of IT-pair [4]

Let Xi×t(Xi) and Xj×t(Xj) be two IT-pairs. We have the following four properties:

(1) If t(Xi) = t(Xj) then c(Xi) = c(Xj) = c(Xi ∪ Xj).

(2) If t(Xi) ⊂ t(Xj) then c(Xi) ≠ c(Xj) but c(Xi) = c(Xi ∪ Xj).

(3) If t(Xi) ⊃ t(Xj) then c(Xi) ≠ c(Xj) but c(Xj) = c(Xi ∪ Xj).

(4) If t(Xi) ⊄ t(Xj) and t(Xj) ⊄ t(Xi) then c(Xi) ≠ c(Xj) ≠ c(Xi ∪ Xj).

2.3. Minimal Generator (mG) [5]

Let X be a frequent closed itemset. X' ≠ ∅ is called a generator of X if and only if:

1. X' ⊆ X.

2. σ(X) = σ(X').

Let G(X) denote the set of generators of X. We say that X' ∈ G(X) is a minimal generator if it has no proper subset in G(X). Let mGs(X) denote the set of all minimal generators of X. By definition, mGs(X) ≠ ∅.

For example, given the frequent closed itemset X = ACTW, the generators are G(X) = {AT, TW, ACT, ATW, CTW} and mGs(X) = {AT, TW}.

2.4. Lemma 1 - properties of mGs

(1) mG(i) = i, ∀i ∈ I.

(2) In the process of using the CHARM algorithm for mining FCIs, if Xi and Xj satisfy property (1) of IT-pairs, then:
mGs(Xi ∪ Xj) = mGs(Xi) + mGs(Xj).

(3) If they satisfy property (2), then:
mGs(Xi ∪ Xj) = mGs(Xi).

(4) If they satisfy property (3), then:
mGs(Xi ∪ Xj) = mGs(Xj).

(5) If they satisfy property (4), then:
mGs(Xi ∪ Xj) = ∪[mGs(Xi), mGs(Xj)],
that is, the unions of one mG of Xi with one mG of Xj.
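The definitions of closure and minimal generator can be verified by brute force on Table 1. The sketch below is a definition-level check only, not the method proposed in this paper: it enumerates every non-empty subset of a closed itemset, which is exponential in its length. The helper names (`DB`, `t`, `closure`, `minimal_generators`) are illustrative assumptions.

```python
from itertools import combinations

# Example database of Table 1: transaction ID -> items (illustrative encoding).
DB = {1: "ACTW", 2: "CDW", 3: "ACTW", 4: "ACDW", 5: "ACDTW", 6: "CDT"}
ITEMS = set().union(*DB.values())


def t(itemset):
    """Tidset of an itemset."""
    return frozenset(tid for tid, items in DB.items()
                     if set(itemset) <= set(items))


def closure(itemset):
    """c(X) = i(t(X)): items shared by every transaction that contains X."""
    tids = t(itemset)
    return frozenset(x for x in ITEMS if all(x in DB[tid] for tid in tids))


def minimal_generators(closed):
    """Enumerate subsets with the same tidset, then keep the minimal ones."""
    target = t(closed)
    gens = [frozenset(c)
            for k in range(1, len(closed) + 1)
            for c in combinations(sorted(closed), k)
            if t(c) == target]
    return {g for g in gens if not any(h < g for h in gens)}


print("".join(sorted(closure("AW"))))      # ACW, so AW is not closed
print(closure("ACW") == frozenset("ACW"))  # True, so ACW is closed
print(sorted("".join(sorted(g)) for g in minimal_generators("ACTW")))
# Property (2) of IT-pairs: t(D) is a proper subset of t(C), so c(D) = c(DC).
print(t("D") < t("C"), closure("D") == closure("DC"))
```

The third line prints `['AT', 'TW']`, matching the example for X = ACTW.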

Proof:

(1) Suppose that X contains one single item and X' is an mG of X. Since X' ≠ ∅ and X' ⊆ X, X' = X.

(2) We have t(Xi) = t(Xj) = t(mG(Xi)) = t(mG(Xj)) = t(Xi ∪ Xj), so mGs(Xi) and mGs(Xj) are mGs of Xi ∪ Xj.

(3) We have t(Xi) = t(Xi ∪ Xj) = t(mG(Xi)) ≠ t(mG(Xj)), so only mGs(Xi) are mGs of Xi ∪ Xj.

(4) Similar to (3).

(5) To prove Lemma 1.5, we must prove:

5.1. ∪[mGs(Xi), mGs(Xj)] are generators of Xi ∪ Xj. Indeed, consider Z = mGk(Xi) ∪ mGl(Xj) (where mGk(Xi) ∈ mGs(Xi) and mGl(Xj) ∈ mGs(Xj)). Since t(mGk(Xi)) = t(Xi) and t(mGl(Xj)) = t(Xj), we have t(Xi ∪ Xj) = t(Z), so Z is a generator of Xi ∪ Xj.

5.2. ∪[mGs(Xi), mGs(Xj)] are minimal generators of Xi ∪ Xj. Indeed, suppose we are at level m+1 of the IT-tree, and let Z = mGk(Xi) ∪ mGl(Xj) (where mGk(Xi) ∈ mGs(Xi) and mGl(Xj) ∈ mGs(Xj)). Both mGk(Xi) and mGl(Xj) are at level m and they have the same prefix, an (m-1)-itemset. Obviously, mGk(Xi) and mGl(Xj) are direct subsets of Z, and there is no proper subset Y of mGk(Xi) ∪ mGl(Xj) such that t(Y) = t(Z); that is, Z is an mG of Xi ∪ Xj.

3. MG-CHARM ALGORITHM

3.1. Algorithm

Input: the database D and the support threshold minSup
Output: all FCIs that satisfy minSup, together with their mGs
Method:

MG-CHARM(D, minSup)
    [∅] = {li × t(li), (li) : li ∈ I ∧ σ(li) ≥ minSup}
    MG-CHARM-EXTEND([∅], C = ∅)
    return C

MG-CHARM-EXTEND([P], C)
    for each li × t(li), mG(li) in [P] do
        Pi = P ∪ li and [Pi] = ∅
        for each lj × t(lj), mG(lj) in [P], with j > i do
            X = lj and Y = t(li) ∩ t(lj)
            MG-CHARM-PROPERTY(X × Y, li, lj, Pi, [Pi], [P])
        SUBSUMPTION-CHECK(C, Pi)
        MG-CHARM-EXTEND([Pi], C)

MG-CHARM-PROPERTY(X × Y, li, lj, Pi, [Pi], [P])
    if σ(X) ≥ minSup then
        if t(li) = t(lj) then              // property 1
            Remove lj from [P]
            Pi = Pi ∪ lj
            mG(Pi) = mG(Pi) + mG(lj)
        else if t(li) ⊂ t(lj) then         // property 2
            Pi = Pi ∪ lj
        else if t(li) ⊃ t(lj) then         // property 3
            Remove lj from [P]
            Add X × Y, mG(lj) to [Pi]
        else if t(li) ≠ t(lj) then         // property 4
            Add X × Y, ∪[mG(li), mG(lj)] to [Pi]

Algorithm 1. MG-CHARM algorithm

The SUBSUMPTION-CHECK function checks whether the frequent itemset Pi is closed. If it is, Pi is added to C; otherwise it is removed and its mGs are added to those of the closed itemset that subsumes it. C is stored in a hash table, so the time for checking is insignificant [4].
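Algorithm 1 can be sketched in Python as follows. This is an illustrative reading of the pseudocode, not the authors' C# implementation. Several details are assumptions made here: items are sorted by increasing support (as CHARM does), a dictionary keyed by tidset stands in for the hash-based subsumption check, and a final inclusion filter on the merged generators is added as a precaution that is not part of Algorithm 1. The names `mg_charm`, `_extend` and `_subsume` are hypothetical.

```python
def mg_charm(transactions, min_sup):
    """Return {closed itemset: set of generators} (sketch of Algorithm 1)."""
    tidsets = {}  # vertical format: item -> set of transaction IDs
    for tid, items in transactions.items():
        for item in items:
            tidsets.setdefault(item, set()).add(tid)
    # Root class: frequent 1-itemsets, each its own generator (Lemma 1.1).
    root = [[frozenset([item]), frozenset(tids), {frozenset([item])}]
            for item, tids in sorted(tidsets.items(),
                                     key=lambda kv: (len(kv[1]), kv[0]))
            if len(tids) >= min_sup]
    closed = {}  # tidset -> (closed itemset, generators)
    _extend(root, closed, min_sup)
    return {items: gens for items, gens in closed.values()}


def _extend(nodes, closed, min_sup):
    nodes = list(nodes)
    for a in range(len(nodes)):
        if nodes[a] is None:  # removed earlier by property 1 or 3
            continue
        xi, ti, gi = nodes[a]
        children = []
        for b in range(a + 1, len(nodes)):
            if nodes[b] is None:
                continue
            xj, tj, gj = nodes[b]
            y = ti & tj
            if len(y) < min_sup:
                continue
            if ti == tj:      # property 1: merge itemsets and generators
                nodes[b] = None
                xi, gi = xi | xj, gi | gj
                for child in children:
                    child[0] = child[0] | xj
            elif ti < tj:     # property 2: absorb Xj, generators unchanged
                xi = xi | xj
                for child in children:
                    child[0] = child[0] | xj
            elif ti > tj:     # property 3: child inherits generators of Xj
                nodes[b] = None
                children.append([xi | xj, y, set(gj)])
            else:             # property 4: pairwise unions of generators
                children.append([xi | xj, y, {p | q for p in gi for q in gj}])
        _subsume(closed, xi, ti, gi)
        _extend(children, closed, min_sup)


def _subsume(closed, items, tids, gens):
    """Merge into the closed set with the same tidset, or register a new one."""
    old_items, old_gens = closed.get(tids, (frozenset(), set()))
    merged = old_gens | gens
    # Precaution (an assumption, not in the paper): drop any generator that
    # strictly contains another collected generator.
    minimal = {g for g in merged if not any(h < g for h in merged)}
    closed[tids] = (old_items | items, minimal)


DB = {1: "ACTW", 2: "CDW", 3: "ACTW", 4: "ACDW", 5: "ACDTW", 6: "CDT"}
result = mg_charm(DB, 3)
for items, gens in sorted(result.items(), key=lambda kv: sorted(kv[0])):
    print("".join(sorted(items)), sorted("".join(sorted(g)) for g in gens))
```

On Table 1 with a threshold of 3 transactions, this sketch yields seven closed itemsets; ACTW ends up with the generators AT and TW, one of which arrives through the subsumption step. The inclusion filter only compares the generators that were collected, so it does not by itself prove minimality.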

3.2. Lemma 2

Finding the mGs of a frequent closed itemset X according to Algorithm 1 does not lose information.

Proof: Suppose that there is a frequent itemset X' such that:

i) X' ⊆ X

ii) σ(X') = σ(X)

iii) Z ⊄ X', ∀Z ∈ mGs(X)

It means that X' is a minimal generator of X but does not belong to mGs(X). Since X' ⊂ X, X' is not closed, so X' must be a generator of some frequent closed itemset Y. Since t(X') = t(X) and t(X') = t(Y), we have t(X) = t(Y). By property 1 of IT-pairs, c(X) = c(Y) = c(X ∪ Y), so X and Y cannot be two different closed itemsets. It means that Y does not exist, so X' does not exist.

3.3. Illustrations

Consider the database in Table 1 with minSup = 50% (3 of the 6 transactions). We have the following results:

mG(D) = {D}; mG(T) = {T}; mG(1-itemset) = {itself}

Figure 2. mG of 1-itemsets

First, the equivalence class {} contains the frequent 1-itemsets. By Lemma 1.1, the mG of a 1-itemset is itself, as shown in Figure 2.

Next, when joining the two IT-pairs D×2456 and T×1356 into DT×56, we have σ(DT) = |56| = 2 < minSup, so DT is not frequent. The same holds for DA.

Consider DW: since t(D) ≠ t(W) and neither contains the other, by Lemma 1.5 we have mG(DW) = mG(D) ∪ mG(W) = DW.

Consider DC: since t(D) ⊂ t(C), by Lemma 1.3 we have mG(DC) = mG(D).

Figure 3. Illustration of updating mG in the case of satisfying Lemma 1.3 (since t(D) ⊂ t(C), mG(DC) = mG(D) = {D})

To illustrate lemma 1.4, we consider the result of joining T with the IT-pairs that follow it, as shown in Figure 4.

Figure 4. The result after joining T and the IT-pairs behind it

Consider joining TAC and TWC: since t(TAC) = t(TWC), we have mG(TAWC) = mG(TAC) + mG(TWC) = {TA, TW} (Lemma 1.2). The final result tree is shown in Figure 5.

Figure 5. Search IT-tree after running the MG-CHARM algorithm
(the mGs in parentheses belong to the corresponding closed set)

4. EXPERIMENTS

Experiments were run on several databases downloaded from http://fimi.cs.helsinki.fi/data. The application was coded in C# and run on Windows XP with a 1 GHz CPU and 384 MB of RAM.

Table 2. Features of test databases

Database    #Trans    #Items
Chess       3196      76
Mushroom    8124      120
Pumsb*      49046     7117
Pumsb       49046     7117
Connect     67557     130
Accidents   340183    468

Table 3. Results of tests comparing CHARM and MG-CHARM

Database    minSup (%)   #FCI     Time (s)    Time (s)        Rate
                                  CHARM (1)   MG-CHARM (2)    (1)/(2)
Chess       80           5083     5.52        0.24            23
Chess       75           11525    40.11       0.43            93.28
Chess       70           23892    120.8       1.05            115.05
Chess       65           49034    651.12      2.44            266.85
Mushroom    40           140      0.32        0.25            1.28
Mushroom    30           427      0.45        0.4             1.13
Mushroom    20           1200     1.29        1.03            1.25
Mushroom    10           4095     7.82        2.54            3.08
Mushroom    5            12940    56.03       5.55            10.1
Pumsb       94           216      0.69        0.66            1.05
Pumsb       92           610      1.21        1.09            1.11
Pumsb       90           1465     1.77        1.52            1.16
Pumsb       88           3160     4.05        1.88            2.15
Pumsb*      60           68       0.69        0.65            1.06
Pumsb*      55           116      1.11        1.02            1.09
Pumsb*      50           248      2.51        2.01            1.25
Pumsb*      45           713      4.72        3.62            1.3
Connect     95           811      1.32        1.16            1.14
Connect     90           3486     5.91        2.33            2.54
Connect     85           8252     28.24       4.56            6.19
Connect     80           15107    100.77      5.86            17.2
Accidents   80           149      2.55        2.5             1.02
Accidents   70           529      6.39        5.7             1.12
Accidents   60           2074     16.21       15.18           1.07
Accidents   50           8057     49.34       31              1.59

(1) The time for mining FCIs by CHARM plus the time for finding the mGs of the FCIs.

(2) The time for MG-CHARM.

The results in Table 3 show that the run time of MG-CHARM is lower than the run time of mining mGs after mining FCIs with CHARM. For example, on the Chess database with minSup = 65%, MG-CHARM is about 266 times faster. In general, the larger the number of FCIs, the higher the rate.

5. RELATED WORKS AND APPLICATION OF mGs IN MINING NON-REDUNDANT ASSOCIATION RULES

In [10], the authors presented four categories of algorithms for mining FCIs. Of these, only the first category (test-and-generate) mines mGs. This approach generates candidates to mine mGs and computes their closures to mine FCIs.

Below we describe some applications of mGs in mining non-redundant association rules (NARs), based on the methods presented in [3,5].

5.1. Method of Bastide et al.

In this section, we present a method for mining minimal NARs (MINARs) based on FCIs (Bastide et al. [3]). Assume that we have all FCIs of database D that satisfy minSup, and that we now want to mine MINARs. MINARs are generated only from rules X → Y, where X is a minimal generator and Y is an FCI.

Bastide et al. divided the problem of mining MINARs into two sub-problems: mining MINARs with confidence = 100% and mining MINARs with confidence < 100%.

Phase 1: Mining MINARs with confidence = 100%
    ∀g ∈ mGs (in ascending length), if g ≠ γ(g) then generate the rule {g → γ(g) \ g} (γ(g) is the closure of g).

Phase 2: Mining MINARs with confidence < 100%
    for k = 1 to μ-1 do    // μ is the length of the longest FCI
        forall g ∈ mGs with |g| = k do
            generate rules g → Y \ g, where g ⊂ Y and Y is an FCI (Y ≠ γ(g))

5.2. Method of Zaki

Zaki [5] mined two kinds of rules. i) Rules with confidence = 100%: self-rules (rules generated from mGs(X) to X, where X is an FCI) and down-rules (rules generated from mGs(Y) to mGs(X), where X and Y are FCIs and X ⊂ Y). ii) Rules with confidence < 100%: rules from mGs(X) to mGs(Y), where X and Y are FCIs and X ⊂ Y. The number of rules generated by this approach is linear in the number of FCIs.

6. CONCLUSION AND FUTURE WORK

In this paper, we have proposed a new method for mining the mGs of FCIs that does not need to generate candidates. Experiments showed that the time for updating the mGs of frequent closed itemsets is insignificant. In particular, when the set of closed itemsets is large, the updating time is much lower than that of the methods in [3,5].

In the future, we will apply this result to the problems of mining non-redundant association rules and querying non-redundant rules.

REFERENCES

[1] R. Agrawal, R. Srikant: Fast algorithms for mining association rules. In: VLDB'94 (1994).

[2] R. Agrawal, T. Imielinski, A. Swami: Mining association rules between sets of items in large databases. In: SIGMOD'93, pp. 207-216 (1993).

[3] Y. Bastide, N. Pasquier, R. Taouil, G. Stumme, L. Lakhal: Mining Minimal Non-Redundant Association Rules using Closed Frequent Itemsets. In: 1st International Conference on Computational Logic, pp. 972-986 (2000).

[4] M. J. Zaki, C. J. Hsiao: Efficient Algorithms for Mining Closed Itemsets and Their Lattice Structure. In: IEEE Transactions on Knowledge and Data Engineering, Vol. 17, No. 4, pp. 462-478 (2005).

[5] M. J. Zaki: Mining Non-Redundant Association Rules. In: Data Mining and Knowledge Discovery, 9, Kluwer Academic Publishers, pp. 223-248 (2004).

[6] M. J. Zaki, K. Gouda: Fast Vertical Mining Using Diffsets. In: Proc. Ninth ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining, pp. 326-335 (2003).

[7] M. J. Zaki, B. Phoophakdee: MIRAGE: A Framework for Mining, Exploring, and Visualizing Minimal Association Rules. In: Technical Report 03-4, Computer Science Dept., Rensselaer Polytechnic Inst., July (2003).

[8] M. J. Zaki: Generating Non-Redundant Association Rules. In: 6th ACM SIGKDD Intl Conf. Knowledge Discovery and Data Mining (2000).

[9] M. J. Zaki, M. Ogihara: Theoretical Foundations of Association Rules. In: 3rd ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (1998).

[10] S. B. Yahia, T. Hamrouni, E. M. Nguifo: Frequent Closed Itemset based Algorithms: A thorough Structural and Analytical Survey. In: ACM SIGKDD Explorations Newsletter 8 (1), pp. 93-104 (2006).