Finding Recent Frequent Itemsets Adaptively over Online Data Streams

J. H. Chang and W. S. Lee, in Proc. of the 9th ACM International Conference on Knowledge Discovery and Data Mining, 2003.

Adviser: Jia-Ling Koh
Speaker: Shu-Ning Shin
Date: 2004.8.12



Page 2: Introduction
• This paper proposes a method of finding recent frequent itemsets:
– Significant itemsets are maintained in a prefix-tree lattice structure called a monitoring lattice.
– The old occurrence count of each itemset is decayed as time goes by.
– The number of significant itemsets is minimized by:
  • delayed insertion
  • pruning operations

Page 3: Preliminaries (1)
• A data stream is defined as follows:
– I = {i1, i2, …, in}: the set of current items.
– e: an itemset, i.e. a set of items.
– Tid: transaction id; Tk is the transaction generated at the kth turn.
– Dk = <T1, T2, …, Tk>: the current data stream when a new transaction Tk is generated.
– |D|k: the number of transactions in Dk.
– Ck(e): the number of transactions in Dk that contain the itemset e.
– Sk(e): the support of itemset e in Dk, i.e. Ck(e)/|D|k.

Page 4: Preliminaries (2)
• Decay rate: the rate at which a weight is reduced per fixed decay-unit.
• $d = b^{-1/h}$, with $b > 1$, $h \ge 1$, $b^{-1} \le d < 1$
– decay-unit: the chunk of information to be decayed together.
– decay-base b: the amount of weight reduction per decay-unit; it is greater than 1.
– decay-base-life h: the number of decay-units after which the current weight becomes $b^{-1}$.
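
As a small illustration of the definition above, the following Python sketch computes d from b and h and checks that a weight of 1 falls to 1/b after h decay-units. The function name `decay_rate` is hypothetical and does not come from the paper.

```python
def decay_rate(b: float, h: float) -> float:
    """Return d = b ** (-1 / h) for decay-base b > 1 and decay-base-life h >= 1."""
    if b <= 1 or h < 1:
        raise ValueError("requires b > 1 and h >= 1")
    return b ** (-1.0 / h)


# With b = 2 and h = 4, a weight of 1 is halved after 4 decay-units.
d = decay_rate(2, 4)
weight_after_h_units = d ** 4
```

A larger h gives a d closer to 1, so old transactions lose weight more slowly.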

Page 5: Preliminaries (3)
• The total number of transactions |D|k in the current data stream Dk:

$$|D|_k = \begin{cases} 1 & \text{if } k = 1 \\ |D|_{k-1} \times d + 1 & \text{if } k \ge 2 \end{cases}$$

– The value of |D|k converges to 1/(1-d) as k increases infinitely.
• The count Ck(e) of an itemset e in the current data stream Dk:

$$C_k(e) = C_{k-1}(e) \times d + W_k(e), \qquad W_k(e) = \begin{cases} 1 & \text{if } e \subseteq T_k \\ 0 & \text{otherwise} \end{cases}$$
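
The two recurrences can be replayed over a short stream to see how the decayed total and the decayed count evolve. This is a minimal sketch, assuming transactions are Python sets; the function name `decayed_totals` and the sample stream are illustrative only.

```python
def decayed_totals(transactions, itemset, d):
    """Replay a stream and return (|D|_k, C_k(itemset)) after the last transaction."""
    total = 0.0   # starting from 0 makes the first step give |D|_1 = 1
    count = 0.0   # C_0(e) = 0
    target = frozenset(itemset)
    for t in transactions:
        total = total * d + 1
        w = 1 if target <= frozenset(t) else 0   # W_k(e): 1 when e is contained in T_k
        count = count * d + w
    return total, count


stream = [{"a", "b"}, {"a"}, {"a", "b", "c"}]
total, count_ab = decayed_totals(stream, {"a", "b"}, d=0.5)
support_ab = count_ab / total   # decayed support of {a, b}
```

With d = 0.5 the total stays below 1/(1-d) = 2 no matter how long the stream runs, which is the convergence stated on the slide.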

Page 6: Count Estimation of an itemset (1)
• The maximum possible count of an itemset is estimated by the minimum value among the maximum possible counts of all of its subsets.

Page 7: Count Estimation of an itemset (2)
• Definition 1:
– $P(e)$: the set of itemset e's subsets.
– $P_m(e)$: the set of e's m-subsets.
– $P^C_m(e)$: the set of counts for e's m-subsets.
• Definition 2:
– Union-itemset $e_1 \cup e_2$: composed of all items that are members of either e1 or e2.
– Intersection-itemset $e_1 \cap e_2$: composed of all items that are members of both e1 and e2.

Page 8: Count Estimation of an itemset (3)
• Least exclusively distributed (LED): the items of an itemset appear together in as many transactions as possible.
• Most exclusively distributed (MED): the items of an itemset appear exclusively in as many transactions as possible.
• The maximum count of an n-itemset e:

$$C^{\max}(e) = \min\left(P^C_{n-1}(e)\right)$$
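
The maximum-count estimate can be sketched in a few lines of Python. The sketch assumes that the counts of all (n-1)-subsets are available in a dictionary keyed by `frozenset`; the name `c_max` and the sample counts are made up for illustration.

```python
from itertools import combinations


def c_max(itemset, counts):
    """Estimate C^max(e) as the minimum count among e's (n-1)-subsets.

    `counts` maps frozenset itemsets to their counts; e must have n >= 2 items.
    """
    items = sorted(itemset)
    subsets = [frozenset(s) for s in combinations(items, len(items) - 1)]
    return min(counts[s] for s in subsets)


counts = {frozenset("ab"): 4.0, frozenset("ac"): 3.0, frozenset("bc"): 5.0}
estimate = c_max({"a", "b", "c"}, counts)
```

Here {a, b, c} cannot occur more often than its rarest 2-subset {a, c}, so the estimate is 3.0.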

Page 9: Count Estimation of an itemset (4)
• For two itemsets e1 and e2:

$$C^{\min}(e_1 \cup e_2) = \begin{cases} \max(0,\; C(e_1) + C(e_2) - C(e_1 \cap e_2)) & \text{if } e_1 \cap e_2 \neq \emptyset \\ \max(0,\; C(e_1) + C(e_2) - |D|) & \text{if } e_1 \cap e_2 = \emptyset \end{cases}$$

• The minimum count Cmin(e) of an n-itemset e can be estimated from the unions of its (n-1)-subsets:

$$C^{\min}(e) = \max\left(\{\, C^{\min}(\alpha_i \cup \alpha_j) \mid \alpha_i, \alpha_j \in P_{n-1}(e) \text{ and } i \neq j \,\}\right)$$

• Estimation error:
– E(e) = Cmax(e) - Cmin(e)
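
A minimal Python sketch of the minimum-count estimate and the resulting estimation error follows. It assumes that the counts of every needed subset (including the intersections) are stored in one dictionary keyed by `frozenset`; all function names and the sample counts are hypothetical.

```python
from itertools import combinations


def c_min_union(e1, e2, counts, total):
    """Lower bound on the count of e1 ∪ e2 from the counts of e1, e2 and e1 ∩ e2."""
    common = e1 & e2
    overlap = counts[common] if common else total   # |D| when the intersection is empty
    return max(0.0, counts[e1] + counts[e2] - overlap)


def sub_itemsets(itemset):
    """All (n-1)-subsets of an n-itemset."""
    items = sorted(itemset)
    return [frozenset(s) for s in combinations(items, len(items) - 1)]


def c_min(itemset, counts, total):
    """C^min(e): the largest lower bound over pairs of distinct (n-1)-subsets."""
    pairs = combinations(sub_itemsets(itemset), 2)
    return max(c_min_union(a, b, counts, total) for a, b in pairs)


def estimation_error(itemset, counts, total):
    """E(e) = C^max(e) - C^min(e)."""
    upper = min(counts[s] for s in sub_itemsets(itemset))
    return upper - c_min(itemset, counts, total)


counts = {
    frozenset("a"): 6.0, frozenset("b"): 7.0, frozenset("c"): 5.0,
    frozenset("ab"): 4.0, frozenset("ac"): 3.0, frozenset("bc"): 5.0,
}
low_abc = c_min(frozenset("abc"), counts, total=10.0)
err_abc = estimation_error(frozenset("abc"), counts, total=10.0)
low_ab = c_min(frozenset("ab"), counts, total=10.0)
```

For {a, b, c} the pair ({a, c}, {b, c}) gives the tightest bound, 3 + 5 - 5 = 3, which equals the maximum count, so the error is 0. For {a, b} the intersection of {a} and {b} is empty and |D| = 10 is used instead.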

Page 10: estDec Method (1)
• Every node in a monitoring lattice maintains a triple (cnt, err, MRtid) for its corresponding itemset e:
– cnt: the count of e.
– err: the maximum error count of e.
– MRtid: the id of the most recent transaction that contains e.
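
One possible in-memory layout for such a node is sketched below. It is an assumption, not the authors' implementation: items are kept in lexicographic order along each path, children are held in a dictionary, and the names `Node`, `find` and `insert` are invented for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class Node:
    """One monitoring-lattice node: the triple (cnt, err, MRtid) plus child links."""
    cnt: float = 0.0
    err: float = 0.0
    mrtid: int = 0
    children: Dict[str, "Node"] = field(default_factory=dict)


def find(root: Node, itemset: Tuple[str, ...]) -> Optional[Node]:
    """Follow the sorted items of `itemset` down from the root; None if absent."""
    node = root
    for item in sorted(itemset):
        node = node.children.get(item)
        if node is None:
            return None
    return node


def insert(root: Node, itemset, cnt: float, err: float, mrtid: int) -> Node:
    """Insert `itemset` below its prefix, which must already be in the lattice."""
    *prefix, last = sorted(itemset)
    parent = find(root, tuple(prefix))
    if parent is None:
        raise KeyError("prefix itemset is not in the lattice")
    node = Node(cnt, err, mrtid)
    parent.children[last] = node
    return node


root = Node()
insert(root, ("a",), cnt=1.0, err=0.0, mrtid=1)
insert(root, ("a", "b"), cnt=1.0, err=0.0, mrtid=1)
```

Because every itemset hangs below its prefix, looking up {a, b} costs one dictionary access per item.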

Page 11: estDec Method (2)
• The estDec method is composed of four phases:
– Phase Ⅰ: parameter updating phase
– Phase Ⅱ: count updating phase
– Phase Ⅲ: delayed-insertion phase
– Phase Ⅳ: frequent itemset selection phase

Page 12: estDec Method (3)
• Phase I: |D|k is updated.
• Phase II: the counts of those itemsets in the monitoring lattice (ML) that appear in Tk are updated.
– Sprn: the threshold for pruning.
– If a 1-itemset is pruned from ML, it is impossible to estimate its count later.
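
Phases I and II can be sketched as follows. The slide does not spell out the update rule, so this is a hedged reconstruction: it assumes cnt and err are decayed by the same factor d^(k-MRtid) that the Phase IV support formula uses, that cnt is then incremented for Tk, and that only itemsets with two or more items are candidates for pruning. The dictionary-based node and all function names are hypothetical.

```python
def update_total(total: float, d: float) -> float:
    """Phase I: |D|_k = |D|_(k-1) * d + 1."""
    return total * d + 1


def update_count(node: dict, k: int, d: float) -> None:
    """Phase II: decay cnt and err for the transactions since MRtid, then count T_k."""
    decay = d ** (k - node["MRtid"])
    node["cnt"] = node["cnt"] * decay + 1
    node["err"] = node["err"] * decay
    node["MRtid"] = k


def is_prunable(node: dict, size: int, total: float, s_prn: float) -> bool:
    """Assumed rule: an itemset with 2+ items is prunable below the S_prn support."""
    return size > 1 and node["cnt"] / total < s_prn


node = {"cnt": 2.0, "err": 0.5, "MRtid": 3}
total = update_total(1.75, d=0.5)
update_count(node, k=5, d=0.5)
```

Storing MRtid lets a node skip every transaction that does not contain its itemset and catch up on the decay in a single step.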

Page 13: estDec Method (4)
• Phase III: find new itemsets that have a high possibility of becoming frequent. A new itemset is inserted into ML in two cases:
– A new 1-itemset: the cnt of a 1-itemset is actual.
– An itemset e with Cmax(e)/|D|k ≥ Sins, where Sins is the threshold for delayed insertion.
  • $cnt\_for\_subsets = (1 - d^{|e|-1})/(1 - d)$
  • $max\_cnt\_before\_subsets = S_{ins} \times |D|_{k-(|e|-1)} \times d^{|e|-1}$
  • $C^{upper}(e) = max\_cnt\_before\_subsets + cnt\_for\_subsets$
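
The upper bound can be computed directly from these three terms. The sketch below rests on one assumption that should be checked against the paper: the flattened term "|D|k-(|e|-1)" is read as the decayed total at transaction index k-(|e|-1). The closed form used for that total follows from the |D|k recurrence; the function names are hypothetical.

```python
def decayed_total(j: int, d: float) -> float:
    """Closed form of |D|_j = 1 + d + ... + d**(j-1), valid for 0 < d < 1."""
    return (1 - d ** j) / (1 - d)


def c_upper(size: int, k: int, d: float, s_ins: float) -> float:
    """Upper bound on the count of an itemset with `size` items at transaction k."""
    gap = size - 1                                   # |e| - 1
    cnt_for_subsets = (1 - d ** gap) / (1 - d)
    # Assumption: |D| is taken at index k - (|e| - 1), then decayed by d ** gap.
    max_cnt_before_subsets = s_ins * decayed_total(k - gap, d) * d ** gap
    return max_cnt_before_subsets + cnt_for_subsets


bound = c_upper(size=3, k=5, d=0.5, s_ins=0.1)
```

For this input the two terms are 0.1 × 1.75 × 0.25 = 0.04375 and (1 - 0.25)/0.5 = 1.5.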

Page 14: estDec Method (5)
• Phase IV: produces all current frequent itemsets in ML.
– An itemset e is frequent if its current support $(cnt \times d^{k-MRtid})/|D|_k$ is greater than Smin.
– Its current support error is $(err \times d^{k-MRtid})/|D|_k$.
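
The selection step amounts to one pass over the lattice. The following sketch flattens ML into a dictionary from itemset to its (cnt, err, MRtid) triple, which is a simplification made for the example; `frequent_itemsets` and the sample values are illustrative.

```python
def frequent_itemsets(lattice, k, total, d, s_min):
    """Return {itemset: (support, support_error)} for itemsets above S_min.

    `lattice` maps each itemset to its (cnt, err, MRtid) triple.
    """
    result = {}
    for itemset, (cnt, err, mrtid) in lattice.items():
        decay = d ** (k - mrtid)          # catch up on decay since the last update
        support = cnt * decay / total
        if support > s_min:
            result[itemset] = (support, err * decay / total)
    return result


lattice = {
    frozenset("a"): (1.5, 0.0, 5),
    frozenset("ab"): (1.0, 0.2, 3),
}
selected = frequent_itemsets(lattice, k=5, total=2.0, d=0.5, s_min=0.5)
```

{a, b} was last seen two transactions ago, so its count is scaled by 0.25 before the comparison and it falls below Smin.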

Page 15: estDec Method (6)
• Force-pruning operation:
– All insignificant itemsets in ML are pruned.
– It is performed when the current size of ML reaches a threshold.
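
A force-pruning sweep might look like the sketch below. Several points are assumptions rather than statements from the slides: "insignificant" is taken to mean a current support below Sprn, 1-itemsets are always kept, and ML is flattened into a dictionary, so the removal of descendants that a real prefix tree would need is not shown.

```python
def force_prune(lattice, k, total, d, s_prn, max_size):
    """Drop insignificant itemsets once the lattice holds `max_size` entries.

    `lattice` maps each itemset to its (cnt, err, MRtid) triple.
    """
    if len(lattice) < max_size:
        return lattice                      # size threshold not reached yet
    kept = {}
    for itemset, (cnt, err, mrtid) in lattice.items():
        support = cnt * d ** (k - mrtid) / total
        if len(itemset) == 1 or support >= s_prn:
            kept[itemset] = (cnt, err, mrtid)
    return kept


lattice = {
    frozenset("a"): (1.5, 0.0, 5),
    frozenset("ab"): (1.0, 0.2, 3),
    frozenset("ac"): (1.2, 0.0, 5),
}
pruned = force_prune(lattice, k=5, total=2.0, d=0.5, s_prn=0.2, max_size=3)
```

Running the sweep only at a size threshold keeps the per-transaction cost low while still bounding memory use.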

Page 16: Experimental (1)
• Performance of the estDec method for the data set T10.I4.D1000K
– Sins is denoted by p%; its actual value is Smin × p%.
– The force-pruning operation is performed every 1,000 transactions.
– (a) memory usage (b) processing time of Phases I~III (c) processing time of Phase IV

Page 17: Experimental (2)
• Accuracy of mining result
– Average support error
  • $ASE(R_{estDec} \mid R_{dApriori})$

Page 18: Experimental (3)
• The adaptability of the estDec method to changes of information in a data stream.
– Coverage rate CR(X)
  • |R|: the total number of frequent itemsets in ML