
MINING FREQUENT ITEMSETS IN A STREAM

TOON CALDERS, NELE DEXTERS, BART GOETHALS

ICDM 2007

Date: 5 June 2008

Speaker: Li, Huei-Jyun

Advisor: Dr. Koh, Jia-Ling


OUTLINE

Introduction
Problem Statement
Properties of Max-Frequency
Algorithm
Experiments
Conclusion


INTRODUCTION

Most previous work on mining frequent itemsets over data streams focuses on one of three models:
1. The sliding window model
2. The time-fading model
3. The landmark model

Each of these models requires a fixed window length or decay factor given by the user

In many applications, however, choosing parameters that are appropriate for every itemset at every time point in an evolving stream is almost impossible


INTRODUCTION

We propose to consider, for each itemset, the window in which it has the highest frequency

We define the current frequency of an itemset as the maximum of its frequencies over all windows that extend from some point in the past to the current state and satisfy a minimal size constraint

When a stream evolves, the length of the window containing the highest frequency for a given itemset can change continuously

This new stream measure turns out to be well suited to detecting sudden bursts of occurrences of itemsets early, while still taking the history of the itemset into account


PROBLEM STATEMENT * STREAMS AND MAX-FREQUENCY

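$S = \langle I_1 I_2 \ldots I_n \rangle$: a stream is a sequence of itemsets

$|S| = n$ is the length of the stream; $I_1$ is considered the first and oldest itemset in the stream, and $I_n$ the latest and most recent

$\mathit{count}(I, S)$: the number of sets in the stream $S$ that contain itemset $I$; the frequency of $I$ in $S$ is $\mathit{freq}(I, S) = \mathit{count}(I, S) / |S|$

$S[s, t]$: the sub-stream of the window $\langle I_s I_{s+1} \ldots I_t \rangle$

$\mathit{last}(k, S)$: the sub-stream of $S$ consisting of the last $k$ itemsets of $S$, that is, $S[|S| - k + 1, |S|]$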
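The notation on this slide can be made concrete with a few helper functions. This is only a minimal sketch: the names `count`, `freq`, `window` and `last` are chosen here for illustration, and a stream is assumed to be a Python list of `frozenset` objects with 1-based, inclusive positions.

```python
from typing import FrozenSet, List

Itemset = FrozenSet[str]
Stream = List[Itemset]


def count(target: Itemset, stream: Stream) -> int:
    """Number of itemsets in the stream that contain every item of target."""
    return sum(1 for itemset in stream if target <= itemset)


def freq(target: Itemset, stream: Stream) -> float:
    """Fraction of itemsets in the stream that contain target (0.0 if empty)."""
    return count(target, stream) / len(stream) if stream else 0.0


def window(stream: Stream, s: int, t: int) -> Stream:
    """Sub-stream from position s to position t, both 1-based and inclusive."""
    return stream[s - 1:t]


def last(k: int, stream: Stream) -> Stream:
    """Sub-stream made of the last k itemsets of the stream."""
    if not 0 <= k <= len(stream):
        raise ValueError("k must be between 0 and the stream length")
    return stream[len(stream) - k:]


stream = [frozenset("ab"), frozenset("b"), frozenset("a"), frozenset("ac")]
a = frozenset("a")
print(count(a, stream))           # 3
print(freq(a, last(2, stream)))   # 1.0
```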


PROBLEM STATEMENT * STREAMS AND MAX-FREQUENCY

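Definition 1. Given a minimal window size mwl, the max-frequency $\mathit{mfreq}(I, S, \mathit{mwl})$ of itemset $I$ in a stream $S$ is defined as the maximum of the frequencies of $I$ over all windows, of size at least mwl, extending from the end of the stream; that is

$$\mathit{mfreq}(I, S, \mathit{mwl}) := \max_{k = \mathit{mwl}, \ldots, |S|} \mathit{freq}(I, \mathit{last}(k, S))$$

where $\mathit{freq}(I, \mathit{last}(k, S))$ is the fraction of the last $k$ itemsets of $S$ that contain $I$

If the length of the stream is less than mwl, the max-frequency is defined to be 0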
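Definition 1 can be evaluated directly by scanning every window that ends at the most recent itemset. The sketch below is a brute-force reference that reads the whole stream, not the summary-based method of the later slides; the function name `max_frequency` and the list-of-`frozenset` stream representation are assumptions made for illustration.

```python
from fractions import Fraction
from typing import FrozenSet, Sequence


def max_frequency(target: FrozenSet[str],
                  stream: Sequence[FrozenSet[str]],
                  mwl: int = 1) -> Fraction:
    """Maximum frequency of target over all suffix windows of size >= mwl."""
    if mwl < 1:
        raise ValueError("mwl must be at least 1")
    n = len(stream)
    if n < mwl:
        return Fraction(0)  # stream shorter than the minimal window
    best = Fraction(0)
    occurrences = 0
    # Grow the window backwards from the end, one itemset at a time.
    for k in range(1, n + 1):
        if target <= stream[n - k]:
            occurrences += 1
        if k >= mwl:
            best = max(best, Fraction(occurrences, k))
    return best


stream = [frozenset(x) for x in ("a", "b", "a", "a", "b")]
print(max_frequency(frozenset("a"), stream))         # 2/3
print(max_frequency(frozenset("a"), stream, mwl=4))  # 3/5
```

Exact fractions are used so that frequencies of different windows can be compared without rounding effects.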


PROBLEM STATEMENT * STREAMS AND MAX-FREQUENCY

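Definition 1. (cont.) The longest window in which the maximum frequency is reached is called the maximal window for $I$ in $S$, and its starting point is denoted $\mathit{startmax}(I, S, \mathit{mwl})$

That is, $\mathit{startmax}(I, S, \mathit{mwl})$ is the smallest index $s$ such that $\mathit{mfreq}(I, S, \mathit{mwl}) = \mathit{freq}(I, S[s, |S|])$, the frequency of $I$ in the window from position $s$ to the end of the stream

mwl will be omitted when clear from the context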
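The starting point of the maximal window can be found with the same kind of backward scan. In this hedged sketch the function name `start_of_maximal_window` is hypothetical; ties are resolved in favour of the longest window, which corresponds to the smallest starting index in the definition.

```python
from fractions import Fraction
from typing import FrozenSet, Optional, Sequence, Tuple


def start_of_maximal_window(
        target: FrozenSet[str],
        stream: Sequence[FrozenSet[str]],
        mwl: int = 1) -> Optional[Tuple[int, Fraction]]:
    """Return (1-based start, frequency) of the maximal window, or None
    when the stream is shorter than mwl."""
    n = len(stream)
    if n < mwl:
        return None
    best_start, best_freq = None, Fraction(-1)
    occurrences = 0
    for k in range(1, n + 1):
        if target <= stream[n - k]:
            occurrences += 1
        if k >= mwl:
            f = Fraction(occurrences, k)
            if f >= best_freq:  # ">=" keeps the longest window on ties
                best_start, best_freq = n - k + 1, f
    return best_start, best_freq


a = frozenset("a")
bursty = [frozenset(x) for x in ("a", "b", "a", "a", "b")]
tied = [frozenset(x) for x in ("a", "b", "a", "b")]
print(start_of_maximal_window(a, bursty))  # start 3, frequency 2/3
print(start_of_maximal_window(a, tied))    # start 1, frequency 1/2
```

In the second stream the windows of length 2 and length 4 both have frequency 1/2, so the longer one, starting at position 1, is reported.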


PROPERTIES OF MAX-FREQUENCY


ALGORITHM * THE SUMMARY

Let $p_1 < p_2 < \ldots < p_r$ be the borders for itemset A in the stream, ordered from oldest to most recent

Let $a_i$ be the number of occurrences of the target itemset A between two subsequent border positions $p_i$ and $p_{i+1}$ (for $i = 1, \ldots, r-1$). $a_r$ denotes the number of occurrences of A since the last border

The summary $S_t$ of the stream at time point $t$ is defined as the array $[(p_1, a_1), (p_2, a_2), \ldots, (p_r, a_r)]$
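A summary of this shape can be kept up to date with a small stack of (position, count) pairs. The sketch below is one possible way to do it and is not claimed to be the authors' update procedure: each new itemset starts a block of its own, and the newest block is merged into the previous one whenever the previous block has an equal or higher frequency, so the surviving block starts play the role of borders. The class name `BorderSummary` and its methods are hypothetical, and the stream is reduced to one flag per itemset saying whether it contains the target.

```python
from fractions import Fraction
from typing import List, Tuple


class BorderSummary:
    """Array of (p_i, a_i) pairs for one target itemset, without mwl."""

    def __init__(self) -> None:
        self.t = 0                          # number of itemsets seen so far
        self.blocks: List[List[int]] = []   # each entry is [p_i, a_i]

    def _block_freq(self, i: int) -> Fraction:
        """Frequency of the target inside block i."""
        p, a = self.blocks[i]
        end = self.blocks[i + 1][0] - 1 if i + 1 < len(self.blocks) else self.t
        return Fraction(a, end - p + 1)

    def update(self, contains_target: bool) -> None:
        """Process the next itemset of the stream."""
        self.t += 1
        self.blocks.append([self.t, int(contains_target)])
        # Restore strictly increasing block frequencies by merging.
        while (len(self.blocks) >= 2 and
               self._block_freq(len(self.blocks) - 2)
               >= self._block_freq(len(self.blocks) - 1)):
            _, a = self.blocks.pop()
            self.blocks[-1][1] += a

    def max_frequency(self) -> Fraction:
        """Frequency of the window that starts at the last border."""
        if not self.blocks:
            return Fraction(0)
        return self._block_freq(len(self.blocks) - 1)

    def summary(self) -> List[Tuple[int, int]]:
        return [(p, a) for p, a in self.blocks]


s = BorderSummary()
for flag in [True, False, True, True, False]:
    s.update(flag)
print(s.summary())        # [(1, 1), (3, 2)]
print(s.max_frequency())  # 2/3
```

The memory used is proportional to the number of surviving blocks rather than to the length of the stream. Unlike a border in the strict sense, the first block of this sketch may start at an itemset that does not contain the target.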


ALGORITHM * THE SUMMARY

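We can easily compute the frequencies of itemset A for any of the border positions from this summary: at time point $t$, the window starting at border $p_i$ has frequency

$$\frac{a_i + a_{i+1} + \cdots + a_r}{t - p_i + 1}$$

where $a_j$ is the number of occurrences of A recorded in the summary for border $p_j$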
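The computation described here amounts to a running sum over the summary from the most recent border backwards. A minimal sketch, assuming the summary is a list of `(p_i, a_i)` pairs and `t` is the current stream length; the function name `border_frequencies` is hypothetical.

```python
from fractions import Fraction
from typing import List, Sequence, Tuple


def border_frequencies(summary: Sequence[Tuple[int, int]],
                       t: int) -> List[Tuple[int, Fraction]]:
    """Frequency of the target in the window [p_i, t] for every border p_i."""
    result = []
    occurrences = 0
    # Walk from the newest border to the oldest, accumulating the counts.
    for p, a in reversed(summary):
        occurrences += a
        result.append((p, Fraction(occurrences, t - p + 1)))
    result.reverse()
    return result


# Summary of a stream of length 5 with borders at positions 1 and 3.
for p, f in border_frequencies([(1, 1), (3, 2)], t=5):
    print(p, f)   # prints "1 3/5" and then "3 2/3"
```

In this example the window starting at the most recent border has the highest frequency.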


ALGORITHM * THE SUMMARY

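The fractions of occurrences of A in the blocks between two subsequent border positions are increasing from the oldest block to the most recent one; as a consequence, among all borders $p_i$, the frequency of A in the window from $p_i$ to the end of the stream is maximal for $i$ equal to $r$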


ALGORITHM * MINIMAL FREQUENCY

Until now, we assumed that for the target itemset we need to be able to report its frequency exactly. We will now relax this requirement by setting a minimal frequency threshold minfreq

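Let $S$ be a stream with summary $[(p_1, a_1), \ldots, (p_r, a_r)]$, and suppose that the frequency of the target itemset in the first block, $a_1 / (p_2 - p_1)$, is below minfreq

Then we can remove $(p_1, a_1)$ from the left side of the summary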
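The pruning step can be sketched as follows. The exact condition did not survive in this transcript, so the code assumes that the oldest entry is dropped when the target's frequency inside the first block, a1 / (p2 - p1), is below minfreq; both that condition and the function name `prune_infrequent_prefix` are assumptions made for illustration.

```python
from fractions import Fraction
from typing import List, Sequence, Tuple, Union


def prune_infrequent_prefix(
        summary: Sequence[Tuple[int, int]],
        minfreq: Union[Fraction, float]) -> List[Tuple[int, int]]:
    """Drop leading (p_i, a_i) entries whose block frequency is below minfreq.

    ASSUMPTION: the pruning condition is a_1 / (p_2 - p_1) < minfreq.
    The most recent entry is always kept.
    """
    pruned = list(summary)
    while len(pruned) >= 2:
        (p1, a1), (p2, _) = pruned[0], pruned[1]
        if Fraction(a1, p2 - p1) < minfreq:
            pruned.pop(0)   # the window starting at p1 cannot be both
        else:               # maximal and at least minfreq
            break
    return pruned


summary = [(1, 1), (11, 5), (16, 4)]
print(prune_infrequent_prefix(summary, Fraction(3, 10)))  # [(11, 5), (16, 4)]
```

The intuition behind the assumed condition: if the whole window starting at p1 reached minfreq while its first block stayed below minfreq, the window starting at p2 would have a strictly higher frequency, so p1 would not be the start of the maximal window.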


ALGORITHM * MINIMAL WINDOW LENGTH

In the algorithm without minimal window length, a border $q$ in stream $S$ can be pruned if we can find two blocks $S[p, q-1]$ and $S[q, r]$ such that the frequency of the target in $S[p, q-1]$ is higher than in $S[q, r]$

When we are working with a minimal window length, it could be the case that the suffix of the stream starting at $r + 1$ does not meet the minimal window length requirement

In that case, even though the window starting at $q$ has lower frequency than the window starting at $r + 1$, it can still have the highest frequency of all windows that meet the minimal window requirement!


ALGORITHM * MINIMAL WINDOW LENGTH

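In order to know the maximal frequency with a minimal window length mwl, it suffices to apply the method without any minimal window length to keep track of the borders for the stream up to time point $t - \mathit{mwl}$, that is, the stream without its last mwl itemsets

Then, when we need the max-frequency, we check the frequencies of these borders in the complete stream up to time point $t$, and the frequency in the minimal window itself, which consists of the last mwl itemsets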
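One way to realise this two-part scheme is to buffer the occurrence flags of the last mwl itemsets and feed only the itemsets that leave the buffer into a border summary. The sketch below makes several assumptions of its own: the class name `MinWindowMaxFrequency`, the merge-based summary update, and the reduction of the stream to one flag per itemset are illustrative and are not taken from the paper.

```python
from collections import deque
from fractions import Fraction


class MinWindowMaxFrequency:
    """Max-frequency of one target itemset with a minimal window length."""

    def __init__(self, mwl: int) -> None:
        if mwl < 1:
            raise ValueError("mwl must be at least 1")
        self.mwl = mwl
        self.recent = deque()   # flags of the last mwl itemsets
        self.recent_count = 0   # occurrences inside the minimal window
        self.old_len = 0        # itemsets already moved into the summary
        self.blocks = []        # [p_i, a_i] for the stream minus the buffer

    def _push_old(self, flag: bool) -> None:
        """Add one itemset to the summary of the older part of the stream."""
        self.old_len += 1
        self.blocks.append([self.old_len, int(flag)])
        while len(self.blocks) >= 2:
            (p1, a1), (p2, a2) = self.blocks[-2], self.blocks[-1]
            # Merge when the older block is at least as frequent as the newer
            # one (cross-multiplied comparison of a1/len1 and a2/len2).
            if a1 * (self.old_len - p2 + 1) >= a2 * (p2 - p1):
                self.blocks.pop()
                self.blocks[-1][1] += a2
            else:
                break

    def update(self, contains_target: bool) -> None:
        self.recent.append(contains_target)
        self.recent_count += int(contains_target)
        if len(self.recent) > self.mwl:
            oldest = self.recent.popleft()
            self.recent_count -= int(oldest)
            self._push_old(oldest)

    def max_frequency(self) -> Fraction:
        if len(self.recent) < self.mwl:
            return Fraction(0)  # stream shorter than mwl
        t = self.old_len + self.mwl
        best = Fraction(self.recent_count, self.mwl)  # minimal window itself
        occurrences = self.recent_count
        for p, a in reversed(self.blocks):  # every border of the older part
            occurrences += a
            best = max(best, Fraction(occurrences, t - p + 1))
        return best


m = MinWindowMaxFrequency(mwl=4)
for flag in [True, False, True, True, False]:
    m.update(flag)
print(m.max_frequency())  # 3/5
```

All borders of the older part have to be checked here, because with a minimal window length the most recent border no longer has to be the best one.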


ALGORITHM * MINING ALL ITEMSETS

We do not need to maintain the summaries of all itemsets, but only those that were once frequent in the minimal window, and that are, at the same time, frequent now within the part of the stream

Furthermore, we need to find the frequent itemsets in the mwl windows
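Finding the itemsets that are frequent in the current minimal window decides which summaries are worth maintaining. The sketch below uses naive subset enumeration purely for illustration; it is not the mining procedure of the paper, and the function name `frequent_in_minimal_window` and the `max_size` cap are assumptions.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, FrozenSet, Sequence


def frequent_in_minimal_window(
        stream: Sequence[FrozenSet[str]],
        mwl: int,
        minfreq: float,
        max_size: int = 3) -> Dict[FrozenSet[str], float]:
    """Itemsets of at most max_size items whose frequency in the last mwl
    itemsets of the stream is at least minfreq."""
    if mwl < 1 or len(stream) < mwl:
        return {}
    minimal_window = stream[len(stream) - mwl:]
    counts: Counter = Counter()
    for itemset in minimal_window:
        items = sorted(itemset)
        for size in range(1, max_size + 1):
            for subset in combinations(items, size):
                counts[frozenset(subset)] += 1
    return {s: c / mwl for s, c in counts.items() if c / mwl >= minfreq}


stream = [frozenset("ab"), frozenset("ac"), frozenset("ab"), frozenset("bc")]
frequent = frequent_in_minimal_window(stream, mwl=4, minfreq=0.5)
for itemset in sorted(frequent, key=sorted):
    print(sorted(itemset), frequent[itemset])
```

A summary would then be created or kept only for the itemsets returned here, in line with the remark that summaries of all itemsets do not need to be maintained.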


EXPERIMENTS


CONCLUSION

We presented a new frequency measure for itemsets in streams that does not rely on a fixed window length or a time-decaying factor

An experimental evaluation supported the claim that the new measure can be computed from a summary that has extremely small memory requirements and can be maintained and updated efficiently

The summary of the stream consists of the borders and their corresponding frequencies
