Mining top-k frequent closed itemsets over data streams using the sliding window model

Author: Pauray S.M. Tsai
Publication: ESA 2010
Presenter: Yuan-Chung Chang

Outline

• Introduction
• Motivation
• Mining top-k frequent closed itemsets
• FCI_max algorithm
• Example for FCI_max algorithm
• Conclusion

Introduction

With the emergence of new applications, the data we process are no longer static but arrive as continuous, dynamic data streams.

Because the data in streams arrive at high speed and are continuous and unbounded, data stream mining faces three challenges. First, each item in a stream can be examined only once. Second, although the data are generated continuously, the memory space that can be used is limited. Third, the mining result should be generated as fast as possible.

Introduction (cont.)

In the database community, one of the major applications is mining association rules in large transaction databases.

Traditional association rule mining has two problems. First, the user must specify a minimum support. Second, mining usually generates a large number of association rules, which makes the result hard to use in practical applications.

Introduction (cont.)

In the data stream environment, the problem of mining frequent itemsets becomes more complicated.

Traditional algorithms for mining frequent itemsets cannot satisfy the requirement of examining each item in a stream only once. How to maintain frequent itemsets effectively over data streams is another important issue.

Because data are generated continuously in data streams, itemsets that are currently frequent may become infrequent, and itemsets that are currently infrequent may become frequent.

Because memory space is restricted, we cannot keep all the itemsets and their related information in memory.

Introduction (cont.)

The time models for data stream mining mainly include the landmark model (2002), the tilted-time window model (2003) and the sliding window model (2006).
• The landmark model considers all the data from a specified point of time to the current time.
• The tilted-time window model is a variation of the landmark model.
• The sliding window model focuses on the recent data, from the current moment back to a specified time point.
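
As a rough illustration of how the landmark and sliding window models select data, here is a minimal Python sketch. The function names, the (timestamp, transaction) stream layout and the toy data are assumptions made for this example; they do not come from the paper.

```python
def landmark_view(stream, landmark, now):
    """Landmark model: everything from a fixed point in time up to now."""
    return [tx for ts, tx in stream if landmark <= ts <= now]


def sliding_view(stream, now, span):
    """Sliding window model: only the most recent `span` time units."""
    return [tx for ts, tx in stream if now - span < ts <= now]


# Hypothetical stream of (timestamp in minutes, transaction) pairs.
stream = [(1, "ab"), (6, "ac"), (12, "abc"), (18, "ad")]

landmark_result = landmark_view(stream, landmark=5, now=18)
sliding_result = sliding_view(stream, now=18, span=10)
```

As `now` advances, the landmark view keeps growing, while the sliding view drops transactions older than `span`.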

Motivation

The two problems of traditional association rule mining also exist in the data stream environment: specifying an appropriate minimum support and reducing the number of frequent itemsets.

The idea of mining frequent closed itemsets, that is, frequent itemsets that have no proper superset with the same support, was first proposed in 1999.

Motivation (cont.)

An alternative approach, mining top-k frequent closed itemsets of length no less than min_l without specifying the minimum support, was proposed in 2005. Its mining result presents only frequent closed itemsets of length no less than min_l, so information about closed itemsets with high support but short length is lost.

In fact, the longer a closed itemset is, the smaller its support tends to be, since every proper superset of a closed itemset has strictly lower support.

In this paper, the author proposes an efficient single-pass algorithm, FCI_max, to discover top-k frequent closed itemsets of length no more than max_l, using a sliding window technique.

Motivation (cont.)

For mining top-k frequent closed itemsets of length no less than min_l (2005):

Case 1: Mining top-3 frequent closed itemsets with min_l = 2.
• The mining result is {ab:7, abc:6, ad:4}.
• Not reported because it is shorter than min_l: {a:8}.

Case 2: Mining top-3 frequent closed itemsets with min_l = 3.
• The mining result is {abc:6, abcd:3, abe:2, ace:2}.
• Not reported because they are shorter than min_l: {a:8}, {ab:7}, {ad:4}.

Motivation (cont.)

For mining top-k frequent closed itemsets of length no more than max_l (2010):

Case 3: Mining top-4 frequent closed itemsets with max_l = 3.
• The mining result is {a:8, ab:7, abc:6, ad:4}.

Case 4: Mining top-4 frequent closed itemsets with max_l = 2.
• The mining result is {a:8, ab:7, ad:4, ae:3}.
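
The effect of the max_l constraint can be reproduced on a small scale with a brute-force sketch. This is not the FCI_max algorithm: it enumerates every itemset of a hypothetical six-transaction database (not the data behind the cases above), keeps the closed ones, and reports the k with the highest support among those of length at most max_l. The function names are made up for the example, and ties at the k-th position are simply truncated, which is an assumption of the sketch.

```python
from itertools import combinations


def closed_itemsets(transactions):
    """Return {itemset: support} for every closed itemset (brute force)."""
    txs = [frozenset(t) for t in transactions]
    items = sorted(set().union(*txs))
    support = {}
    for r in range(1, len(items) + 1):
        for combo in combinations(items, r):
            s = frozenset(combo)
            count = sum(1 for t in txs if s <= t)
            if count:
                support[s] = count
    # An itemset is closed if no proper superset has the same support.
    return {s: c for s, c in support.items()
            if not any(s < other and c == support[other] for other in support)}


def top_k_closed(transactions, k, max_l):
    """Top-k closed itemsets of length <= max_l, highest support first."""
    closed = closed_itemsets(transactions)
    short = [(s, c) for s, c in closed.items() if len(s) <= max_l]
    short.sort(key=lambda sc: (-sc[1], len(sc[0]), sorted(sc[0])))
    return [("".join(sorted(s)), c) for s, c in short[:k]]


# Hypothetical database; each string is one transaction of single-letter items.
db = ["abc", "abc", "ab", "ab", "a", "ad"]

with_max_2 = top_k_closed(db, k=3, max_l=2)
with_max_3 = top_k_closed(db, k=3, max_l=3)
```

On this toy database the closed itemsets are a:6, ab:4, abc:2 and ad:1, so lowering max_l from 3 to 2 swaps abc for ad while the short, high-support itemset a stays in the result.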

Mining top-k frequent closed itemsets

The author uses the sliding window model shown in Fig. 1 for the following discussion.

• The number of windows: n
• The time covered by each window: t
• Items in window: {x1, x2, ..., xm}
• The sliding windows: {Wi1, Wi2, ..., Win}
• The set of identifiers of transactions containing itemset {x1, x2, ..., xm} in window Wij: SPij({x1, x2, ..., xm})
• The union of SPij({x1, x2, ..., xm}) over the windows Wi1, ..., Win: CSi({x1, x2, ..., xm})
• The number of transaction identifiers in CSi({x1, x2, ..., xm}): |CSi({x1, x2, ..., xm})|
• The top-k 1-itemsets by |CSi|: {S1, S2, ..., Sk}
• The current top-k frequent closed itemsets are denoted as a set: P
• The initial value of P is set to {S1, S2, ..., Sk}
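
The notation above can be made concrete with a small sketch. The helper names `sp`, `cs` and `initial_p`, the dictionary layout of a window (transaction identifier mapped to a string of single-letter items) and the toy data are all assumptions made for illustration; the paper's own data structures may differ.

```python
def sp(window, itemset):
    """SP: ids of the transactions in one window that contain `itemset`."""
    need = set(itemset)
    return {tid for tid, tx in window.items() if need <= set(tx)}


def cs(windows, itemset):
    """CS: union of SP over all windows of the current sliding window."""
    ids = set()
    for w in windows:
        ids |= sp(w, itemset)
    return ids


def initial_p(windows, k):
    """Initial P: the k single items with the largest |CS| (ties by name)."""
    items = sorted({x for w in windows for tx in w.values() for x in tx})
    ranked = sorted(items, key=lambda x: (-len(cs(windows, x)), x))
    return ranked[:k]


# Hypothetical sliding window made of n = 3 windows.
windows = [
    {1: "ab", 2: "ac"},
    {3: "abc"},
    {4: "b", 5: "ad"},
]

sp_a_first = sp(windows[0], "a")   # ids containing item a in the first window
cs_ab = cs(windows, "ab")          # ids containing both a and b in any window
p0 = initial_p(windows, k=2)
```

Here |CS| of an itemset plays the role of its support count over the whole sliding window.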

FCI_max algorithm

The detailed algorithm for mining top-k frequent closed itemsets of length no more than max_l.

Example for FCI_max algorithm

Assume the number of windows is 4 and each window covers 5 minutes (n = 4, t = 5 minutes).

Assume the number of frequent closed itemsets to be found is 5 and their maximum length is 4 (k = 5, max_l = 4).

Conclusion

In this paper, the author proposes an efficient single-pass algorithm, FCI_max, to discover top-k frequent closed itemsets of length no more than max_l.

Using the maximum length in place of the minimum support resolves the problem of losing information about itemsets with short length but high support.

The FCI_max algorithm does not need to store the support counts of all itemsets at each time point.

It uses a dynamic computation technique to generate the frequent closed itemsets and their related information, and thereby discovers top-k frequent closed itemsets efficiently in the data stream environment.

Thank you for listening

Q & A
