Efficient Algorithms for Incremental Update of Frequent Sequences

Minghua ZHANG

Dec. 7, 2001

Content

– Introduction
– Problem Definition
– Related Works
– Incremental Update Algorithms
– Performance
– Conclusion

Introduction

Sequences arise in many areas of everyday life.
– An on-line bookstore: customers' buying sequences
– A web site: web-log sequences

Knowledge of frequent sequences is useful.

Several algorithms have been proposed, such as AprioriAll, GSP, SPADE, MFS and PrefixSpan.

These algorithms assume the database is static.

In practice, the content of a sequence database changes continually.

Problem Definition

Item
– $I = \{i_1, i_2, \ldots, i_M\}$: a set of literals called items.

Transaction (or Itemset)
– Transaction $t$: a set of items such that $t \subseteq I$.

Sequence
– Sequence $s = \langle t_1, t_2, \ldots, t_n \rangle$: an ordered list of transactions.
– The length of $s$ (represented by $|s|$) is defined as the number of items contained in $s$. E.g. if $s = \langle \{1\}, \{2,3\}, \{1,4\} \rangle$, then $|s| = 5$.
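As a small illustration of these definitions, a sequence can be modelled as a list of item sets. The representation (a Python list of `frozenset` objects) and the helper name `seq_length` are assumptions made for this sketch, not something the slides prescribe.

```python
def seq_length(seq):
    """Length |s|: the total number of items over all transactions of s."""
    return sum(len(transaction) for transaction in seq)


# The example sequence <{1},{2,3},{1,4}> from the definition.
s = [frozenset({1}), frozenset({2, 3}), frozenset({1, 4})]
print(seq_length(s))  # 5
```

Note that item 1 is counted twice because it occurs in two different transactions.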

Problem Definition

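Subsequence
– Let $s_1 = \langle a_1, a_2, \ldots, a_m \rangle$ and $s_2 = \langle b_1, b_2, \ldots, b_n \rangle$.
– If there exist integers $j_1, j_2, \ldots, j_n$ such that $1 \le j_1 < j_2 < \cdots < j_n \le m$ and $b_1 \subseteq a_{j_1}, b_2 \subseteq a_{j_2}, \ldots, b_n \subseteq a_{j_n}$, then $s_2$ is a subsequence of $s_1$, or $s_1$ contains $s_2$ (represented by $s_2 \sqsubseteq s_1$).
– Example: if $s_1 = \langle \{1\}, \{2,3\}, \{4,5\} \rangle$ and $s_2 = \langle \{2\}, \{5\} \rangle$, then $s_2 \sqsubseteq s_1$.

Maximal Sequence
– Given a sequence set $V$, a sequence $s$ in $V$ is maximal if $s$ is not a subsequence of any other sequence in $V$.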
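The containment test and the maximality filter can be sketched as follows. This is a minimal illustration, assuming sequences are tuples of `frozenset` objects; the function names `contains` and `maximal` are hypothetical. The test matches each itemset of the shorter sequence to the earliest possible itemset of the longer one, which is sufficient for this definition.

```python
def contains(s1, s2):
    """Return True if s2 is a subsequence of s1.

    Greedy matching: each itemset of s2 is matched to the earliest
    remaining itemset of s1 that is a superset of it.
    """
    remaining = iter(s1)  # shared iterator, so matches respect the order
    return all(any(b <= a for a in remaining) for b in s2)


def maximal(sequences):
    """Keep the sequences that are not contained in any other sequence."""
    return [s for s in sequences
            if not any(s != t and contains(t, s) for t in sequences)]


s1 = (frozenset({1}), frozenset({2, 3}), frozenset({4, 5}))
s2 = (frozenset({2}), frozenset({5}))
print(contains(s1, s2))           # True
print(contains(s2, s1))           # False
print(maximal([s1, s2]) == [s1])  # True
```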

Problem Definition

Given a sequence database $D$ and a sequence $s$
– support count: the number of sequences in $D$ that contain $s$.
– support: the fraction of sequences in $D$ that contain $s$.
– frequent: $s$ is frequent if its support is no less than a threshold $\rho_s$.

Mining Frequent Sequences
– Inputs:
  a database $D$ of sequences
  a user-specified minimum support threshold $\rho_s$ (e.g. $\rho_s = 1\%$)
– Output: maximal frequent sequences
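The three notions (support count, support, frequent) can be illustrated on a toy database. This is a sketch under assumptions: sequences are tuples of `frozenset` objects, and `contains`, `support_count`, `is_frequent` and `min_support` are names chosen here for illustration.

```python
def contains(seq, pattern):
    """True if pattern is a subsequence of seq (greedy left-to-right match)."""
    remaining = iter(seq)
    return all(any(p <= t for t in remaining) for p in pattern)


def support_count(db, pattern):
    """Number of sequences in db that contain pattern."""
    return sum(1 for seq in db if contains(seq, pattern))


def is_frequent(db, pattern, min_support):
    """True if the support of pattern in db is at least min_support."""
    return support_count(db, pattern) >= min_support * len(db)


db = [
    (frozenset({1}), frozenset({2, 3}), frozenset({4, 5})),
    (frozenset({2}), frozenset({5})),
    (frozenset({1}), frozenset({4})),
    (frozenset({3}),),
]
pattern = (frozenset({2}), frozenset({5}))
print(support_count(db, pattern))            # 2
print(support_count(db, pattern) / len(db))  # 0.5
print(is_frequent(db, pattern, 0.5))         # True
```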

Problem Definition

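Database update
– The old database $D$ is updated to $D'$ by removing the sequences in $\Delta^-$ and adding the sequences in $\Delta^+$; $D^-$ denotes the unchanged part.

Incremental Update
– Inputs: $\Delta^-$, $D^-$, $\Delta^+$, $\rho_s$, and the frequent sequences in $D$ with their supports
– Output: maximal frequent sequences in $D'$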

Problem Definition

Notations

Related Works--GSP

GSP was proposed by Srikant and Agrawal (EDBT '96).

Related Works--GSP

Candidate Generation Function GGen()
– Input: $L_i$ (the set of frequent sequences of length $i$)
– Output: $C_{i+1}$ (the set of candidate sequences of length $i+1$)
– Join: for each pair of sequences $s_1, s_2 \in L_i$, if the sequence obtained by deleting the first item of $s_1$ equals the sequence obtained by deleting the last item of $s_2$ (or vice versa), a candidate sequence is generated and inserted into $C_{i+1}$.
  E.g. if $s_1 = \langle \{1,2,3\} \rangle$ and $s_2 = \langle \{2,3\}, \{4\} \rangle$ (common part $s' = \langle \{2,3\} \rangle$), then $c_1 = \langle \{1,2,3\}, \{4\} \rangle$ is generated.
– Prune: if a sequence $s$ in $C_{i+1}$ has an infrequent subsequence, delete $s$ from $C_{i+1}$.
  Reason: if a sequence is frequent, then all of its subsequences must be frequent.
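The join and prune steps can be sketched in a few lines. This is an illustrative simplification, not the GSP implementation. It assumes sequences are tuples of sorted item tuples, and the names `drop_first`, `drop_last`, `join`, `one_item_deletions` and `ggen` are hypothetical. It leaves out the special case of joining two length-1 sequences, where a single-itemset candidate can also be formed, and its prune step only checks the subsequences obtained by deleting one item.

```python
def drop_first(seq):
    """Sequence obtained by deleting the first item of seq."""
    head = seq[0][1:]
    return ((head,) if head else ()) + seq[1:]


def drop_last(seq):
    """Sequence obtained by deleting the last item of seq."""
    tail = seq[-1][:-1]
    return seq[:-1] + ((tail,) if tail else ())


def join(s1, s2):
    """Candidate produced by joining s1 with s2, or None if they do not join."""
    if drop_first(s1) != drop_last(s2):
        return None
    last = s2[-1][-1]
    if len(s2[-1]) == 1:            # last item is its own itemset in s2
        return s1 + ((last,),)
    return s1[:-1] + (s1[-1] + (last,),)  # last item joins s1's last itemset


def one_item_deletions(seq):
    """All subsequences obtained by deleting exactly one item of seq."""
    for i, itemset in enumerate(seq):
        for j in range(len(itemset)):
            rest = itemset[:j] + itemset[j + 1:]
            yield seq[:i] + ((rest,) if rest else ()) + seq[i + 1:]


def ggen(frequent):
    """Join every ordered pair, then prune candidates with an infrequent subsequence."""
    candidates = set()
    for s1 in frequent:
        for s2 in frequent:
            c = join(s1, s2)
            if c is not None and all(sub in frequent for sub in one_item_deletions(c)):
                candidates.add(c)
    return candidates


L3 = {((1, 2, 3),), ((2, 3), (4,)), ((1, 3), (4,)), ((1, 2), (4,))}
print(ggen(L3))  # {((1, 2, 3), (4,))}
```

Iterating over ordered pairs covers the "or vice versa" case of the join condition.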

Related Works--MFS

The I/O cost of GSP is high in some cases.

MFS (IC-AI 2001) tries to reduce the I/O cost needed by GSP.
– Make use of a suggested frequent sequence set $S_{est}$, obtained from
  mining a sample of the database using GSP
  the results of a previous mining run
– Generalize the candidate generation function of GSP
  Its input: frequent sequences of various lengths
  Its output: candidate sequences of various lengths
– Longer sequences can be generated and counted early, so MFS reduces I/O cost.

Related Works--MFS

MFS algorithm

Incremental Update Algorithms

It is inefficient to apply GSP or MFS to mine the new database from scratch.
– Information available: frequent sequences in $D$ and their supports

Basic Idea:
– If a sequence $s$ is frequent in $D$, then its support count in $D'$ can be deduced by scanning $\Delta^-$ and $\Delta^+$, without scanning $D^-$.
– If a sequence $s$ is infrequent in $D$, then it cannot be frequent in $D'$ unless
  its support count in $\Delta^+$ is large enough
  its support count in $\Delta^-$ is small enough
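The first point can be checked on a toy example: the count in the new database equals the old count, minus the count in the deleted part, plus the count in the inserted part. The sketch below assumes sequences are tuples of `frozenset` objects; `updated_count` and the variable names are hypothetical.

```python
def contains(seq, pattern):
    """True if pattern is a subsequence of seq (greedy left-to-right match)."""
    remaining = iter(seq)
    return all(any(p <= t for t in remaining) for p in pattern)


def support_count(db, pattern):
    """Number of sequences in db that contain pattern."""
    return sum(1 for seq in db if contains(seq, pattern))


def updated_count(old_count, delta_minus, delta_plus, pattern):
    """Support count in the new database, without touching the unchanged part."""
    return (old_count
            - support_count(delta_minus, pattern)
            + support_count(delta_plus, pattern))


a = (frozenset({1}), frozenset({2}))
b = (frozenset({1}), frozenset({3}))
c = (frozenset({1, 4}), frozenset({2, 5}))
d = (frozenset({2}), frozenset({1}))

unchanged = [a, b]        # plays the role of the unchanged part
delta_minus = [c]         # deleted sequences
delta_plus = [d, a]       # inserted sequences
old_db = unchanged + delta_minus
new_db = unchanged + delta_plus

pattern = (frozenset({1}), frozenset({2}))
old_count = support_count(old_db, pattern)
new_count = updated_count(old_count, delta_minus, delta_plus, pattern)
print(old_count, new_count)                         # 2 2
print(new_count == support_count(new_db, pattern))  # True
```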

Incremental Update Algorithms

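Mathematical formulae (writing $\delta_X^s$ for the support count of $s$ in a database $X$):
– $\delta_{D'}^{s} = \delta_{D}^{s} - \delta_{\Delta^-}^{s} + \delta_{\Delta^+}^{s}$
– Define $b_X^s = \min_{s'} \delta_X^{s'}$ ($s'$ is a subsequence of $s$), then $b_X^s$ is an upper bound of $\delta_X^s$.
– Lemma 1: For a sequence $s$ to be frequent in $D'$, the following formula must be true: (formula not preserved in this transcript)
– Lemma 2: If a sequence $s$ is infrequent in $D$ but frequent in $D'$, the following formula must be true: (formula not preserved in this transcript)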
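Because the lemma formulas did not survive extraction, the following is a hedged derivation of necessary conditions of the kind the two lemmas describe, not the slide's own wording. It assumes the notation $\delta_X^s$ for the support count of $s$ in database $X$, $b_X^s$ for the minimum support count in $X$ over the subsequences of $s$, and $\rho_s$ for the support threshold; the original lemmas may be stated in a different form.

```latex
\begin{align*}
\delta_{D'}^{s} &= \delta_{D}^{s} - \delta_{\Delta^-}^{s} + \delta_{\Delta^+}^{s},
\qquad |D'| = |D| - |\Delta^-| + |\Delta^+|, \\
b_{X}^{s} &= \min_{s' \sqsubseteq s} \delta_{X}^{s'} \;\ge\; \delta_{X}^{s}.
\intertext{If $s$ is frequent in $D'$, i.e.\ $\delta_{D'}^{s} \ge \rho_s |D'|$, then, since $\delta_{\Delta^-}^{s} \ge 0$,}
\delta_{D}^{s} + b_{\Delta^+}^{s}
  &\ge \delta_{D}^{s} + \delta_{\Delta^+}^{s}
  \ge \delta_{D'}^{s}
  \ge \rho_s |D'|.
\intertext{If $s$ is infrequent in $D$ (so $\delta_{D}^{s} < \rho_s |D|$) but frequent in $D'$, then}
b_{\Delta^+}^{s}
  \ge \delta_{\Delta^+}^{s}
  &\ge \delta_{\Delta^+}^{s} - \delta_{\Delta^-}^{s}
  = \delta_{D'}^{s} - \delta_{D}^{s}
  > \rho_s |D'| - \rho_s |D|
  = \rho_s \bigl( |\Delta^+| - |\Delta^-| \bigr).
\end{align*}
```

The first condition needs $\delta_D^s$, which is known only for sequences that were frequent in $D$; the second needs only quantities computed from $\Delta^+$ and $\Delta^-$.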

Incremental Update Algorithms

Algorithms GSP+ and MFS+
– Their structures are similar to those of GSP and MFS.
– Difference: each time after generating candidates, use the two lemmas to delete some candidates, scanning $\Delta^-$ and/or $\Delta^+$ when necessary.
  For a sequence $s$ that is frequent in $D$, we know its support count in $D$, so apply Lemma 1.
  For a sequence $s$ that is infrequent in $D$, we do not know its support count in $D$, so apply Lemma 2.
– CPU saving is achieved by avoiding processing $D^-$ for some candidates.
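A candidate filter of this kind might look like the sketch below. It is not the GSP+ or MFS+ code. It assumes two necessary conditions that follow from the count identity (new count = old count − count in the deleted part + count in the inserted part): a candidate with a known old count can be frequent only if old count plus an upper bound on its count in the inserted part reaches the threshold, and a candidate with an unknown old count only if that upper bound exceeds the threshold times the net growth of the database. The function name `worth_counting` and all parameter names are hypothetical.

```python
def worth_counting(old_count, bound_plus, size_old, size_minus, size_plus, min_support):
    """Return False if the candidate cannot be frequent in the new database.

    old_count  : support count in the old database, or None if the
                 candidate was infrequent there (count unknown)
    bound_plus : an upper bound on the candidate's count in the inserted part
    size_*     : number of sequences in the old database, the deleted
                 part, and the inserted part
    """
    size_new = size_old - size_minus + size_plus
    if old_count is not None:
        # Known old count: even with no deletions hitting the candidate,
        # the new count is at most old_count + bound_plus.
        return old_count + bound_plus >= min_support * size_new
    # Unknown old count (below threshold): the inserted part must supply
    # more than the threshold's share of the net growth.
    return bound_plus > min_support * (size_plus - size_minus)


print(worth_counting(95, 10, 1000, 100, 100, 0.1))     # True
print(worth_counting(95, 2, 1000, 100, 100, 0.1))      # False
print(worth_counting(None, 3, 1000, 100, 200, 0.05))   # False
print(worth_counting(None, 8, 1000, 100, 200, 0.05))   # True
```

Candidates for which the function returns False can be dropped before the unchanged part of the database is scanned.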

Performance

Synthetic dataset
– Parameters of the dataset

Parameter   Description                                                            Value
|D|         Number of customers                                                    1,500,000
|C|         Average number of transactions per customer                           10
|T|         Average number of items per transaction                               2.5
|S|         Average number of itemsets in maximal potentially frequent sequences  4
|I|         Average size of itemsets in maximal potentially frequent sequences    1.25
N_S         Number of maximal potentially frequent sequences                      5,000
N_I         Number of maximal potentially frequent itemsets                       25,000
N           Number of items                                                       10,000

Performance

Comparison of four algorithms under different support thresholds

– $|D| = |D'| = 1{,}500{,}000$; $|\Delta^+| = |\Delta^-| = 150{,}000$ (10% of $|D|$); $\rho_s$ ranges from 0.35% to 0.65%

Performance

Comparison of four algorithms under different support thresholds

– GSP+ and MFS+ need less CPU time.

– GSP+ and MFS+ usually incur slightly more I/O cost because they process $\Delta^-$, which GSP and MFS do not need.

– MFS-based algorithms perform better, especially in I/O cost.
  They use the old frequent sequences as $S_{est}$.

– MFS+ is the overall winner in terms of both CPU and I/O costs.

$\rho_s$                              0.35%    0.4%     0.45%    0.5%    0.55%   0.6%    0.65%
Total no. of candidates               34,065   18,356   10,024   5,812   3,365   2,053   1,160
Candidates requiring a scan of $D^-$  13,042   7,161    3,966    2,313   1,353   867     509
Percentage (row 3 / row 2)            38%      39%      40%      40%     40%     42%     44%
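The percentage row can be recomputed from the two count rows. The short check below uses only the numbers in the table; the variable names are chosen here for illustration.

```python
total = [34065, 18356, 10024, 5812, 3365, 2053, 1160]
need_scan = [13042, 7161, 3966, 2313, 1353, 867, 509]

# Share of candidates that still require scanning the unchanged part.
percentages = [round(100 * n / t) for n, t in zip(need_scan, total)]
print(percentages)  # [38, 39, 40, 40, 40, 42, 44]
```

In other words, roughly 56% to 62% of the candidates are resolved without scanning the unchanged part of the database in this experiment.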

Performance

Varying $|\Delta^+|$ and $|\Delta^-|$
– $\rho_s = 0.5\%$, $|D| = |D'| = 1{,}500{,}000$
– $|\Delta^+| = |\Delta^-|$ varies from 1% to 40% of $|D|$

Performance

Varying $|\Delta^+|$ and $|\Delta^-|$
– The CPU costs of GSP and MFS stay relatively steady.
  GSP and MFS deal with $D'$ only, and $|D'|$ does not change.
– The CPU costs of GSP+ and MFS+ increase linearly with $|\Delta^+|$ and $|\Delta^-|$.
  GSP+ and MFS+ need more time to process $\Delta^+$ and $\Delta^-$.
– MFS+ is the most CPU-efficient algorithm when $|\Delta^+| = |\Delta^-|$ is less than 25% of $|D|$.

Conclusion

GSP+ and MFS+ outperform their non-incremental counterparts in CPU cost at the expense of a small penalty in I/O cost.

The MFS-based algorithms perform better than the GSP-based ones, particularly in I/O cost.

The performance gains of GSP+ and MFS+ are the most prominent when the changed part of the database is small compared with the unchanged part.

The End

?