1
Efficient Algorithms for Incremental Update of Frequent Sequences
Minghua ZHANG
Dec. 7, 2001
2
Content
Introduction
Problem Definition
Related Works
Incremental Update Algorithms
Performance
Conclusion
3
Introduction
In daily life, sequences arise in many areas.
– An on-line bookstore: customers' buying sequences
– A web site: web-log sequences
The knowledge of frequent sequences is useful.
Some algorithms have been proposed, such as AprioriAll, GSP, SPADE, MFS and PrefixSpan.
These algorithms assume the database is static; in practice, the content of a sequence database changes continually.
4
Problem Definition
Item
– I = {i1, i2, …, iM}: a set of literals called items.
Transaction (or Itemset)
– Transaction t: a set of items such that t ⊆ I.
Sequence
– Sequence s = <t1, t2, …, tn>: an ordered list of transactions.
– The length of s (denoted |s|) is the number of items contained in s. E.g., if s = <{1},{2,3},{1,4}>, then |s| = 5.
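The length definition can be sketched in a few lines of Python, assuming a sequence is encoded as a list of item sets (this encoding is our choice for illustration, not prescribed by the slides):

```python
# A sequence is modeled as a list of transactions, each a frozenset of items.
# This representation is an assumption made for illustration.

def seq_length(s):
    """|s|: the total number of items over all transactions of s."""
    return sum(len(t) for t in s)

s = [frozenset({1}), frozenset({2, 3}), frozenset({1, 4})]
print(seq_length(s))  # 5, matching the slide's example
```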
5
Problem Definition
Subsequence
– Let s1 = <a1, a2, …, am> and s2 = <b1, b2, …, bn>.
– If there exist integers j1, j2, …, jn with 1 ≤ j1 < j2 < … < jn ≤ m such that b1 ⊆ aj1, b2 ⊆ aj2, …, bn ⊆ ajn,
– then s2 is a subsequence of s1, and s1 contains s2 (denoted s2 ⊑ s1).
– Example: if s1 = <{1},{2,3},{4,5}> and s2 = <{2},{5}>, then s2 ⊑ s1.
Maximal Sequence
– Given a sequence set V, a sequence s in V is maximal if s is not a subsequence of any other sequence in V.
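The containment relation can be written as a short check, using the same illustrative list-of-item-sets encoding (the greedy left-to-right matching is our implementation choice; it is correct for this containment test):

```python
def is_subsequence(s2, s1):
    """True iff s2 ⊑ s1: the transactions of s2 can be matched, in order,
    to distinct transactions of s1 that contain them."""
    j = 0
    for b in s2:
        # greedily advance to the next transaction of s1 containing b
        while j < len(s1) and not b <= s1[j]:
            j += 1
        if j == len(s1):
            return False
        j += 1  # each transaction of s1 is matched at most once
    return True

def maximal_sequences(V):
    """Sequences in V that are not subsequences of any other sequence in V."""
    return [s for i, s in enumerate(V)
            if not any(i != k and is_subsequence(s, t)
                       for k, t in enumerate(V))]

s1 = [frozenset({1}), frozenset({2, 3}), frozenset({4, 5})]
s2 = [frozenset({2}), frozenset({5})]
print(is_subsequence(s2, s1))  # True, as in the slide's example
```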
6
Problem Definition
Given a sequence database D and a sequence s:
– support count: the number of sequences in D that contain s.
– support: the fraction of sequences in D that contain s.
– frequent: the support of s is no less than a threshold ρs.
Mining Frequent Sequences
– Inputs:
a database D of sequences
a user-specified minimum support threshold ρs (e.g., ρs = 1%)
– Output: maximal frequent sequences
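Support counting then reduces to containment tests over the database; a minimal sketch, using the same illustrative encoding and a tiny made-up database:

```python
def contains(seq, s):
    """True iff sequence seq contains s (s is a subsequence of seq)."""
    j = 0
    for b in s:
        while j < len(seq) and not b <= seq[j]:
            j += 1
        if j == len(seq):
            return False
        j += 1
    return True

def support(D, s):
    """Fraction of sequences in D that contain s."""
    return sum(contains(seq, s) for seq in D) / len(D)

D = [[frozenset({1}), frozenset({2, 3})],   # contains <{1},{3}>
     [frozenset({1, 2})],                   # does not
     [frozenset({1}), frozenset({3})]]      # contains <{1},{3}>
s = [frozenset({1}), frozenset({3})]
print(support(D, s))  # 2/3
```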
7
Problem Definition
Database update
– D' = (D − Δ−) ∪ Δ+, where Δ− is the set of deleted sequences and Δ+ the set of inserted sequences; D− = D − Δ− is the unchanged part.
Incremental Update
– Inputs:
Δ−, D−, Δ+, ρs
frequent sequences in D and their supports
– Output: maximal frequent sequences in D'
8
Problem Definition
Notations
– D: the original database; D': the updated database
– Δ−: the set of sequences deleted from D; Δ+: the set of sequences inserted
– D− = D − Δ−: the unchanged part, so D = D− ∪ Δ− and D' = D− ∪ Δ+
– δX(s): the support count of sequence s in database X
9
Related Works--GSP
GSP was proposed by Srikant and Agrawal (EDBT '96).
10
Related Works--GSP
Candidate Generation Function GGen()
– Input: Li (frequent sequences of length i)
– Output: Ci+1 (candidate sequences of length i+1)
– Join: for each pair of sequences s1, s2 ∈ Li:
if the sequence obtained by deleting the first item of s1 equals the sequence obtained by deleting the last item of s2 (or vice versa), a candidate sequence is generated and inserted into Ci+1.
E.g., if s1 = <{1,2,3}> and s2 = <{2,3},{4}> (common part s' = <{2,3}>), then the candidate c1 = <{1,2,3},{4}> is generated.
– Prune: if a sequence s in Ci+1 has an infrequent subsequence, delete s from Ci+1.
Reason: if a sequence is frequent, then all its subsequences must be frequent.
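The join step can be sketched as follows, assuming items within a transaction are kept sorted in tuples so that the "first" and "last" item are well defined; the prune step (dropping candidates with an infrequent subsequence) is omitted for brevity:

```python
# A sketch of the GSP join step. Sequences are tuples of tuples of items,
# with items inside each transaction kept sorted.

def drop_first(s):
    """Sequence obtained by deleting the first item of s."""
    head, rest = s[0], s[1:]
    return rest if len(head) == 1 else (head[1:],) + rest

def drop_last(s):
    """Sequence obtained by deleting the last item of s."""
    tail, rest = s[-1], s[:-1]
    return rest if len(tail) == 1 else rest + (tail[:-1],)

def join(s1, s2):
    """If drop_first(s1) == drop_last(s2), extend s1 with s2's last item:
    as a new transaction if that item was alone in s2's last transaction,
    otherwise merged into s1's last transaction. Returns None otherwise."""
    if drop_first(s1) != drop_last(s2):
        return None
    last = s2[-1]
    if len(last) == 1:
        return s1 + (last,)
    return s1[:-1] + (tuple(sorted(s1[-1] + (last[-1],))),)

s1 = ((1, 2, 3),)
s2 = ((2, 3), (4,))
print(join(s1, s2))  # ((1, 2, 3), (4,)) -- the slide's example candidate
```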
11
Related Works--MFS
The I/O cost of GSP is high in some cases.
MFS tries to reduce the I/O cost of GSP (IC-AI 2001).
– It makes use of a suggested frequent sequence set Sest, obtained by
mining a sample of the database using GSP, or
taking the result of a previous mining action.
– It generalizes the candidate generation function of GSP:
its input: frequent sequences of various lengths
its output: candidate sequences of various lengths
– Longer sequences can be generated and counted early; therefore MFS reduces I/O cost.
12
Related Works--MFS
MFS algorithm
13
Incremental Update Algorithms
It is inefficient to apply GSP or MFS to mine the new database from scratch.
– Information available: frequent sequences in D and their supports.
Basic Idea:
– If a sequence s is frequent in D, its support count in D' can be deduced by scanning Δ− and Δ+, without scanning D−.
– If a sequence s is infrequent in D, it cannot be frequent in D' unless
its support count in Δ+ is large enough, and
its support count in Δ− is small enough.
14
Incremental Update Algorithms
Mathematical formulae (δX(s) denotes the support count of s in database X):
– δD'(s) = δD(s) − δΔ−(s) + δΔ+(s)
– Define δ̄D(s) = min{ δD(s') : s' is a subsequence of s }; then δ̄D(s) is an upper bound of δD(s), because every sequence containing s also contains each of its subsequences s'.
– Lemma 1: For a sequence s to be frequent in D', the following formula must be true: δD(s) − δΔ−(s) + δΔ+(s) ≥ ρs · |D'|.
– Lemma 2: If a sequence s is infrequent in D but frequent in D', the following formula must be true: δΔ+(s) − δΔ−(s) > ρs · (|Δ+| − |Δ−|).
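On toy numbers (all values below are made up for illustration), the support-count bookkeeping works out as follows:

```python
# Hypothetical support counts for one sequence s, plugged into the
# update formula delta_D'(s) = delta_D(s) - delta_dminus(s) + delta_dplus(s).
delta_D, delta_dminus, delta_dplus = 120, 10, 25   # counts of s in D, Δ-, Δ+
size_D, size_dminus, size_dplus = 1000, 100, 150   # |D|, |Δ-|, |Δ+|
rho = 0.10                                         # minimum support threshold

size_Dprime = size_D - size_dminus + size_dplus      # |D'| = 1050
delta_Dprime = delta_D - delta_dminus + delta_dplus  # count of s in D' = 135

print(delta_Dprime >= rho * size_Dprime)  # True: s is frequent in D'
```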
15
Incremental Update Algorithms
Algorithms GSP+ and MFS+
– Their structures are similar to those of GSP and MFS.
– Difference: each time after generating candidates, the two lemmas are used to delete some of them, scanning Δ− and/or Δ+ when necessary.
For a sequence s frequent in D, its support count in D is known: apply Lemma 1.
For a sequence s infrequent in D, its support count in D is unknown: apply Lemma 2.
– CPU saving is achieved by avoiding processing D− for some candidates.
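The candidate filter can be sketched as below. All names are ours, not from the paper: `cnt_minus`/`cnt_plus` stand for support counts obtained by scanning Δ− and Δ+, and the two tests follow the lemma reconstruction on the previous slide.

```python
def filter_candidates(cands, old_counts, cnt_minus, cnt_plus,
                      size_Dprime, size_dplus, size_dminus, rho):
    """old_counts: support counts in D of sequences that were frequent in D.
    cnt_minus/cnt_plus: functions giving a candidate's support count in
    Δ- and Δ+ (obtained by scanning them)."""
    survivors = []
    for s in cands:
        if s in old_counts:
            # Frequent in D: the exact new count follows from the deltas
            # alone (Lemma 1 case) -- no scan of D- is needed.
            new_count = old_counts[s] - cnt_minus(s) + cnt_plus(s)
            if new_count >= rho * size_Dprime:
                survivors.append(s)
        else:
            # Infrequent in D: Lemma 2 test. Survivors still need a scan
            # of D- to confirm their support in D'.
            if cnt_plus(s) - cnt_minus(s) > rho * (size_dplus - size_dminus):
                survivors.append(s)
    return survivors

minus = {"a": 10, "b": 3}
plus = {"a": 25, "b": 4}
print(filter_candidates(["a", "b"], {"a": 120},
                        minus.get, plus.get,
                        1050, 150, 100, 0.10))  # ['a']: 'b' is pruned
```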
16
Performance
Synthetic dataset
– Parameters of the dataset:
Parameter Description Value
| D | Number of customers 1,500,000
| C | Average number of transactions per customer 10
| T | Average number of items per transaction 2.5
| S | Average No. of itemsets in maximal potentially frequent sequences 4
| I | Average size of itemsets in maximal potentially frequent sequences 1.25
Ns Number of maximal potentially frequent sequences 5,000
NI Number of maximal potentially frequent itemsets 25,000
N Number of items 10,000
17
Performance
Comparison of the four algorithms under different support thresholds
– |D| = |D'| = 1,500,000; |Δ+| = |Δ−| = 150,000 = 10% of |D|; ρs = 0.35%–0.65%
18
Performance
Comparison of four algorithms under different support thresholds
– GSP+ and MFS+ need less CPU time.
– GSP+ and MFS+ usually require slightly more I/O, due to the processing of Δ−, which is not required by GSP and MFS.
– The MFS-based algorithms perform better, especially in I/O cost, because they use the old frequent sequences as Sest.
– MFS+ is the overall winner in terms of both CPU and I/O costs.
ρs | 0.35% | 0.4% | 0.45% | 0.5% | 0.55% | 0.6% | 0.65%
Total no. of candidates | 34,065 | 18,356 | 10,024 | 5,812 | 3,365 | 2,053 | 1,160
Candidates requiring a scan of D− | 13,042 | 7,161 | 3,966 | 2,313 | 1,353 | 867 | 509
Percentage (row 3 / row 2) | 38% | 39% | 40% | 40% | 40% | 42% | 44%
19
Performance
Varying |Δ+| and |Δ−|
– ρs = 0.5%, |D| = |D'| = 1,500,000
– |Δ+| = |Δ−| changes from 1% to 40% of |D|
20
Performance
Varying |Δ+| and |Δ−|
– The CPU costs of GSP and MFS stay relatively steady: they deal only with D', and |D'| does not change.
– The CPU costs of GSP+ and MFS+ increase linearly with |Δ+| and |Δ−|: they need extra time to process Δ+ and Δ−.
– MFS+ is the most CPU-efficient algorithm when |Δ+| = |Δ−| is less than 25% of |D|.
21
Conclusion
GSP+ and MFS+ outperform their non-incremental counterparts in CPU cost at the expense of a small penalty in I/O cost.
The MFS-based algorithms perform better than the GSP-based ones, particularly in I/O cost.
The performance gains of GSP+ and MFS+ are most prominent when the changed part of the database is small compared with the unchanged part.
22
The End