1
Efficient Algorithms for Incremental Update of Frequent Sequences
Minghua ZHANG
Dec. 7, 2001
2
Content
Introduction
Problem Definition
Related Works
Incremental Update Algorithms
Performance
Conclusion
3
Introduction
In daily life, sequences arise in many areas.
– An on-line bookstore: customers' buying sequences
– A web site: web-log sequences
The knowledge of frequent sequences is useful.
Some algorithms have been proposed, such as AprioriAll, GSP, SPADE, MFS and PrefixSpan.
These algorithms assume the database is static; in practice, the content of a sequence database changes continually.
4
Problem Definition
Item
– I = {i1, i2, …, iM}: a set of literals called items.
Transaction (or Itemset)
– Transaction t: a set of items such that t ⊆ I.
Sequence
– Sequence s = <t1, t2, …, tn>: an ordered list of transactions.
– The length of s (denoted |s|) is the number of items contained in s. E.g., if s = <{1},{2,3},{1,4}>, then |s| = 5.
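The length definition can be sketched in a few lines of Python, assuming a sequence is encoded as a list of item sets (this encoding is our choice for illustration, not prescribed by the slides):

```python
# A sequence is modeled as a list of transactions, each a frozenset of items.
# This representation is an assumption made for illustration.

def seq_length(s):
    """|s|: the total number of items over all transactions of s."""
    return sum(len(t) for t in s)

s = [frozenset({1}), frozenset({2, 3}), frozenset({1, 4})]
print(seq_length(s))  # 5, matching the slide's example
```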
5
Problem Definition
Subsequence
– Let s1 = <a1, a2, …, am> and s2 = <b1, b2, …, bn>.
– If there exist integers j1, j2, …, jn with 1 ≤ j1 < j2 < … < jn ≤ m such that b1 ⊆ aj1, b2 ⊆ aj2, …, bn ⊆ ajn,
– then s2 is a subsequence of s1, and s1 contains s2 (denoted s2 ⊑ s1).
– Example: if s1 = <{1},{2,3},{4,5}> and s2 = <{2},{5}>, then s2 ⊑ s1.
Maximal Sequence
– Given a sequence set V, a sequence s in V is maximal if s is not a subsequence of any other sequence in V.
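The containment relation can be written as a short check, using the same illustrative list-of-item-sets encoding (the greedy left-to-right matching is our implementation choice; it is correct for this containment test):

```python
def is_subsequence(s2, s1):
    """True iff s2 ⊑ s1: the transactions of s2 can be matched, in order,
    to distinct transactions of s1 that contain them."""
    j = 0
    for b in s2:
        # greedily advance to the next transaction of s1 containing b
        while j < len(s1) and not b <= s1[j]:
            j += 1
        if j == len(s1):
            return False
        j += 1  # each transaction of s1 is matched at most once
    return True

def maximal_sequences(V):
    """Sequences in V that are not subsequences of any other sequence in V."""
    return [s for i, s in enumerate(V)
            if not any(i != k and is_subsequence(s, t)
                       for k, t in enumerate(V))]

s1 = [frozenset({1}), frozenset({2, 3}), frozenset({4, 5})]
s2 = [frozenset({2}), frozenset({5})]
print(is_subsequence(s2, s1))  # True, as in the slide's example
```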
6
Problem Definition
Given a sequence database D and a sequence s:
– support count: the number of sequences in D that contain s.
– support: the fraction of sequences in D that contain s.
– frequent: the support of s is no less than a threshold ρs.
Mining Frequent Sequences
– Inputs:
a database D of sequences
a user-specified minimum support threshold ρs (e.g., ρs = 1%)
– Output: maximal frequent sequences
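Support counting then reduces to containment tests over the database; a minimal sketch, using the same illustrative encoding and a tiny made-up database:

```python
def contains(seq, s):
    """True iff sequence seq contains s (s is a subsequence of seq)."""
    j = 0
    for b in s:
        while j < len(seq) and not b <= seq[j]:
            j += 1
        if j == len(seq):
            return False
        j += 1
    return True

def support(D, s):
    """Fraction of sequences in D that contain s."""
    return sum(contains(seq, s) for seq in D) / len(D)

D = [[frozenset({1}), frozenset({2, 3})],   # contains <{1},{3}>
     [frozenset({1, 2})],                   # does not
     [frozenset({1}), frozenset({3})]]      # contains <{1},{3}>
s = [frozenset({1}), frozenset({3})]
print(support(D, s))  # 2/3
```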
7
Problem Definition
Database update
– D' = (D − Δ−) ∪ Δ+, where Δ− is the set of deleted sequences and Δ+ the set of inserted sequences; D− = D − Δ− is the unchanged part.
Incremental Update
– Inputs:
Δ−, D−, Δ+, ρs
frequent sequences in D and their supports
– Output: maximal frequent sequences in D'
8
Problem Definition
Notations
– D: the original database; D': the updated database
– Δ−: the set of sequences deleted from D; Δ+: the set of sequences inserted
– D− = D − Δ−: the unchanged part, so D = D− ∪ Δ− and D' = D− ∪ Δ+
– δX(s): the support count of sequence s in database X
9
Related Works--GSP
GSP was proposed by Srikant and Agrawal (EDBT '96).
10
Related Works--GSP
Candidate Generation Function GGen()
– Input: Li (frequent sequences of length i)
– Output: Ci+1 (candidate sequences of length i+1)
– Join: for each pair of sequences s1, s2 ∈ Li:
if the sequence obtained by deleting the first item of s1 equals the sequence obtained by deleting the last item of s2 (or vice versa), a candidate sequence is generated and inserted into Ci+1.
E.g., if s1 = <{1,2,3}> and s2 = <{2,3},{4}> (common part s' = <{2,3}>), then the candidate c1 = <{1,2,3},{4}> is generated.
– Prune: if a sequence s in Ci+1 has an infrequent subsequence, delete s from Ci+1.
Reason: if a sequence is frequent, then all its subsequences must be frequent.
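The join step can be sketched as follows, assuming items within a transaction are kept sorted in tuples so that the "first" and "last" item are well defined; the prune step (dropping candidates with an infrequent subsequence) is omitted for brevity:

```python
# A sketch of the GSP join step. Sequences are tuples of tuples of items,
# with items inside each transaction kept sorted.

def drop_first(s):
    """Sequence obtained by deleting the first item of s."""
    head, rest = s[0], s[1:]
    return rest if len(head) == 1 else (head[1:],) + rest

def drop_last(s):
    """Sequence obtained by deleting the last item of s."""
    tail, rest = s[-1], s[:-1]
    return rest if len(tail) == 1 else rest + (tail[:-1],)

def join(s1, s2):
    """If drop_first(s1) == drop_last(s2), extend s1 with s2's last item:
    as a new transaction if that item was alone in s2's last transaction,
    otherwise merged into s1's last transaction. Returns None otherwise."""
    if drop_first(s1) != drop_last(s2):
        return None
    last = s2[-1]
    if len(last) == 1:
        return s1 + (last,)
    return s1[:-1] + (tuple(sorted(s1[-1] + (last[-1],))),)

s1 = ((1, 2, 3),)
s2 = ((2, 3), (4,))
print(join(s1, s2))  # ((1, 2, 3), (4,)) -- the slide's example candidate
```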
11
Related Works--MFS
The I/O cost of GSP is high in some cases.
MFS tries to reduce the I/O cost of GSP (IC-AI 2001).
– It makes use of a suggested frequent sequence set Sest, obtained by
mining a sample of the database using GSP, or
taking the result of a previous mining action.
– It generalizes the candidate generation function of GSP:
its input: frequent sequences of various lengths
its output: candidate sequences of various lengths
– Longer sequences can be generated and counted early; therefore MFS reduces I/O cost.
12
Related Works--MFS
MFS algorithm
13
Incremental Update Algorithms
It is inefficient to apply GSP or MFS to mine the new database from scratch.
– Information available: frequent sequences in D and their supports.
Basic Idea:
– If a sequence s is frequent in D, its support count in D' can be deduced by scanning Δ− and Δ+, without scanning D−.
– If a sequence s is infrequent in D, it cannot be frequent in D' unless
its support count in Δ+ is large enough, and
its support count in Δ− is small enough.
14
Incremental Update Algorithms
Mathematical formulae (δX(s) denotes the support count of s in database X):
– δD'(s) = δD(s) − δΔ−(s) + δΔ+(s)
– Define δ̄D(s) = min{ δD(s') : s' is a subsequence of s }; then δ̄D(s) is an upper bound of δD(s), because every sequence containing s also contains each of its subsequences s'.
– Lemma 1: For a sequence s to be frequent in D', the following formula must be true: δD(s) − δΔ−(s) + δΔ+(s) ≥ ρs · |D'|.
– Lemma 2: If a sequence s is infrequent in D but frequent in D', the following formula must be true: δΔ+(s) − δΔ−(s) > ρs · (|Δ+| − |Δ−|).
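On toy numbers (all values below are made up for illustration), the support-count bookkeeping works out as follows:

```python
# Hypothetical support counts for one sequence s, plugged into the
# update formula delta_D'(s) = delta_D(s) - delta_dminus(s) + delta_dplus(s).
delta_D, delta_dminus, delta_dplus = 120, 10, 25   # counts of s in D, Δ-, Δ+
size_D, size_dminus, size_dplus = 1000, 100, 150   # |D|, |Δ-|, |Δ+|
rho = 0.10                                         # minimum support threshold

size_Dprime = size_D - size_dminus + size_dplus      # |D'| = 1050
delta_Dprime = delta_D - delta_dminus + delta_dplus  # count of s in D' = 135

print(delta_Dprime >= rho * size_Dprime)  # True: s is frequent in D'
```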
15
Incremental Update Algorithms
Algorithms GSP+ and MFS+
– Their structures are similar to those of GSP and MFS.
– Difference: each time after generating candidates, the two lemmas are used to delete some of them, scanning Δ− and/or Δ+ when necessary.
For a sequence s frequent in D, its support count in D is known: apply Lemma 1.
For a sequence s infrequent in D, its support count in D is unknown: apply Lemma 2.
– CPU saving is achieved by avoiding processing D− for some candidates.
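The candidate filter can be sketched as below. All names are ours, not from the paper: `cnt_minus`/`cnt_plus` stand for support counts obtained by scanning Δ− and Δ+, and the two tests follow the lemma reconstruction on the previous slide.

```python
def filter_candidates(cands, old_counts, cnt_minus, cnt_plus,
                      size_Dprime, size_dplus, size_dminus, rho):
    """old_counts: support counts in D of sequences that were frequent in D.
    cnt_minus/cnt_plus: functions giving a candidate's support count in
    Δ- and Δ+ (obtained by scanning them)."""
    survivors = []
    for s in cands:
        if s in old_counts:
            # Frequent in D: the exact new count follows from the deltas
            # alone (Lemma 1 case) -- no scan of D- is needed.
            new_count = old_counts[s] - cnt_minus(s) + cnt_plus(s)
            if new_count >= rho * size_Dprime:
                survivors.append(s)
        else:
            # Infrequent in D: Lemma 2 test. Survivors still need a scan
            # of D- to confirm their support in D'.
            if cnt_plus(s) - cnt_minus(s) > rho * (size_dplus - size_dminus):
                survivors.append(s)
    return survivors

minus = {"a": 10, "b": 3}
plus = {"a": 25, "b": 4}
print(filter_candidates(["a", "b"], {"a": 120},
                        minus.get, plus.get,
                        1050, 150, 100, 0.10))  # ['a']: 'b' is pruned
```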
16
Performance
Synthetic dataset
– Parameters of the dataset:
Parameter Description Value
| D | Number of customers 1,500,000
| C | Average number of transactions per customer 10
| T | Average number of items per transaction 2.5
| S | Average No. of itemsets in maximal potentially frequent sequences 4
| I | Average size of itemsets in maximal potentially frequent sequences 1.25
Ns Number of maximal potentially frequent sequences 5,000
NI Number of maximal potentially frequent itemsets 25,000
N Number of items 10,000
17
Performance
Comparison of the four algorithms under different support thresholds
– |D| = |D'| = 1,500,000; |Δ+| = |Δ−| = 150,000 = 10% of |D|; ρs = 0.35%–0.65%
18
Performance
Comparison of four algorithms under different support thresholds
– GSP+ and MFS+ need less CPU time.
– GSP+ and MFS+ usually require slightly more I/O, due to the processing of Δ−, which is not required by GSP and MFS.
– The MFS-based algorithms perform better, especially in I/O cost, because they use the old frequent sequences as Sest.
– MFS+ is the overall winner in terms of both CPU and I/O costs.
ρs | 0.35% | 0.4% | 0.45% | 0.5% | 0.55% | 0.6% | 0.65%
Total no. of candidates | 34,065 | 18,356 | 10,024 | 5,812 | 3,365 | 2,053 | 1,160
Candidates requiring a scan of D− | 13,042 | 7,161 | 3,966 | 2,313 | 1,353 | 867 | 509
Percentage (row 3 / row 2) | 38% | 39% | 40% | 40% | 40% | 42% | 44%
19
Performance
Varying |Δ+| and |Δ−|
– ρs = 0.5%, |D| = |D'| = 1,500,000
– |Δ+| = |Δ−| changes from 1% to 40% of |D|
20
Performance
Varying |Δ+| and |Δ−|
– The CPU costs of GSP and MFS stay relatively steady: they deal only with D', and |D'| does not change.
– The CPU costs of GSP+ and MFS+ increase linearly with |Δ+| and |Δ−|: they need extra time to process Δ+ and Δ−.
– MFS+ is the most CPU-efficient algorithm when |Δ+| = |Δ−| is less than 25% of |D|.
21
Conclusion
GSP+ and MFS+ outperform their non-incremental counterparts in CPU cost at the expense of a small penalty in I/O cost.
The MFS-based algorithms perform better than the GSP-based ones, particularly in I/O cost.
The performance gains of GSP+ and MFS+ are most prominent when the changed part of the database is small compared with the unchanged part.
22
The End