mining sequential patterns dimitrios gunopulos, ucr
Post on 19-Dec-2015
216 views
TRANSCRIPT
Finding Frequent Sequential Patterns
• The problem: Given a sequence of discrete events that may repeat: A B A C D A C E B A B C… Find patterns that repeat frequently.
• For example: A followed by B (A->B), or A followed by C (A->C) The patterns should occur within a window W.
• Applications in telecommunication data, networks, biology
Sequences
• Sequence
((T=90F) (H=60%, P=1.1atm))
time t1 Later time t2
attribute value
item
itemitemitemset
• k-sequence: sequence with k items
• T1H2P1T3P2, P1T2H4P2T5: 5-sequences
• S1 is subsequence of S2 (S1 S2)
• T1P1T2 H1T1P2H2P1T2 (T1H1T1 , P1T2H2P1T2)
• H1P1T2 H1T1P2H2P1T2
Sequential Patterns: The Problem
• support or frequency of a sequence S ((S)):
• = the total number of times sequence S is encountered
• user specified minimum threshold min_sup
• S is frequent (S) min_sup
• S:maximal frequent sequenceS is frequent and all of its supersequences are non-frequent
• S:minimal non-frequent sequenceS is non frequent and all of its subsequences are frequent
• The problem
• Given: database D and min_sup
• the problem: find all frequent sequences in D
Algorithms for Sequential Patterns
• Apriori, GSP[Srikant, Agrawal, EDBT 1996][Mannila, Toivonen, Verkamo, DMKD 1997]
• SPADE, Parallel Spade [Zaki, 2001]• FreeSpan, PrefixSpan
[Han et al, SIGKDD 2000], [Pei et al, ICDE 2001]• Sequential Patterns with constraints
[Garofalakis et al, VLDB 99]• DFS-Mine [Tsoukatos and Gunopulos, SSTD 2001]
SPADE ([Zaki, 2001])
• Lattice-based approach
• vertical id-list format
• enumerates all frequent sequences equivalence classes to decompose the problem:
• two k-sequences belong in the same []i class if they have the same i-length prefix
• each class fits in main memory
• generates a (k+1)-sequence by intersecting two k-sequences that have common (k-1)-length prefix
• minimizes I/O cost - 2 database scans:
• frequent 1-sequences, frequent 2-sequences
DFS_MINE
• Depth-First-Search approach• fast discoveries of long
maximal frequent patterns• uses minimal amount of
memory• some frequent sequences are
deduced to be frequent from lattice
• candidate (k+1)-sequence: intersect a k-sequence with all frequent items (FreqItems)
• in main memory:• S.Useless: set of items sequence S must not be intersected with
• MaxFreqList: List of Maximal Frequent Sequences
• MinNonFreqList: List of Minimal Non Frequent Sequences
• scan database to determine the support of candidate sequences
BCD
CD In MinNonFreqList
candidate
BCD
ABCDE
In MaxFreqList
candidate
MaxFreqList - MinNonFreqListLemma: All subsequences of a frequent sequence are also frequent
• Supersequences S is inserted in MinNonFreqList if:
• S is not in MinNonFreqList
• S is not a supersequence of a sequence in MinNonFreqList
• S was scanned in database and was found to be non-frequent
• Supersequences of S in MinNonFreqList are removed.
• S is inserted in MaxFreqList if:
• S is not in MaxFreqList
• S is not a subsequence of a sequence in MaxFreqList
• S was scanned in database and was found to be frequent
• Subsequences of S in MaxFreqList are removed.
Examining Candidate Sequences
• k-sequence S is intersected with all items Ij in FreqItems-S.Useless
• resulting sets SET(S+Ij) for all Ij
• each sequence S:
• check MinNonFreqList
• check MaxFreqList
• scan database for all unknown sequences (if any) in SET(S+Ij) for all Ij
(1pass)
• update MaxFreqList, MinNonFreqList
Generating sequences• k-sequence S + Ij in FreqItems-S.Useless = candidate (k+1)-sequences
ABCD + E1. EABCD 2. AEBCD 3. AEBCD 4. ABECD 5. ABECD 6. ABCED 7. ABCED 8. ABCDE 9. ABCDE
ABCD + D1. DABCD 2. ADBCD 3. ADBCD 4. ABDCD 5. ABDCD 6. ABCDD 7. ABCDD 8. ABCDD 9. ABCDD
• insert item Ij in all possible positions that follow its rightmost
occurrence is a k-sequence S. If the item does not occur at all in the
sequence, then it is inserted in all positions.
AAA
ADAA AAAD
ADAADADAAD
D D
DD
Useless Set of a sequence S• after intersecting S with item Ij, it is inserted in S.Useless
• when intersecting S with item Ij, all items Ik (k<j) are in S.Useless
• S.Useless is ‘inherited’ by the (k+1)-sequences produced
SET(S+A,A) SET(S+B,B)
SET(S+A) SET(S+B)
Sequence S
SET(S+A,B)=SET(S+B,A)
A
A BA B
B
AB+EEABAEBAEBABEABE
AB+DDABADBADBABDABD
DAB +EEDABDEABDEABDAEBDAEBDABEDABE
Sequence S
SET(S+D)
D
SET(S+E)
E
SET(S+D,E)
E
not f
requ
ent
not f
requ
ent
Bound to be Bound to be not frequentnot frequent
Scenario 1
Scenario 2
Open Problems
• Output subexponential maximal sequential pattern algorithms
• Efficient algorithms for finding episodes (approximate sequential patterns – edit distance)
Spatiotemporal DatasetsTemperature Map US
Snow-ice-rain radar US Snow-ice-rain radar NE
Precipitation radar Bay AreaPrecipitation radar Lakes
Mining Spatiotemporal Data
• CONQUEST, [Stolorz et al, KDD 1995]– Patterns in global climate change
• SKICAT, [Fayyad et al, 1996]– Image processing techniques and classification
techniques to identify objects in satellite pictures• GeoMiner [Han et al, 1997]• MultiMediaMiner, [Zaiane et al, 1998]
– Data Cube structure. Mining of association and classification rules.
• DFS-Mine, [Tsoukatos et al, 2001]– Discovery of spatiotemporal patterns