mining sequential patterns dimitrios gunopulos, ucr

Mining Sequential Patterns

Dimitrios Gunopulos, UCR

Finding Frequent Sequential Patterns

• The problem: Given a sequence of discrete events that may repeat: A B A C D A C E B A B C… Find patterns that repeat frequently.

• For example: A followed by B (A->B), or A followed by C (A->C) The patterns should occur within a window W.

• Applications in telecommunication data, networks, biology

Sequences

• Sequence

((T=90F) (H=60%, P=1.1atm))

time t1 Later time t2

attribute value

item

itemitemitemset

• k-sequence: sequence with k items

• T1H2P1T3P2, P1T2H4P2T5: 5-sequences

• S1 is subsequence of S2 (S1 S2)

• T1P1T2 H1T1P2H2P1T2 (T1H1T1 , P1T2H2P1T2)

• H1P1T2 H1T1P2H2P1T2

Sequential Patterns: The Problem

• support or frequency of a sequence S ((S)):

• = the total number of times sequence S is encountered

• user specified minimum threshold min_sup

• S is frequent (S) min_sup

• S:maximal frequent sequenceS is frequent and all of its supersequences are non-frequent

• S:minimal non-frequent sequenceS is non frequent and all of its subsequences are frequent

• The problem

• Given: database D and min_sup

• the problem: find all frequent sequences in D

Example Database

Algorithms for Sequential Patterns

• Apriori, GSP[Srikant, Agrawal, EDBT 1996][Mannila, Toivonen, Verkamo, DMKD 1997]

• SPADE, Parallel Spade [Zaki, 2001]• FreeSpan, PrefixSpan

[Han et al, SIGKDD 2000], [Pei et al, ICDE 2001]• Sequential Patterns with constraints

[Garofalakis et al, VLDB 99]• DFS-Mine [Tsoukatos and Gunopulos, SSTD 2001]

The Lattice Structure

• Lemma: All subsequences of a frequent sequence are frequent

SPADE ([Zaki, 2001])

• Lattice-based approach

• vertical id-list format

• enumerates all frequent sequences equivalence classes to decompose the problem:

• two k-sequences belong in the same []i class if they have the same i-length prefix

• each class fits in main memory

• generates a (k+1)-sequence by intersecting two k-sequences that have common (k-1)-length prefix

• minimizes I/O cost - 2 database scans:

• frequent 1-sequences, frequent 2-sequences

DFS_MINE

• Depth-First-Search approach• fast discoveries of long

maximal frequent patterns• uses minimal amount of

memory• some frequent sequences are

deduced to be frequent from lattice

• candidate (k+1)-sequence: intersect a k-sequence with all frequent items (FreqItems)

• in main memory:• S.Useless: set of items sequence S must not be intersected with

• MaxFreqList: List of Maximal Frequent Sequences

• MinNonFreqList: List of Minimal Non Frequent Sequences

• scan database to determine the support of candidate sequences

BCD

CD In MinNonFreqList

candidate

BCD

ABCDE

In MaxFreqList

candidate

MaxFreqList - MinNonFreqListLemma: All subsequences of a frequent sequence are also frequent

• Supersequences S is inserted in MinNonFreqList if:

• S is not in MinNonFreqList

• S is not a supersequence of a sequence in MinNonFreqList

• S was scanned in database and was found to be non-frequent

• Supersequences of S in MinNonFreqList are removed.

• S is inserted in MaxFreqList if:

• S is not in MaxFreqList

• S is not a subsequence of a sequence in MaxFreqList

• S was scanned in database and was found to be frequent

• Subsequences of S in MaxFreqList are removed.

Examining Candidate Sequences

• k-sequence S is intersected with all items Ij in FreqItems-S.Useless

• resulting sets SET(S+Ij) for all Ij

• each sequence S:

• check MinNonFreqList

• check MaxFreqList

• scan database for all unknown sequences (if any) in SET(S+Ij) for all Ij

(1pass)

• update MaxFreqList, MinNonFreqList

Generating sequences• k-sequence S + Ij in FreqItems-S.Useless = candidate (k+1)-sequences

ABCD + E1. EABCD 2. AEBCD 3. AEBCD 4. ABECD 5. ABECD 6. ABCED 7. ABCED 8. ABCDE 9. ABCDE

ABCD + D1. DABCD 2. ADBCD 3. ADBCD 4. ABDCD 5. ABDCD 6. ABCDD 7. ABCDD 8. ABCDD 9. ABCDD

• insert item Ij in all possible positions that follow its rightmost

occurrence is a k-sequence S. If the item does not occur at all in the

sequence, then it is inserted in all positions.

AAA

ADAA AAAD

ADAADADAAD

D D

DD

Useless Set of a sequence S• after intersecting S with item Ij, it is inserted in S.Useless

• when intersecting S with item Ij, all items Ik (k<j) are in S.Useless

• S.Useless is ‘inherited’ by the (k+1)-sequences produced

SET(S+A,A) SET(S+B,B)

SET(S+A) SET(S+B)

Sequence S

SET(S+A,B)=SET(S+B,A)

A

A BA B

B

AB+EEABAEBAEBABEABE

AB+DDABADBADBABDABD

DAB +EEDABDEABDEABDAEBDAEBDABEDABE

Sequence S

SET(S+D)

D

SET(S+E)

E

SET(S+D,E)

E

not f

requ

ent

not f

requ

ent

Bound to be Bound to be not frequentnot frequent

Scenario 1

Scenario 2

Open Problems

• Output subexponential maximal sequential pattern algorithms

• Efficient algorithms for finding episodes (approximate sequential patterns – edit distance)

Spatiotemporal DatasetsTemperature Map US

Snow-ice-rain radar US Snow-ice-rain radar NE

Precipitation radar Bay AreaPrecipitation radar Lakes

Mining Spatiotemporal Data

• CONQUEST, [Stolorz et al, KDD 1995]– Patterns in global climate change

• SKICAT, [Fayyad et al, 1996]– Image processing techniques and classification

techniques to identify objects in satellite pictures• GeoMiner [Han et al, 1997]• MultiMediaMiner, [Zaiane et al, 1998]

– Data Cube structure. Mining of association and classification rules.

• DFS-Mine, [Tsoukatos et al, 2001]– Discovery of spatiotemporal patterns

Open Problems

• Similarity models and indexing techniques for higher-dimensional time series

• Efficient trend detection/subsequence matching algorithms

• Algorithms to capture the data distribution when it changes over time

• New models for capturing the evolution of spatial phenomena over time

mining sequential patterns dimitrios gunopulos, ucr

Documents

sequence s s

frequent slide

frequent subsequences

maximal frequent sequence

frequent supersequences

minnonfreqlist s

maxfreqlist s

items sequence s