Lecture 9. Frequent pattern mining in streams

Ricard Gavalda

MIRI Seminar on Data Streams, Spring 2015

1 / 32


Contents

1 Frequent pattern mining - batch

2 Frequent pattern mining in data streams

3 IncMine: itemset mining in MOA


Frequent pattern mining - batch


P: a set of patterns
⪯: subpattern relation, a partial order

Examples:

sets with subset relation
sequences with (some) subsequence relation
trees with (some) subtree relation
graphs with (some) subgraph relation


Frequent pattern mining - batch

D: a database, or multiset, of patterns
s(D, p) = absolute support of p in D = |{p′ ∈ D : p ⪯ p′}|
σ(D, p) = relative support = s(D, p) / |D|
σ: a minimum support threshold

The frequent pattern mining task
Given D and σ, find all the patterns p such that σ(D, p) ≥ σ
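As a rough illustration of the task in the itemset case, the following Python sketch computes supports by brute force over all candidate itemsets. The function names and the toy database are assumptions made for this example, not part of the lecture, and the exhaustive enumeration is only workable for a handful of items.

```python
from itertools import combinations


def absolute_support(database, pattern):
    """s(D, p): number of transactions in D that contain pattern p."""
    return sum(1 for transaction in database if pattern <= transaction)


def frequent_itemsets(database, min_support):
    """Return {itemset: relative support} for itemsets with support >= min_support."""
    items = sorted(set().union(*database))
    frequent = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            pattern = frozenset(combo)
            support = absolute_support(database, pattern) / len(database)
            if support >= min_support:
                frequent[pattern] = support
    return frequent


# Toy database (hypothetical): four transactions over items a, b, c.
database = [frozenset("abc"), frozenset("ab"), frozenset("ac"), frozenset("b")]
result = frequent_itemsets(database, min_support=0.5)
```

With this database and σ = 0.5, the itemsets {b, c} and {a, b, c} fall below the threshold, since each appears in only one of the four transactions.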


Frequent pattern mining - batch

Computationally costly, for two reasons:

1 Many candidate frequent patterns
  e.g. 2^k itemsets if there are k distinct items

2 Many frequent patterns actually present in database


Frequent pattern mining - batch

For problem 1: discard many candidate patterns soon

Antimonotonicity - the apriori principle
If p ⪯ p′, then σ(p) ≥ σ(p′)

For problem 2: compute a smaller set with same information

Closed pattern (in D)
p is closed if every proper superpattern of p has strictly smaller support
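Both ideas can be sketched for itemsets in a few lines of Python. This is a minimal, unoptimized level-wise search in the spirit of Apriori followed by a closedness filter; the helper names and the toy database are assumptions for illustration only.

```python
from itertools import combinations


def apriori(database, min_support):
    """Level-wise search; returns {itemset: absolute support} of frequent itemsets."""
    min_count = min_support * len(database)

    def count(pattern):
        return sum(1 for transaction in database if pattern <= transaction)

    def keep_frequent(candidates):
        counted = {c: count(c) for c in candidates}
        return {c: s for c, s in counted.items() if s >= min_count}

    level = keep_frequent({frozenset([i]) for t in database for i in t})
    frequent = dict(level)
    size = 1
    while level:
        size += 1
        # Join frequent (size-1)-itemsets into candidates of the next size.
        candidates = {a | b for a in level for b in level if len(a | b) == size}
        # Antimonotonicity: a candidate with an infrequent subset cannot be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in level for sub in combinations(c, size - 1))
        }
        level = keep_frequent(candidates)
        frequent.update(level)
    return frequent


def closed_patterns(frequent):
    """Keep patterns with no proper superpattern of equal support.

    A superpattern with equal support is itself frequent, so looking only
    at the frequent patterns is enough.
    """
    return {
        p: s for p, s in frequent.items()
        if not any(p < q and s == s_q for q, s_q in frequent.items())
    }


database = [frozenset("abc"), frozenset("ab"), frozenset("ac"), frozenset("b")]
frequent = apriori(database, min_support=0.5)
closed = closed_patterns(frequent)
```

Here {a, b, c} is never counted, because its subset {b, c} is already known to be infrequent, and {c} is frequent but not closed, because {a, c} has the same support.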


Closed patterns

Fact
Frequent patterns and their frequencies can be generated (easily) from closed patterns and their frequencies

There are typically far fewer frequent closed patterns than there are frequent patterns
∴ savings if we only compute closed frequent patterns
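For itemsets, the fact above amounts to this: the support of a pattern is the largest support among the closed patterns that contain it. A small Python sketch, with hypothetical function names and a hypothetical set of frequent closed itemsets:

```python
from itertools import combinations


def support_from_closed(closed, pattern):
    """Support of pattern = largest support among closed patterns containing it.

    Returns 0 when no stored closed pattern contains it, which here only
    means "not frequent", since infrequent closed patterns are not stored.
    """
    return max((s for c, s in closed.items() if pattern <= c), default=0)


def expand_closed(closed):
    """Regenerate every frequent itemset, with its support, from the closed ones."""
    frequent = {}
    for c in closed:
        for size in range(1, len(c) + 1):
            for combo in combinations(sorted(c), size):
                p = frozenset(combo)
                frequent[p] = support_from_closed(closed, p)
    return frequent


# Hypothetical frequent closed itemsets with absolute supports.
closed = {frozenset("a"): 3, frozenset("b"): 3, frozenset("ab"): 2, frozenset("ac"): 2}
frequent = expand_closed(closed)
```

Four closed itemsets are enough to recover five frequent ones in this example; {c} gets its support from {a, c}.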


Frequent closed patterns - batch

Central concept (and data structure):

Galois Lattice


Frequent closed patterns - batch

Batch frequent closed ...

itemset miners: CLOSET, CHARM, CLOSET+, ...
sequence miners [Wang 04]
tree miners [Balcazar-Bifet-Lozano 06-10]
graph miners [Yan03]


Frequent pattern mining in data streams


Frequent patterns in data streams

Requirements: low time per pattern, small memory, adapt to change

Taxonomy:

Exact or approximate
  with false positives and/or false negatives
Per batch or per transaction
Incremental, sliding window, or fully adaptive
Frequent or frequent closed


Frequent closed patterns

A general framework [Bifet-G 11] (based on [BBL06-10])

Use a base batch miner
Collect a batch of transactions from the stream
Compute all closed patterns and their counts, C
Merge C into the summary of frequent closed patterns for the stream


Closure Operator

Given a dataset D of patterns and a pattern t,

Closure of a pattern
∆_D(t), the closure of t, is the intersection of all patterns in D that contain t

Fact
t is closed in D if and only if it is in ∆_D(t)

Note: no mention of support!!
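For itemsets the intersection is a single set, so "t is in ∆_D(t)" reduces to "t equals its closure". A minimal Python sketch of the operator under that itemset assumption (function names and data are illustrative, not from the lecture):

```python
def closure(database, t):
    """Intersection of all transactions in the database that contain t.

    Returns None when no transaction contains t.
    """
    containing = [frozenset(p) for p in database if t <= p]
    if not containing:
        return None
    return containing[0].intersection(*containing[1:])


def is_closed(database, t):
    """For itemsets, t is closed exactly when it equals its own closure."""
    return closure(database, t) == t


database = [frozenset("abc"), frozenset("ab"), frozenset("ac"), frozenset("b")]
closure_of_c = closure(database, frozenset("c"))
```

In this database every transaction containing c also contains a, so the closure of {c} is {a, c} and {c} is not closed. Supports are never consulted, in line with the note above.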


Adding and removing pattern batches

Proposition
A pattern t is closed in D1 ∪ D2 if and only if

it is closed in D1, or
it is closed in D2, or
it is a subpattern of a closed pattern in D1 and of a closed pattern in D2, and is in ∆_D1(t) ∩ ∆_D2(t)


Incremental Algorithm

Computing the lattice of frequent patterns
Construct empty lattice L;
Repeat
    Collect batch of B patterns;
    Build closed pattern lattice for B, L′;
    L = merge(L, L′) (using addition rule);
    delete from L patterns with support below σ

Memory & time depend on lattice size (= number of closed patterns), not on DB size!
Batch size depends on the tradeoff between batch miner time and merging time


Fully adaptive algorithm

Keep a window on recent stream batches
Actually, only their lattices of closed patterns

When a new batch is added, drop the oldest batch and undo its effect using the closure definition

Alternatively:
Use change detectors to decide which batches are stale
E.g. on the number of patterns that enter or leave the lattice


Further improvement: relaxed support

Consider c-relaxed support intervals: [c^i, c^(i+1))

A pattern in interval I is c-closed if the support of every proper superpattern is in another interval

Largely reduces lattice sizes & computation time, at the cost of c-approximate counts


IncMine: itemset mining in MOA


Closed itemset miners in data streams

Exact: MOMENT [Chi+ 06], NEWMOMENT [Li+ 09], CLOSTREAM [Yen+ 11], ...
High computational cost for exactness

Approximate: IncMine [Cheng+ 08], CLAIM [Song+ 07], ...
More efficient at the expense of false positives and/or negatives


The IncMine Algorithm [Cheng,Ke,Ng 08]

Some features:

Keeps frequent closed itemsets in a sliding window
Approximate algorithm, controlled by a relaxation parameter
Drops non-promising itemsets: may have false negatives

Chosen for implementation in MOA [Quadrana-Bifet-G 13&15]


Non-promising itemsets

Assume window of last W transactions, min. support σ

If t is σ-frequent in W, we expect σw occurrences in the first w elements of the window (w < W), assuming no change
Choose to drop it if there are much fewer occurrences

More precisely, if less than σ · r(w), for
r(w) = r + (1 − r) w/W
so that r(0) = r and r(W) = 1

Erroneously dropped itemsets will be false negatives
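The relaxed threshold is easy to evaluate numerically. The sketch below reads σ · r(w) as a bound on the relative support observed over the first w elements of the window; that reading, like the function names, is an assumption made for illustration and not taken from the IncMine code.

```python
def relaxed_ratio(w, window_size, r):
    """r(w) = r + (1 - r) * w / W, rising from r at w = 0 to 1 at w = W."""
    return r + (1 - r) * w / window_size


def is_non_promising(count, w, window_size, sigma, r):
    """True if the relative support over the first w elements is below sigma * r(w)."""
    return count / w < sigma * relaxed_ratio(w, window_size, r)


# Hypothetical setting: W = 1000, sigma = 0.1, r = 0.5, after w = 100 elements.
drop_rare = is_non_promising(count=3, w=100, window_size=1000, sigma=0.1, r=0.5)
drop_borderline = is_non_promising(count=8, w=100, window_size=1000, sigma=0.1, r=0.5)
```

After 100 of 1000 elements the relaxed threshold is 0.1 · 0.55 = 0.055, so an itemset seen 3 times is dropped, while one seen 8 times is kept even though it is still below σ itself.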


Non-promising itemsets

Inverted index of frequent closed itemsets (FCI), to keep the itemsets within the window up to date

Requires a batch method for finding FCI in new batch

We chose CHARM [Zaki+ 02]


Experiments: Accuracy

Zaki’s synthetic frequent itemset generator (standard in field)

100% precision (no false positives)

100% recall up to r = 0.6; down to 82% by r = 0.8


Experiments: Throughput

Transactions/second for different values of r (σ = 0.1). The minimum support used for MOMENT is equal to 500. Note the logarithmic scale on the y axis.


Experiments: Throughput

Transactions/second for different values of σ (r = 0.5). The minimum support used for MOMENT is equal to σ · 5000. Note the logarithmic scale on the y axis.


Reaction to Sudden Drift

T40I10kD1MP6 drifts to T50I10kD1MP6C05 dataset

Reaction time grows linearly with window size


Reaction to Gradual Drift

Fast reaction with small windows
Stable response with big windows


Analyzing MOVIELENS (I)

About 10 million ratings over 10681 movies by 71567 users
Static data set for movie rating (from 29 Jan 1996 to 15 Aug 2007)
Movies grouped by rating time (every 5 minutes)
Transactions passed in ascending time to create a stream
Stream of 620,000 transactions with average length 10.4

Results:
Evolution of popular movies over time
Unnoticed with static dataset analysis
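The preprocessing step can be pictured with a short Python sketch that buckets ratings into 5-minute windows and emits one transaction per non-empty window, in time order. The input format, (timestamp in seconds, movie) pairs, and the function name are assumptions; the slides do not say whether ratings were also grouped by user.

```python
from collections import defaultdict


def ratings_to_stream(ratings, window_seconds=300):
    """Group rated movies into one transaction per time window, oldest first."""
    buckets = defaultdict(set)
    for timestamp, movie in ratings:
        buckets[timestamp // window_seconds].add(movie)
    return [frozenset(buckets[key]) for key in sorted(buckets)]


# Hypothetical ratings: (seconds since start, movie label).
ratings = [(0, "A"), (120, "B"), (299, "A"), (300, "C"), (950, "B")]
stream = ratings_to_stream(ratings)
```

Windows with no ratings produce no transaction, so the five ratings above become three transactions.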


Analyzing MOVIELENS (II)

Date      Frequent itemsets

Dec 2001  Lord of the Rings: The Fellowship of the Ring, The (2001); Beautiful Mind, A (2001).
          Harry Potter and the Sorcerer’s Stone (2001); Lord of the Rings: The Fellowship of the Ring, The (2001).

Jul 2002  Spider-Man (2002); Star Wars: Episode II - Attack of the Clones (2002).
          Bourne Identity, The (2002); Minority Report (2002).

Dec 2002  Lord of the Rings: The Fellowship of the Ring, The (2001); Lord of the Rings: The Two Towers, The (2002).
          Minority Report (2002); Signs (2002).

Jul 2003  Lord of the Rings: The Fellowship of the Ring, The (2001); Lord of the Rings: The Two Towers, The (2002).
          Lord of the Rings: The Two Towers, The (2002); Pirates of the Caribbean: The Curse of the Black Pearl (2003).


Analysis

Model: the t-th itemset is drawn independently from a distribution D_t on the set of all transactions

Theorem
Assume that D_{t−W} = ··· = D_{t−1} = D_t, that is, no distribution change in the previous W time steps. Let O_t be the set of FCI output by IncMine(σ, r) at time t. Then, for every itemset X and every δ ∈ (0,1),

1 if σ(X, D_t) ≤ (1 − ε)σ then, with probability at least 1 − δ, X is not in O_t.

2 if σ(X, D_t) ≥ (1 + ε)σ then, with probability at least 1 − δ, X is in O_t.

provided ε ≥ f(W, B, σ, δ) and r ≤ g(W, B, σ, δ).

Bonus: The analysis reveals that the relaxation rate r(.) in the original paper is not optimal. Non-promising sets can be dropped much earlier, and the parameter r is not needed.


Conclusions

Perfect integration with MOA
Good accuracy and performance compared with MOMENT
Good throughput and reasonable memory consumption
Good adaptivity to concept drift
Analyzable under common probabilistic assumptions
Usable in real contexts
