SEEM4630 2015-2016 Tutorial 3 – Frequent Itemset Mining

Upload: pabla

Post on 25-Jan-2016


Page 1: SEEM4630 2013-2014

SEEM4630 2015-2016

Tutorial 3 – Frequent Itemset Mining


Frequent Itemsets
• Frequent itemset: a set of items that occurs frequently in a data set
• Itemset: a set of one or more items
• k-itemset: X = {x1, …, xk}
• Mining algorithms: Apriori, FP-growth

Tid   Items bought
10    Beer, Nuts, Diaper
20    Beer, Coffee, Diaper
30    Beer, Diaper, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Diaper, Eggs, Beer


Support & Confidence
• Support
  • (Absolute) support, or support count, of X: the number of transactions in which the itemset X occurs
  • (Relative) support, s: the fraction of transactions that contain X (i.e., the probability that a transaction contains X)
  • An itemset X is frequent if X's support is no less than a minsup threshold
• Confidence (association rule X → Y)
  • c = sup(X ∪ Y) / sup(X), the conditional probability Pr(Y|X) = Pr(X ∧ Y) / Pr(X) that a transaction having X also contains Y
• Find all the rules X → Y with minimum support and confidence
  • sup(X ∪ Y) ≥ minsup
  • sup(X ∪ Y) / sup(X) ≥ minconf
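The definitions above can be checked on the "Items bought" table with a few lines of Python. This is only an illustrative sketch: the function names `support_count`, `support` and `confidence` are made up for this example and do not come from the tutorial.

```python
from fractions import Fraction

# Transactions from the "Items bought" table (Tid -> items).
transactions = {
    10: {"Beer", "Nuts", "Diaper"},
    20: {"Beer", "Coffee", "Diaper"},
    30: {"Beer", "Diaper", "Eggs"},
    40: {"Nuts", "Eggs", "Milk"},
    50: {"Nuts", "Coffee", "Diaper", "Eggs", "Beer"},
}

def support_count(itemset):
    """Absolute support: number of transactions containing every item."""
    items = set(itemset)
    return sum(1 for t in transactions.values() if items <= t)

def support(itemset):
    """Relative support: fraction of transactions containing the itemset."""
    return Fraction(support_count(itemset), len(transactions))

def confidence(x, y):
    """Confidence of the rule X -> Y, i.e. sup(X u Y) / sup(X)."""
    return Fraction(support_count(set(x) | set(y)), support_count(x))

print(support_count({"Beer", "Diaper"}))   # 4 (Tids 10, 20, 30, 50)
print(support({"Beer", "Diaper"}))         # 4/5
print(confidence({"Beer"}, {"Diaper"}))    # 1
print(confidence({"Nuts"}, {"Diaper"}))    # 2/3
```

With minsup = 50% and minconf = 50%, for instance, Beer → Diaper would satisfy both thresholds, while Nuts → Diaper (relative support 2/5) would fail the support test.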


Apriori Principle
• If an itemset is frequent, then all of its subsets must also be frequent
• If an itemset is infrequent, then all of its supersets must be infrequent too
• (X ⇒ Y) is equivalent to (¬Y ⇒ ¬X)

[Figure: itemset lattice over the items A to E, from null through the 1-itemsets A, B, C, D, E up to ABCDE, with the frequent and infrequent regions marked]


Apriori: A Candidate Generation & Test Approach
• Initially, scan DB once to get the frequent 1-itemsets
• Loop
  • Generate length-(k+1) candidate itemsets from length-k frequent itemsets
  • Test the candidates against DB
• Terminate when no frequent or candidate set can be generated
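The loop above can be sketched in Python as follows. This is a minimal, unoptimised illustration rather than a reference implementation: the function name `apriori`, the helper `count` and the simple pairwise-union candidate generation are assumptions made for this example.

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Return {frozenset: support count} for all itemsets with count >= minsup."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        # One pass over the database to count every candidate.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        return {c: n for c, n in counts.items() if n >= minsup}

    # Initial scan: frequent 1-itemsets.
    items = {item for t in transactions for item in t}
    level = count({frozenset([i]) for i in items})
    result = dict(level)
    k = 1
    while level:
        # Generate (k+1)-candidates as unions of two frequent k-itemsets,
        # keeping only those whose k-subsets are all frequent.
        candidates = set()
        for a, b in combinations(level, 2):
            union = a | b
            if len(union) == k + 1 and all(
                frozenset(s) in level for s in combinations(union, k)
            ):
                candidates.add(union)
        level = count(candidates)   # test the candidates against the DB
        result.update(level)
        k += 1
    return result

transactions = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Beer"},
]
frequent = apriori(transactions, minsup=3)
for itemset, n in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

On the "Items bought" table with an absolute minsup of 3, the only frequent itemset larger than a single item is {Beer, Diaper}, with support count 4.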


Generate candidate itemsets
• Example
  • Frequent 3-itemsets: {1, 2, 3}, {1, 2, 4}, {1, 2, 5}, {1, 3, 4}, {1, 3, 5}, {2, 3, 4}, {2, 3, 5} and {3, 4, 5}
  • Candidate 4-itemsets: {1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 4, 5}, {1, 3, 4, 5}, {2, 3, 4, 5}
  • Which need not be counted? {1, 2, 4, 5}, {1, 3, 4, 5}, {2, 3, 4, 5}: each contains a 3-subset that is not frequent ({1, 4, 5} or {2, 4, 5}), so by the Apriori principle it cannot be frequent
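The example can be reproduced with the usual join-and-prune scheme: join two sorted k-itemsets that share their first k-1 items, then drop any candidate that has an infrequent k-subset. The sketch below assumes that scheme; the function name `generate_candidates` and its three-part return value are choices made for this illustration.

```python
from itertools import combinations

def generate_candidates(frequent_k):
    """Join sorted k-itemsets sharing their first k-1 items, then prune."""
    frequent = {tuple(sorted(s)) for s in frequent_k}
    ordered = sorted(frequent)
    k = len(ordered[0])
    joined = []
    for a, b in combinations(ordered, 2):
        if a[:-1] == b[:-1]:               # same (k-1)-prefix
            joined.append(a + (b[-1],))
    kept, pruned = [], []
    for cand in joined:
        # A candidate survives only if every k-subset is frequent.
        ok = all(sub in frequent for sub in combinations(cand, k))
        (kept if ok else pruned).append(cand)
    return joined, kept, pruned

frequent_3 = [{1, 2, 3}, {1, 2, 4}, {1, 2, 5}, {1, 3, 4},
              {1, 3, 5}, {2, 3, 4}, {2, 3, 5}, {3, 4, 5}]
joined, kept, pruned = generate_candidates(frequent_3)
print("candidates:", joined)
print("to count:  ", kept)     # [(1, 2, 3, 4), (1, 2, 3, 5)]
print("pruned:    ", pruned)   # [(1, 2, 4, 5), (1, 3, 4, 5), (2, 3, 4, 5)]
```

Only the two surviving candidates have to be counted against the database.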


Maximal vs Closed Frequent Itemsets
• An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X
• An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X
• Closed frequent itemsets are lossless: the support of any frequent itemset can be deduced from the closed frequent itemsets

[Figure: nested sets, Frequent Itemsets ⊇ Closed Frequent Itemsets ⊇ Maximal Frequent Itemsets]


Maximal vs Closed Frequent Itemsets

minsup = 2

TID   Items
1     ABC
2     ABCD
3     BCE
4     ACDE
5     DE

# Closed = 9
# Maximal = 4

[Figure: itemset lattice over the items A to E, each node annotated with the TIDs of the transactions that contain it (for example A: 1, 2, 4; AB: 1, 2; ABC: 1, 2). Legend: "Closed and maximal frequent", "Closed but not maximal"]
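The two counts can be verified by brute force on the five transactions. The sketch below enumerates every itemset over A to E, which is fine for five items but is not how a real miner would proceed; the variable names are chosen for this example only.

```python
from itertools import combinations

transactions = {1: "ABC", 2: "ABCD", 3: "BCE", 4: "ACDE", 5: "DE"}
minsup = 2

# Support count of every frequent itemset (exhaustive enumeration).
items = sorted(set("".join(transactions.values())))
support = {}
for k in range(1, len(items) + 1):
    for combo in combinations(items, k):
        s = frozenset(combo)
        n = sum(1 for t in transactions.values() if s <= set(t))
        if n >= minsup:
            support[s] = n

# Closed: no proper superset with the same support.
# (Such a superset would itself be frequent, so checking `support` suffices.)
closed = {x for x in support
          if not any(x < y and support[y] == support[x] for y in support)}
# Maximal: no frequent proper superset at all.
maximal = {x for x in support if not any(x < y for y in support)}

def label(s):
    return "".join(sorted(s))

print(len(support), "frequent itemsets")
print(len(closed), "closed: ", sorted(label(x) for x in closed))
print(len(maximal), "maximal:", sorted(label(x) for x in maximal))
```

The run reports 14 frequent itemsets, of which 9 are closed (C, D, E, AC, BC, CE, DE, ABC, ACD) and 4 are maximal (CE, DE, ABC, ACD); every maximal itemset is also closed.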


Pattern-Growth Approach: Mining Frequent Patterns Without Candidate Generation

The FP-Growth Approach
• Depth-first search (Apriori: breadth-first search)
• Avoid explicit candidate generation

FP-tree construction:
• Scan DB once, find the frequent 1-itemsets (single-item patterns)
• Sort the frequent items in frequency-descending order to get the f-list
• Scan DB again, construct the FP-tree

FP-Growth approach:
• For each frequent item, construct its conditional pattern base, and then its conditional FP-tree
• Repeat the process on each newly created conditional FP-tree
• Stop when the resulting FP-tree is empty or contains only one path; a single path generates all the combinations of its sub-paths, each of which is a frequent pattern
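The construction and mining steps above can be sketched compactly in Python. This is a simplified illustration under a few assumptions: the names `Node`, `build_tree` and `fp_growth` are invented here, ties in the f-list are broken alphabetically, and the single-path shortcut is omitted (the recursion simply continues until the conditional tree is empty).

```python
from collections import defaultdict

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_tree(weighted_transactions, min_sup):
    """Two passes: count items, then insert each transaction in f-list order."""
    freq = defaultdict(int)
    for items, w in weighted_transactions:
        for i in set(items):
            freq[i] += w
    freq = {i: n for i, n in freq.items() if n >= min_sup}
    # f-list: descending frequency; ties broken alphabetically (arbitrary choice).
    rank = {i: r for r, i in enumerate(sorted(freq, key=lambda i: (-freq[i], i)))}
    root = Node(None, None)
    header = defaultdict(list)            # item -> all tree nodes carrying it
    for items, w in weighted_transactions:
        node = root
        for i in sorted((i for i in set(items) if i in rank), key=rank.get):
            if i not in node.children:
                node.children[i] = Node(i, node)
                header[i].append(node.children[i])
            node = node.children[i]
            node.count += w
    return header, freq

def fp_growth(weighted_transactions, min_sup, suffix=frozenset()):
    """Yield (itemset, support) for every frequent itemset extending `suffix`."""
    header, freq = build_tree(weighted_transactions, min_sup)
    for item, nodes in header.items():
        pattern = suffix | {item}
        yield pattern, freq[item]
        # Conditional pattern base: prefix path of each node, weighted by its count.
        base = []
        for node in nodes:
            path, p = [], node.parent
            while p.item is not None:
                path.append(p.item)
                p = p.parent
            if path:
                base.append((path, node.count))
        # Recurse on the conditional FP-tree built from that base.
        yield from fp_growth(base, min_sup, pattern)

# Five transactions over the items f, c, a, b, m, p, each with weight 1.
db = [("fcamp", 1), ("fcabm", 1), ("fb", 1), ("cbp", 1), ("fcamp", 1)]
patterns = dict(fp_growth(db, min_sup=3))
for itemset, n in sorted(patterns.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print("".join(sorted(itemset)), n)
```

With min_sup = 3 this small database yields 18 frequent itemsets, the largest being {f, c, a, m} with support 3. Because ties are broken alphabetically, the tree may order f and c differently from a hand-drawn one, which does not change the mined patterns.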


Find Patterns Having p From p-conditional Database
• Start at the frequent item header table in the FP-tree
• Traverse the FP-tree by following the link of each frequent item p
• Accumulate all of the transformed prefix paths of item p to form p's conditional pattern base

Transactions (frequent items in f-list order)
TID   Items
1     f, c, a, m, p
2     f, c, a, b, m
3     f, b
4     c, b, p
5     f, c, a, m, p

Header Table (the "head" entry of each item points to its first node in the FP-tree)
Item   Frequency
f      4
c      4
a      3
b      3
m      3
p      3

FP-tree
{}
  f:4
    c:3
      a:3
        m:2
          p:2
        b:1
          m:1
    b:1
  c:1
    b:1
      p:1

Conditional pattern bases
Item   Cond. pattern base
c      f:3
a      fc:3
b      fca:1, f:1, c:1
m      fca:2, fcab:1
p      fcam:2, cb:1
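The conditional pattern bases in the table can be reproduced without building the tree, because a node's prefix path in the FP-tree is simply the shared prefix of the ordered transactions that pass through it. The sketch below relies on that observation; `ordered_db` and `conditional_pattern_bases` are names made up for this example, and a real FP-growth implementation would follow the node links instead.

```python
from collections import Counter, defaultdict

# Transactions with items already in f-list order (f, c, a, b, m, p).
ordered_db = ["fcamp", "fcabm", "fb", "cbp", "fcamp"]

def conditional_pattern_bases(db):
    """Map each item to a Counter of the prefixes that precede it."""
    bases = defaultdict(Counter)
    for transaction in db:
        for pos, item in enumerate(transaction):
            prefix = transaction[:pos]
            if prefix:                      # an empty prefix contributes nothing
                bases[item][prefix] += 1
    return bases

bases = conditional_pattern_bases(ordered_db)
for item in "cabmp":
    print(item, dict(bases[item]))
# c {'f': 3}
# a {'fc': 3}
# b {'fca': 1, 'f': 1, 'c': 1}
# m {'fca': 2, 'fcab': 1}
# p {'fcam': 2, 'cb': 1}
```

Item f has no entry because it is always first in the f-list order, so its prefix is always empty.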


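[Figure: six FP-tree panels labelled (1) to (6). The first is the full FP-tree; the others are the trees obtained by projecting on each suffix item in turn (+ p, + m, + b, + a, + c), ending with f:4. Node labels per panel:
  full tree: f:4, c:3, a:3, m:2, p:2, b:1, m:1, b:1, c:1, b:1, p:1
  + p: f:2, c:2, a:2, m:2, c:1, b:1, p:1
  + m: f:3, c:3, a:3, b:1
  + b: f:2, c:1, a:1, c:1
  + a: f:3, c:3
  + c: f:3]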


FP-Growth

[Figure: the transaction database is projected on one suffix item at a time; each step shows the projected database for that item and the database that remains once the item is removed. Entries are written TID: items.]

Full database
  5: f, c, a, m, p | 4: c, b, p | 3: f, b | 2: f, c, a, b, m | 1: f, c, a, m, p

+ p
  Projected on p: 5: f, c, a, m | 4: c, b | 1: f, c, a, m
  Remaining:      5: f, c, a, m | 4: c, b | 3: f, b | 2: f, c, a, b, m | 1: f, c, a, m

+ m
  Projected on m: 5: f, c, a | 2: f, c, a, b | 1: f, c, a
  Remaining:      5: f, c, a | 4: c, b | 3: f, b | 2: f, c, a, b | 1: f, c, a

+ b
  Projected on b: 4: c | 3: f | 2: f, c, a
  Remaining:      5: f, c, a | 4: c | 3: f | 2: f, c, a | 1: f, c, a

+ a
  Projected on a: 5: f, c | 2: f, c | 1: f, c
  Remaining:      5: f, c | 4: c | 3: f | 2: f, c | 1: f, c


FP-Growth

[Figure: the same projection sequence as on the previous slide (+ p, + m, + b, + a), with the steps numbered (1) to (6) and the last two steps added. Entries are written TID: items.]

+ c
  Projected on c: 5: f | 4: (empty) | 2: f | 1: f

f: 1, 2, 3, 5 (TIDs of the transactions that contain f)


FP-Growth

min_sup = 3

[Figure: the projection sequence of the previous slides, completed with the frequent patterns found at each step. Entries are written TID: items; patterns are written itemset: support.]

+ p
  Projected on p, frequent items only: 5: c | 4: c | 1: c
  Patterns: p: 3, cp: 3

+ m
  Projected on m, frequent items only: 5: f, c, a | 2: f, c, a | 1: f, c, a
  Patterns: m: 3, fm: 3, cm: 3, am: 3, fcm: 3, fam: 3, cam: 3, fcam: 3

+ b
  Projected on b: 4: c | 3: f | 2: f, c, a (no item reaches min_sup)
  Patterns: b: 3

+ a
  Projected on a: 5: f, c | 2: f, c | 1: f, c
  Patterns: a: 3, fa: 3, ca: 3, fca: 3

+ c
  Projected on c: 5: f | 4: (empty) | 2: f | 1: f
  Patterns: c: 4, fc: 3

f (TIDs 1, 2, 3, 5)
  Patterns: f: 4
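The projection sequence in these figures can be mimicked directly on lists of transactions, without an FP-tree. The sketch below is one possible reading of that procedure: the function name `pattern_growth`, the string keys such as "fcam", and the choice to restrict each projected database to the items that precede the suffix item in the f-list are assumptions made for this example.

```python
def pattern_growth(db, order, min_sup, suffix=()):
    """Mine frequent itemsets by repeated projection.

    db: list of item lists; order: the f-list, most frequent item first.
    Returns {pattern string: support count}.
    """
    patterns = {}
    # Process items from the end of the f-list (p first in the figures).
    for idx in range(len(order) - 1, -1, -1):
        item = order[idx]
        containing = [t for t in db if item in t]
        if len(containing) < min_sup:
            continue
        pattern = (item,) + suffix
        patterns["".join(pattern)] = len(containing)
        # Projected database: keep only items that precede `item` in the f-list.
        earlier = order[:idx]
        projected = [[i for i in t if i in earlier] for t in containing]
        patterns.update(pattern_growth(projected, earlier, min_sup, pattern))
    return patterns

db = [list("fcamp"), list("fcabm"), list("fb"), list("cbp"), list("fcamp")]
f_list = list("fcabmp")
found = pattern_growth(db, f_list, min_sup=3)
for name, n in found.items():
    print(f"{name}: {n}")
```

Restricting each projection to earlier f-list items has the same effect as removing an item from the database after it has been processed, so every itemset is reported exactly once. The run produces the same 18 patterns as the figure, from p: 3 and cp: 3 through to f: 4.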