5 detailed


TRANSCRIPT

Slide 1/55

Mining Association Rules in Large Databases

Slide 2/55

    Association Rule Mining

Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction.

    Market-Basket transactions

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

    Example of Association Rules

{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}

Implication means co-occurrence, not causality!

Slide 3/55

    Definition: Frequent Itemset

Itemset
  A collection of one or more items
  Example: {Milk, Bread, Diaper}

k-itemset
  An itemset that contains k items

Support count (σ)
  Frequency of occurrence of an itemset
  E.g., σ({Milk, Bread, Diaper}) = 2

Support (s)
  Fraction of transactions that contain an itemset
  E.g., s({Milk, Bread, Diaper}) = 2/5

Frequent Itemset
  An itemset whose support is greater than or equal to a minsup threshold

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

I assume that itemsets are ordered lexicographically.
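The definitions above map directly to a few lines of code. A minimal sketch (not part of the slides) computing the support count and support of an itemset over the market-basket transactions listed above:

```python
# Minimal sketch: support count and support of an itemset
# over the five market-basket transactions from the slide.

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset):
    """sigma(X): number of transactions that contain every item of X."""
    return sum(1 for t in transactions if set(itemset) <= t)

def support(itemset):
    """s(X): fraction of transactions that contain X."""
    return support_count(itemset) / len(transactions)

print(support_count({"Milk", "Bread", "Diaper"}))  # 2
print(support({"Milk", "Bread", "Diaper"}))        # 0.4 (= 2/5)
```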

Slide 4/55

    Definition: Association Rule

Let D be a database of transactions, e.g. the table below.

Let I be the set of items that appear in the database, e.g., I = {A, B, C, D, E, F}.

A rule is defined by X → Y, where X ⊆ I, Y ⊆ I, and X ∩ Y = ∅.
  e.g.: {B, C} → {E} is a rule

Transaction ID  Items Bought
2000            A, B, C
1000            A, C
4000            A, D
5000            B, E, F

Slide 5/55

    Definition: Association Rule

Association Rule
  An implication expression of the form X → Y, where X and Y are itemsets.
  Example: {Milk, Diaper} → {Beer}

Rule Evaluation Metrics
  Support (s): fraction of transactions that contain both X and Y
  Confidence (c): measures how often items in Y appear in transactions that contain X

Example: {Milk, Diaper} → {Beer}

  s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4
  c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 ≈ 0.67

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke
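A minimal sketch (an illustration, not slide material) that reproduces this computation:

```python
# Minimal sketch: support and confidence of {Milk, Diaper} -> {Beer}
# over the five market-basket transactions from the slide.

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def sigma(itemset):
    """Support count: number of transactions containing every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

X, Y = {"Milk", "Diaper"}, {"Beer"}
support = sigma(X | Y) / len(transactions)   # 2/5 = 0.4
confidence = sigma(X | Y) / sigma(X)         # 2/3 ~= 0.67
print(support, confidence)
```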

Slide 6/55

Rule Measures: Support and Confidence

Find all the rules X → Y with minimum confidence and support:
  support, s: probability that a transaction contains X ∪ Y
  confidence, c: conditional probability that a transaction having X also contains Y

Transaction ID  Items Bought
2000            A, B, C
1000            A, C
4000            A, D
5000            B, E, F

With minimum support 50% and minimum confidence 50%, we have:
  A → C (support 50%, confidence 66.6%)
  C → A (support 50%, confidence 100%)

[Figure: Venn diagram of customers who buy beer, customers who buy diapers, and those who buy both.]

Slide 7/55

TID  Date      Items bought
100  10/10/99  {F, A, D, B}
200  15/10/99  {D, A, C, E, B}
300  19/10/99  {C, A, B, E}
400  20/10/99  {B, A, D}

Example

What is the support and confidence of the rule {B, D} → {A}?

  Support: percentage of tuples that contain {A, B, D} = 3/4 = 75%

  Confidence: (number of tuples that contain {A, B, D}) / (number of tuples that contain {B, D}) = 3/3 = 100%

Remember: conf(X → Y) = sup(X ∪ Y) / sup(X)

Slide 8/55

    Association Rule Mining Task

Given a set of transactions T, the goal of association rule mining is to find all rules having
  support ≥ minsup threshold
  confidence ≥ minconf threshold

Brute-force approach:
  List all possible association rules
  Compute the support and confidence for each rule
  Prune rules that fail the minsup and minconf thresholds
  => Computationally prohibitive!

Slide 9/55

    Mining Association Rules

    Example of Rules:

{Milk, Diaper} → {Beer} (s=0.4, c=0.67)
{Milk, Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper, Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk, Beer} (s=0.4, c=0.5)
{Milk} → {Diaper, Beer} (s=0.4, c=0.5)

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

    Observations:

    All the above rules are binary partitions of the same itemset:

    {Milk, Diaper, Beer}

Rules originating from the same itemset have identical support but can have different confidence

    Thus, we may decouple the support and confidence requirements

Slide 10/55

    Mining Association Rules

    Two-step approach:

1. Frequent Itemset Generation
   Generate all itemsets whose support ≥ minsup

2. Rule Generation
   Generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset

Frequent itemset generation is still computationally expensive.

Slide 11/55

    Frequent Itemset Generation

[Figure: the lattice of all itemsets over {A, B, C, D, E}: null at the top; 1-itemsets A, B, C, D, E; 2-itemsets AB ... DE; 3-itemsets ABC ... CDE; 4-itemsets ABCD ... BCDE; and ABCDE at the bottom.]

Given d items, there are 2^d possible candidate itemsets.

Slide 12/55

Frequent Itemset Generation: Brute-force approach

  Each itemset in the lattice is a candidate frequent itemset
  Count the support of each candidate by scanning the database
  Match each transaction against every candidate
  Complexity ~ O(NMw) => expensive, since M = 2^d!

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

[Figure: N transactions (the rows of the table) matched against a list of M candidate itemsets; w denotes the maximum transaction width.]

Slide 13/55

Computational Complexity

Given d unique items:

  Total number of itemsets = 2^d

    Total number of possible association rules:

  R = Σ_{k=1}^{d-1} [ C(d, k) × Σ_{j=1}^{d-k} C(d-k, j) ] = 3^d - 2^(d+1) + 1

    If d=6, R = 602 rules
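A quick sketch (not from the slides) that checks the count for d = 6 both by the double sum and by the closed form:

```python
from math import comb

def num_rules(d):
    """Total number of association rules over d items, counted directly."""
    return sum(comb(d, k) * sum(comb(d - k, j) for j in range(1, d - k + 1))
               for k in range(1, d))

d = 6
print(num_rules(d))            # 602
print(3**d - 2**(d + 1) + 1)   # 602 (closed form)
```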

Slide 14/55

    Frequent Itemset Generation Strategies

Reduce the number of candidates (M)
  Complete search: M = 2^d
  Use pruning techniques to reduce M

Reduce the number of transactions (N)
  Reduce the size of N as the size of the itemset increases
  Used by DHP and vertical-based mining algorithms

Reduce the number of comparisons (NM)
  Use efficient data structures to store the candidates or transactions
  No need to match every candidate against every transaction

Slide 15/55

    Reducing Number of Candidates

Apriori principle: if an itemset is frequent, then all of its subsets must also be frequent.

The Apriori principle holds due to the following property of the support measure:

  ∀X, Y: (X ⊆ Y) ⇒ s(X) ≥ s(Y)

The support of an itemset never exceeds the support of its subsets. This is known as the anti-monotone property of support.

Slide 16/55

    Example

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

    s(Bread) > s(Bread, Beer)

    s(Milk) > s(Bread, Milk)

    s(Diaper, Beer) > s(Diaper, Beer, Coke)

Slide 17/55 (no text extracted)

Slide 18/55

    Illustrating Apriori Principle

Minimum Support = 3

Items (1-itemsets):

  Item    Count
  Bread   4
  Coke    2
  Milk    4
  Beer    3
  Diaper  4
  Eggs    1

Pairs (2-itemsets) (no need to generate candidates involving Coke or Eggs):

  Itemset          Count
  {Bread, Milk}    3
  {Bread, Beer}    2
  {Bread, Diaper}  3
  {Milk, Beer}     2
  {Milk, Diaper}   3
  {Beer, Diaper}   3

Triplets (3-itemsets):

  Itemset                Count
  {Bread, Milk, Diaper}  3

If every subset is considered: C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41 candidates
With support-based pruning: 6 + 6 + 1 = 13 candidates

Slide 19/55

The Apriori Algorithm (the general idea)

1. Find the frequent 1-itemsets and put them into L1 (k = 1)
2. Use Lk to generate a collection of candidate itemsets Ck+1 of size (k+1)
3. Scan the database to find which itemsets in Ck+1 are frequent and put them into Lk+1
4. If Lk+1 is not empty:
     k = k+1
     GOTO 2

    R. Agrawal, R. Srikant: "Fast Algorithms for Mining Association Rules",

    Proc. of the 20th Int'l Conference on Very Large Databases, Santiago, Chile, Sept. 1994.

Slide 20/55

    The Apriori Algorithm

Pseudo-code:
  Ck: candidate itemsets of size k
  Lk: frequent itemsets of size k

  L1 = {frequent items};
  for (k = 1; Lk != ∅; k++) do begin
      Ck+1 = candidates generated from Lk;   // join and prune steps
      for each transaction t in the database do
          increment the count of all candidates in Ck+1 that are contained in t
      Lk+1 = candidates in Ck+1 with min_support (frequent)
  end
  return ∪k Lk;

Important steps in candidate generation:
  Join step: Ck+1 is generated by joining Lk with itself
  Prune step: any k-itemset that is not frequent cannot be a subset of a frequent (k+1)-itemset
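A compact, runnable sketch of this loop in Python; it illustrates the pseudo-code above and is not the authors' implementation. Candidate generation joins frequent k-itemsets that share their first k-1 items and prunes candidates that have an infrequent subset:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {frozenset(itemset): support_count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    min_count = min_support * len(transactions)

    def count(candidates):
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    # L1: frequent 1-itemsets
    items = {i for t in transactions for i in t}
    counts = count({frozenset([i]) for i in items})
    L = {c: s for c, s in counts.items() if s >= min_count}
    frequent = dict(L)

    k = 1
    while L:
        # Join step: merge itemsets that share their first k-1 items (lexicographic order)
        prev = sorted(sorted(itemset) for itemset in L)
        candidates = set()
        for a, b in combinations(prev, 2):
            if a[:-1] == b[:-1]:
                cand = frozenset(a) | frozenset(b)
                # Prune step: every k-subset of the candidate must be frequent
                if all(frozenset(sub) in L for sub in combinations(cand, k)):
                    candidates.add(cand)
        counts = count(candidates)
        L = {c: s for c, s in counts.items() if s >= min_count}
        frequent.update(L)
        k += 1
    return frequent

# Example database from the next slide, min_sup = 50%:
D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
for itemset, cnt in sorted(apriori(D, 0.5).items(),
                           key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), cnt)
```

Running it on that database reproduces L1 = {1}, {2}, {3}, {5}; L2 = {1 3}, {2 3}, {2 5}, {3 5}; and L3 = {2 3 5}.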

Slide 21/55

The Apriori Algorithm: Example (min_sup = 2, i.e., 50%)

Database D:

  TID  Items
  100  1 3 4
  200  2 3 5
  300  1 2 3 5
  400  2 5

Scan D → C1:

  itemset  sup.
  {1}      2
  {2}      3
  {3}      3
  {4}      1
  {5}      3

L1:

  itemset  sup.
  {1}      2
  {2}      3
  {3}      3
  {5}      3

C2 (generated from L1): {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}

Scan D → C2 with counts:

  itemset  sup.
  {1 2}    1
  {1 3}    2
  {1 5}    1
  {2 3}    2
  {2 5}    3
  {3 5}    2

L2:

  itemset  sup.
  {1 3}    2
  {2 3}    2
  {2 5}    3
  {3 5}    2

C3 (generated from L2): {2 3 5}

Scan D → L3:

  itemset   sup.
  {2 3 5}   2

Slide 22/55

    How to Generate Candidates?

Suppose the items in Lk are listed in an order.

Step 1: self-joining Lk (in SQL)

  insert into Ck+1
  select p.item1, p.item2, ..., p.itemk, q.itemk
  from Lk p, Lk q
  where p.item1 = q.item1, ..., p.itemk-1 = q.itemk-1, p.itemk < q.itemk
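The same self-join can be written directly over sorted tuples; a tiny sketch (illustrative only), applied to the L2 of the previous example:

```python
from itertools import combinations

def self_join(Lk):
    """Join step: pair up k-itemsets that agree on their first k-1 items
    and differ in the last one (the SQL condition p.item_k < q.item_k)."""
    Lk = sorted(Lk)
    Ck1 = []
    for p, q in combinations(Lk, 2):
        if p[:-1] == q[:-1] and p[-1] < q[-1]:
            Ck1.append(p + (q[-1],))
    return Ck1

L2 = [(1, 3), (2, 3), (2, 5), (3, 5)]
print(self_join(L2))  # [(2, 3, 5)]
```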

[Figure (from the rule-generation slides): the lattice of rules derived from the frequent itemset {A, B, C, D}, from ABCD => {} at the top, through BCD => A, ACD => B, ABD => C, ABC => D and CD => AB, BD => AC, BC => AD, AD => BC, AC => BD, AB => CD, down to D => ABC, C => ABD, B => ACD, A => BCD. A low-confidence rule and the rules below it in the lattice are pruned.]

Slide 40/55

Rule Generation for Apriori Algorithm

A candidate rule is generated by merging two rules that share the same prefix in the rule consequent:

  join(CD => AB, BD => AC) would produce the candidate rule D => ABC

  Prune rule D => ABC if its subset AD => BC does not have high confidence
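A small sketch of this rule-generation strategy (an illustration, not the paper's pseudocode): it grows consequents level by level, extends only the high-confidence survivors, and uses the market-basket transactions from the earlier slides:

```python
from itertools import combinations

# The market-basket transactions used throughout the slides.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def sigma(itemset):
    """Support count of an itemset."""
    return sum(1 for t in transactions if itemset <= t)

def gen_rules(itemset, min_conf):
    """Generate rules X -> (itemset - X) with confidence >= min_conf,
    growing consequents level by level and extending only survivors."""
    itemset = frozenset(itemset)
    sup = sigma(itemset)
    rules, consequents = [], [frozenset([i]) for i in itemset]
    while consequents:
        survivors = []
        for Y in consequents:
            X = itemset - Y
            if not X:
                continue
            conf = sup / sigma(X)
            if conf >= min_conf:
                rules.append((set(X), set(Y), conf))
                survivors.append(Y)   # only high-confidence consequents grow
        # Merge surviving consequents that differ in a single item.
        consequents = {a | b for a, b in combinations(survivors, 2)
                       if len(a | b) == len(a) + 1}
    return rules

for X, Y, conf in gen_rules({"Milk", "Diaper", "Beer"}, min_conf=0.6):
    print(X, "->", Y, round(conf, 2))
```

With min_conf = 0.6 this reproduces the high-confidence rules from the {Milk, Diaper, Beer} example: {Milk, Diaper} → {Beer}, {Milk, Beer} → {Diaper}, {Diaper, Beer} → {Milk}, and {Beer} → {Milk, Diaper}.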


Slide 41/55

    Is Apriori Fast Enough?

Performance Bottlenecks

The core of the Apriori algorithm:
  Use frequent (k-1)-itemsets to generate candidate frequent k-itemsets
  Use database scans and pattern matching to collect counts for the candidate itemsets

The bottleneck of Apriori: candidate generation
  Huge candidate sets:
    10^4 frequent 1-itemsets will generate 10^7 candidate 2-itemsets
    To discover a frequent pattern of size 100, e.g., {a1, a2, ..., a100}, one needs to generate 2^100 ≈ 10^30 candidates
  Multiple scans of the database:
    Needs (n + 1) scans, where n is the length of the longest pattern


Slide 42/55

    FP-growth: Mining Frequent Patterns

    Without Candidate Generation

Compress a large database into a compact Frequent-Pattern tree (FP-tree) structure
  highly condensed, but complete for frequent pattern mining
  avoids costly database scans

Develop an efficient, FP-tree-based frequent pattern mining method
  A divide-and-conquer methodology: decompose mining tasks into smaller ones
  Avoid candidate generation: sub-database test only!


Slide 43/55

FP-tree Construction from a Transactional DB

min_support = 3

Item  Frequency
f     4
c     4
a     3
b     3
m     3
p     3

TID  Items bought               (Ordered) frequent items
100  {f, a, c, d, g, i, m, p}   {f, c, a, m, p}
200  {a, b, c, f, l, m, o}      {f, c, a, b, m}
300  {b, f, h, j, o}            {f, b}
400  {b, c, k, s, p}            {c, b, p}
500  {a, f, c, e, l, p, m, n}   {f, c, a, m, p}

Steps:

1. Scan the DB once and find the frequent 1-itemsets (single-item patterns)
2. Order the frequent items in descending order of their frequency
3. Scan the DB again and construct the FP-tree (a code sketch follows below)
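A minimal sketch of these three steps (an illustration of the idea, not the original FP-growth code). Node-links and the header table are omitted, and items with equal frequency may be ordered differently than in the slides' figures, which does not affect correctness:

```python
from collections import Counter

class FPNode:
    def __init__(self, item, parent=None):
        self.item, self.count, self.parent, self.children = item, 0, parent, {}

def build_fp_tree(transactions, min_support):
    # Step 1: scan once, count single items, keep the frequent ones
    counts = Counter(item for t in transactions for item in t)
    frequent = {i: c for i, c in counts.items() if c >= min_support}
    # Step 2: rank frequent items by descending frequency
    order = {i: rank for rank, (i, _) in
             enumerate(sorted(frequent.items(), key=lambda kv: -kv[1]))}
    # Step 3: scan again, insert each ordered frequent-item list into the tree
    root = FPNode(None)
    for t in transactions:
        items = sorted((i for i in t if i in frequent), key=order.get)
        node = root
        for item in items:
            node = node.children.setdefault(item, FPNode(item, node))
            node.count += 1
    return root

def show(node, depth=0):
    if node.item is not None:
        print("  " * depth + f"{node.item}:{node.count}")
    for child in node.children.values():
        show(child, depth + 1)

DB = [list("facdgimp"), list("abcflmo"), list("bfhjo"),
      list("bcksp"), list("afcelpmn")]
show(build_fp_tree(DB, min_support=3))
```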

Slide 44/55

    FP-tree Construction

[Figure: the FP-tree after inserting TID 100 ({f, c, a, m, p}): a single path root → f:1 → c:1 → a:1 → m:1 → p:1.]

Slide 45/55

    FP-tree Construction

[Figure: after also inserting TID 200 ({f, c, a, b, m}): the shared prefix is incremented to f:2 → c:2 → a:2, and under a:2 there are two branches, m:1 → p:1 and b:1 → m:1.]

Slide 46/55

    FP-tree Construction

[Figure: after also inserting TID 300 ({f, b}) and TID 400 ({c, b, p}): f becomes f:3 and gains a new child b:1, and a second branch c:1 → b:1 → p:1 is added directly under the root.]

Slide 47/55

    FP-tree Construction

[Figure: the complete FP-tree after all five transactions:
  root → f:4 → c:3 → a:3 → m:2 → p:2
                      a:3 → b:1 → m:1
         f:4 → b:1
  root → c:1 → b:1 → p:1
A header table (f:4, c:4, a:3, b:3, m:3, p:3) links each item, via node-links, to its occurrences in the tree.]

Slide 48/55

Benefits of the FP-tree Structure

Completeness
  never breaks a long pattern of any transaction
  preserves complete information for frequent pattern mining

Compactness
  reduces irrelevant information: infrequent items are gone
  frequency-descending ordering: more frequent items are more likely to be shared
  never larger than the original database (not counting node-links and counts)
  Example: for the Connect-4 DB, the compression ratio could be over 100


Slide 49/55

Mining Frequent Patterns Using the FP-tree

General idea (divide-and-conquer)
  Recursively grow frequent patterns using the FP-tree

Method
  For each item, construct its conditional pattern base, and then its conditional FP-tree
  Repeat the process on each newly created conditional FP-tree
  Until the resulting FP-tree is empty, or it contains only one path
  (a single path will generate all the combinations of its sub-paths, each of which is a frequent pattern)

Slide 50/55

Mining Frequent Patterns Using the FP-tree (cont'd)

Start with the last item in the order (i.e., p). Follow its node-links and traverse only the paths containing p. Accumulate all of the transformed prefix paths of that item to form its conditional pattern base.

  Conditional pattern base for p: fcam:2, cb:1

[Figure: the two FP-tree paths containing p: f:4 → c:3 → a:3 → m:2 → p:2 and c:1 → b:1 → p:1.]

Construct a new FP-tree from this conditional pattern base by merging the paths and keeping only the nodes that appear at least min_support times. This leaves a single branch, c:3. Thus we derive only one frequent pattern containing p (besides p itself): cp.

Slide 51/55

Mining Frequent Patterns Using the FP-tree (cont'd)

Move to the next least frequent item in the order, i.e., m. Follow its node-links and traverse only the paths containing m. Accumulate all of the transformed prefix paths of that item to form its conditional pattern base.

[Figure: the FP-tree paths containing m: f:4 → c:3 → a:3 → m:2 and f:4 → c:3 → a:3 → b:1 → m:1.]

m-conditional pattern base: fca:2, fcab:1

m-conditional FP-tree: the single path root → f:3 → c:3 → a:3 (b is dropped because its count, 1, is below min_support)

All frequent patterns that include m:

  m, fm, cm, am, fcm, fam, cam, fcam
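Because the m-conditional FP-tree is a single path (f:3, c:3, a:3), every subset of {f, c, a} combined with m is a frequent pattern; a tiny sketch (illustrative only):

```python
from itertools import chain, combinations

def patterns_from_single_path(path_items, suffix):
    """Single-path case: every sub-path combined with the suffix is frequent."""
    subsets = chain.from_iterable(combinations(path_items, r)
                                  for r in range(len(path_items) + 1))
    return [set(s) | {suffix} for s in subsets]

print(patterns_from_single_path(["f", "c", "a"], "m"))
# m, fm, cm, am, fcm, fam, cam, fcam (8 patterns)
```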

Slide 52/55

Properties of the FP-tree for Conditional Pattern Base Construction

Node-link property
  For any frequent item ai, all the possible frequent patterns that contain ai can be obtained by following ai's node-links, starting from ai's head in the FP-tree header table.

Prefix-path property
  To calculate the frequent patterns for a node ai in a path P, only the prefix sub-path of ai in P needs to be accumulated, and its frequency count should carry the same count as node ai.

Slide 53/55

Conditional Pattern Bases for the Example

Item  Conditional pattern base      Conditional FP-tree
f     empty                         empty
c     {(f:3)}                       {(f:3)}|c
a     {(fc:3)}                      {(f:3, c:3)}|a
b     {(fca:1), (f:1), (c:1)}       empty
m     {(fca:2), (fcab:1)}           {(f:3, c:3, a:3)}|m
p     {(fcam:2), (cb:1)}            {(c:3)}|p

Slide 54/55

    Why Is Frequent Pattern Growth Fast?

Performance studies show that
  FP-growth is an order of magnitude faster than Apriori, and is also faster than tree-projection

Reasoning
  No candidate generation, no candidate test
  Uses a compact data structure
  Eliminates repeated database scans
  The basic operations are counting and FP-tree building


Slide 55/55

FP-growth vs. Apriori: Scalability with the Support Threshold

[Chart: runtime (sec., 0 to 100) versus support threshold (%, 0 to 3) on data set T25I20D10K, with two series: D1 FP-growth runtime and D1 Apriori runtime. FP-growth scales much better than Apriori as the support threshold decreases.]
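For readers who want to try both algorithms on their own data, a short usage sketch. It assumes the third-party mlxtend library (not mentioned in the slides), whose frequent_patterns module provides apriori, fpgrowth, and association_rules:

```python
# pip install mlxtend pandas
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, fpgrowth, association_rules

dataset = [["Bread", "Milk"],
           ["Bread", "Diaper", "Beer", "Eggs"],
           ["Milk", "Diaper", "Beer", "Coke"],
           ["Bread", "Milk", "Diaper", "Beer"],
           ["Bread", "Milk", "Diaper", "Coke"]]

# One-hot encode the transactions into a boolean DataFrame
te = TransactionEncoder()
df = pd.DataFrame(te.fit(dataset).transform(dataset), columns=te.columns_)

# Frequent itemsets with support >= 0.6, by either algorithm
freq_apriori = apriori(df, min_support=0.6, use_colnames=True)
freq_fpgrowth = fpgrowth(df, min_support=0.6, use_colnames=True)

# Rules with confidence >= 0.7 from the frequent itemsets
rules = association_rules(freq_apriori, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```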