Association Analysis (3)

Upload: morrison
Post on 21-Jan-2016

TRANSCRIPT

Page 1: Association Analysis (3)

Association Analysis (3)

Page 2: Association Analysis (3)

Alternative Methods for Frequent Itemset Generation

• Traversal of Itemset Lattice
– General-to-specific vs. specific-to-general (top-down vs. bottom-up)

[Figure: three itemset lattices from null down to {a1,a2,...,an}, each with its frequent itemset border marked: (a) general-to-specific, (b) specific-to-general, (c) bidirectional]

• Apriori works general-to-specific: from the (k-1)-itemsets, it creates the k-itemsets (more "specific").

• Specific-to-general search is used for finding max-frequent itemsets: if a k-itemset in the lattice is frequent, then none of its subsets of size (k-1) can be max-frequent, so we do not need to examine them.

Page 3: Association Analysis (3)

Traversal of Itemset Lattice

• Search within an equivalence class first before moving to another equivalence class.
– The APRIORI algorithm (implicitly) partitions itemsets into equivalence classes based on their length (same length, same class).
– However, we can also search by partitioning (implicitly) according to the prefix or suffix labels of an itemset.

[Figure: the itemset lattice over {A,B,C,D} — levels null; A B C D; AB AC AD BC BD CD; ABC ABD ACD BCD; ABCD — organized as (a) a prefix tree and (b) a suffix tree]

Alternative Methods for Frequent Itemset Generation

Page 4: Association Analysis (3)

• Traversal of Itemset Lattice
– Breadth-first vs. depth-first

[Figure: (a) breadth-first and (b) depth-first traversal of the lattice]

Alternative Methods for Frequent Itemset Generation

The APRIORI algorithm traverses the lattice in a breadth-first manner: it first discovers all the frequent 1-itemsets, followed by the frequent 2-itemsets, and so on.

The lattice can also be traversed in a depth-first way.

An algorithm can start from, say, node a and count its support to determine whether it is frequent.

If so, the algorithm progressively expands the next level of nodes, i.e., ab, abc, and so on, until an infrequent node is reached, say, abcd.

It then backtracks to another branch, say, abce, …
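The depth-first expand-and-backtrack scheme described above can be sketched as follows. The database, items, and minimum support threshold here are toy values assumed purely for illustration; `dfs` and `support` are hypothetical helper names.

```python
# Toy transaction database and threshold, assumed for illustration.
transactions = [
    {'a', 'b', 'c'}, {'a', 'b'}, {'a', 'c', 'd'},
    {'b', 'c'}, {'a', 'b', 'c', 'd'},
]
items = sorted({i for t in transactions for i in t})
minsup = 2

def support(itemset):
    """Number of transactions containing every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def dfs(prefix, start, frequent):
    """Depth-first lattice traversal: extend the prefix with later items,
    going deeper (more specific) while the candidate stays frequent,
    and backtracking as soon as an extension is infrequent."""
    for i in range(start, len(items)):
        candidate = prefix | {items[i]}
        if support(candidate) >= minsup:
            frequent.append(frozenset(candidate))
            dfs(candidate, i + 1, frequent)  # expand this branch further

frequent = []
dfs(set(), 0, frequent)
print(len(frequent))  # 11 frequent itemsets in this toy database
```

Because an infrequent candidate is never expanded, each branch is cut off exactly at the frequent itemset border, which is why this traversal suits max-frequent itemset mining.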

Page 5: Association Analysis (3)

The depth-first approach is often used by algorithms designed to find max-freq itemsets.

– It allows the frequent itemset border to be detected more quickly: once a max-freq itemset is found, substantial pruning can be performed on its subsets.

– E.g., if bcde is maximal frequent, then the algorithm does not have to visit the subtrees rooted at bd, be, c, d, and e, because they will not contain any max-freq itemsets.

Page 6: Association Analysis (3)

• Representation of Database
– Horizontal vs. vertical data layout
– APRIORI uses the horizontal layout
– Vertical: for each item, store a list of transaction ids (tids)
– ECLAT uses the vertical layout

Alternative Methods for Frequent Itemset Generation

Horizontal Data Layout:

TID  Items
1    A,B,E
2    B,C,D
3    C,E
4    A,C,D
5    A,B,C,D
6    A,E
7    A,B
8    A,B,C
9    A,C,D
10   B

Vertical Data Layout (TID-lists):

A: 1,4,5,6,7,8,9
B: 1,2,5,7,8,10
C: 2,3,4,5,8,9
D: 2,4,5,9
E: 1,3,6
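Converting the horizontal layout into the vertical one is a single pass over the database. A minimal sketch, using the ten transactions from this slide (the variable names `horizontal` and `vertical` are illustrative):

```python
# Horizontal layout from the slide: each transaction lists its items.
horizontal = {
    1: {'A', 'B', 'E'}, 2: {'B', 'C', 'D'}, 3: {'C', 'E'},
    4: {'A', 'C', 'D'}, 5: {'A', 'B', 'C', 'D'}, 6: {'A', 'E'},
    7: {'A', 'B'}, 8: {'A', 'B', 'C'}, 9: {'A', 'C', 'D'}, 10: {'B'},
}

# Vertical layout: item -> list of transaction ids (tid-list).
# Tids are appended in increasing order, so each list stays sorted.
vertical = {}
for tid, items in horizontal.items():
    for item in items:
        vertical.setdefault(item, []).append(tid)

print(vertical['A'])  # [1, 4, 5, 6, 7, 8, 9]
```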

Page 7: Association Analysis (3)

ECLAT

• Determine the support of any k-itemset by intersecting the TID-lists of two of its (k-1)-subsets.
• Advantage: very fast support counting.
• Disadvantage: intermediate TID-lists may become too large for memory.

Example: TID-list(A) = {1,4,5,6,7,8,9} and TID-list(B) = {1,2,5,7,8,10}, so TID-list(AB) = {1,5,7,8} and the support count of {A,B} is 4.
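The TID-list intersection above is exactly a set intersection. A minimal sketch with the slide's two lists (the `tidlist` name is illustrative):

```python
# Tid-lists from the slide, stored as sets for fast intersection.
tidlist = {
    'A': {1, 4, 5, 6, 7, 8, 9},
    'B': {1, 2, 5, 7, 8, 10},
}

# Support of {A,B} = size of the intersection of the two tid-lists.
ab = tidlist['A'] & tidlist['B']
print(sorted(ab))  # [1, 5, 7, 8]
print(len(ab))     # support count of {A,B} is 4
```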

Page 8: Association Analysis (3)

FP-Tree/FP-Growth Algorithm

• Use a compressed representation of the database: an FP-tree.
• Once the FP-tree has been constructed, a recursive divide-and-conquer approach is used to mine the frequent itemsets.

Building the FP-Tree

1. Scan the data to determine the support count of each item. Infrequent items are discarded, while the frequent items are sorted by decreasing support count.
2. Make a second pass over the data to construct the FP-tree. As the transactions are read, before being processed, their items are sorted according to the above order.
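The first pass and the per-transaction sorting can be sketched as follows, using the ten transactions of the upcoming example. The support threshold of 2 and the alphabetical tie-break (which reproduces the slide's B, A, C, D, E order) are assumptions for illustration.

```python
from collections import Counter

# Transaction database used on the following slides.
transactions = [
    {'A','B'}, {'B','C','D'}, {'A','C','D','E'}, {'A','D','E'},
    {'A','B','C'}, {'A','B','C','D'}, {'B','C'}, {'A','B','C'},
    {'A','B','D'}, {'B','C','E'},
]
minsup = 2  # assumed minimum support count

# Pass 1: support count of each item; keep the frequent ones,
# sorted by decreasing count (ties broken alphabetically here).
counts = Counter(i for t in transactions for i in t)
ranked = sorted((i for i in counts if counts[i] >= minsup),
                key=lambda i: (-counts[i], i))
order = {item: r for r, item in enumerate(ranked)}

# Pass 2 preprocessing: before a transaction is inserted into the
# FP-tree, its frequent items are sorted according to that order.
sorted_txns = [sorted((i for i in t if i in order), key=order.get)
               for t in transactions]
print(ranked)          # ['B', 'A', 'C', 'D', 'E']
print(sorted_txns[0])  # ['B', 'A']
```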

Page 9: Association Analysis (3)

First scan: determine the frequent 1-itemsets, then build the header table.

TID  Items
1    {A,B}
2    {B,C,D}
3    {A,C,D,E}
4    {A,D,E}
5    {A,B,C}
6    {A,B,C,D}
7    {B,C}
8    {A,B,C}
9    {A,B,D}
10   {B,C,E}

Item  Support count
B     8
A     7
C     7
D     5
E     3

Page 10: Association Analysis (3)

FP-tree construction

(The transaction table is the same as on the previous slide.)

After reading TID=1: null → B:1 → A:1

After reading TID=2: null → B:2, with B:2 → A:1 and B:2 → C:1 → D:1

Page 11: Association Analysis (3)

FP-Tree Construction

(Transaction database as on the previous slides.)

Header table:

Item  Pointer
B 8
A 7
C 7
D 5
E 3

[Figure: the complete FP-tree —
null → B:8 → A:5 → C:3 → D:1
       B:8 → A:5 → D:1
       B:8 → C:3 → D:1
       B:8 → C:3 → E:1
null → A:2 → C:1 → D:1 → E:1
       A:2 → D:1 → E:1]

Chain pointers help in quickly finding all the paths of the tree containing some given item.
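The node-plus-chain-pointer structure can be sketched minimally in Python. `FPNode` and `insert` are illustrative names, not from the slides; the transactions are the ten sorted transactions of this example, in the B, A, C, D, E order derived earlier.

```python
class FPNode:
    """One node of an FP-tree: item label, count, children, and a link
    to the next node carrying the same item (the header 'chain')."""
    def __init__(self, item, parent=None):
        self.item, self.count, self.parent = item, 0, parent
        self.children = {}
        self.next = None  # next node with the same item

def insert(root, sorted_items, header):
    """Insert one (already sorted) transaction into the tree."""
    node = root
    for item in sorted_items:
        child = node.children.get(item)
        if child is None:
            child = FPNode(item, parent=node)
            node.children[item] = child
            # prepend the new node to the item's chain in the header table
            child.next = header.get(item)
            header[item] = child
        child.count += 1
        node = child

# The slide's ten transactions, items sorted by decreasing support.
txns = [['B','A'], ['B','C','D'], ['A','C','D','E'], ['A','D','E'],
        ['B','A','C'], ['B','A','C','D'], ['B','C'], ['B','A','C'],
        ['B','A','D'], ['B','C','E']]
root, header = FPNode(None), {}
for t in txns:
    insert(root, t, header)
print(root.children['B'].count)  # 8, as in the slide's tree
```

Following `header[item]` and then the `.next` links visits every node of that item, which is exactly how all paths containing a given item are found quickly.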

Page 12: Association Analysis (3)

FP-Tree size

• The size of an FP-tree is typically smaller than the size of the uncompressed data because many transactions often share a few items in common.

• Best case scenario:
– All transactions have the same set of items, and the FP-tree contains only a single branch of nodes.

• Worst case scenario:
– Every transaction has a unique set of items.
– As none of the transactions have any items in common, the size of the FP-tree is effectively the same as the size of the original data.

• The size of an FP-tree also depends on how the items are ordered.
– If the ordering scheme in the preceding example is reversed, i.e., from lowest to highest support, the resulting FP-tree is probably denser (shown on the next slide).
– Not always, though: the ordering is just a heuristic.

Page 13: Association Analysis (3)

An FP tree representation for the data set with a different item ordering scheme.

Page 14: Association Analysis (3)

FP-Growth (I)

• FP-growth generates frequent itemsets from an FP-tree by exploring the tree in a bottom-up fashion.

• Given the example tree, the algorithm looks for frequent itemsets ending in E first, followed by D, C, A, and finally B.

• Since every transaction is mapped onto a path in the FP-tree, we can derive the frequent itemsets ending with a particular item, say E, by examining only the paths containing node E.

• These paths can be accessed rapidly using the pointers associated with node E.

Page 15: Association Analysis (3)

Paths containing node E

[Figure: left — only the paths of the FP-tree that contain node E:
null → B:3 → C:3 → E:1
null → A:2 → C:1 → D:1 → E:1
null → A:2 → D:1 → E:1
right — the complete FP-tree, for comparison]

Page 16: Association Analysis (3)

Conditional FP-Tree for E

• We now need to build a conditional FP-tree for E, which is the tree of itemsets ending in E.

• It is not the tree obtained on the previous slide as the result of deleting nodes from the original tree.

• Why? Because the order of the items changes.
– In this example, C now has a higher count than B.

Page 17: Association Analysis (3)

Conditional FP-Tree for E

[Figure: the set of paths containing E —
null → B:3 → C:3 → E:1
null → A:2 → C:1 → D:1 → E:1
null → A:2 → D:1 → E:1]

Insert each path (after truncating E) into a new tree.

The new header table:

Item  Pointer
C 4
B 3
A 2
D 2

[Figure: the conditional FP-tree for E —
null → C:3 → B:3
null → C:1 → A:1 → D:1
null → A:1 → D:1]

Adding up the counts for D we get 2, so {E,D} is a frequent itemset.

We continue recursively. Base of the recursion: when the tree has a single path only.

Page 18: Association Analysis (3)

FP-Tree: Another Example

Transactions:

A B C E F O
A C G
E I
A C D E G
A C E G L
E J
A B C E F P
A C D
A C E G M
A C E G N

Frequent 1-itemsets (support count ≥ 2):

A:8  C:8  E:8  G:5  B:2  D:2  F:2

Transactions with items sorted based on frequencies, ignoring the infrequent items:

A C E B F
A C G
E
A C E G D
A C E G
E
A C E B F
A C D
A C E G
A C E G

Page 19: Association Analysis (3)

FP-Tree after reading 1st transaction

(The header counts and the sorted transaction list from the previous slide are repeated on this and the following slides.)

[Tree: null → A:1 → C:1 → E:1 → B:1 → F:1]

Page 20: Association Analysis (3)

FP-Tree after reading 2nd transaction

[Tree: null → A:2 → C:2 → {E:1 → B:1 → F:1, G:1}]

Page 21: Association Analysis (3)

FP-Tree after reading 3rd transaction

[Tree: null → A:2 → C:2 → {E:1 → B:1 → F:1, G:1}; null → E:1]

Page 22: Association Analysis (3)

FP-Tree after reading 4th transaction

[Tree: null → A:3 → C:3 → {E:2 → {B:1 → F:1, G:1 → D:1}, G:1}; null → E:1]

Page 23: Association Analysis (3)

FP-Tree after reading 5th transaction

[Tree: null → A:4 → C:4 → {E:3 → {B:1 → F:1, G:2 → D:1}, G:1}; null → E:1]

Page 24: Association Analysis (3)

FP-Tree after reading 6th transaction

[Tree: null → A:4 → C:4 → {E:3 → {B:1 → F:1, G:2 → D:1}, G:1}; null → E:2]

Page 25: Association Analysis (3)

FP-Tree after reading 7th transaction

[Tree: null → A:5 → C:5 → {E:4 → {B:2 → F:2, G:2 → D:1}, G:1}; null → E:2]

Page 26: Association Analysis (3)

FP-Tree after reading 8th transaction

[Tree: null → A:6 → C:6 → {E:4 → {B:2 → F:2, G:2 → D:1}, G:1, D:1}; null → E:2]

Page 27: Association Analysis (3)

FP-Tree after reading 9th transaction

[Tree: null → A:7 → C:7 → {E:5 → {B:2 → F:2, G:3 → D:1}, G:1, D:1}; null → E:2]

Page 28: Association Analysis (3)

FP-Tree after reading 10th transaction

[Tree: null → A:8 → C:8 → {E:6 → {B:2 → F:2, G:4 → D:1}, G:1, D:1}; null → E:2]

Page 29: Association Analysis (3)

Conditional FP-Trees

Build the conditional FP-tree for each of the items. For this:

1. Find the paths containing the focus item. With those paths we build the conditional FP-tree for the item.

2. Read the tree again to determine the new counts of the items along those paths, and build a new header.

3. Insert the paths into the conditional FP-tree according to the new order.
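Steps 2 and 3 above can be sketched as follows. The prefix paths, their counts, and the support threshold here are hypothetical values for illustration, not the exact numbers of the running example; each path carries the count of the focus item at its end.

```python
from collections import Counter

# Step 1 output (assumed): prefix paths for a focus item, each paired
# with the count of the focus item at the end of that path.
paths = [(['B', 'C'], 1), (['A', 'C', 'D'], 1), (['A', 'D'], 1)]
minsup = 2  # assumed minimum support count

# Step 2: recount the items along those paths and build a new header,
# keeping only the still-frequent items, by decreasing count
# (ties broken alphabetically here).
counts = Counter()
for items, c in paths:
    for i in items:
        counts[i] += c
new_header = sorted((i for i in counts if counts[i] >= minsup),
                    key=lambda i: (-counts[i], i))

# Step 3: re-sort each path by the new order (dropping infrequent
# items); these reordered paths are inserted into the conditional tree.
reordered = [sorted((i for i in items if i in new_header),
                    key=new_header.index) for items, _ in paths]
print(new_header)  # ['A', 'C', 'D']
print(reordered)   # [['C'], ['A', 'C', 'D'], ['A', 'D']]
```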

Page 30: Association Analysis (3)

Conditional FP-Tree for F

There is only a single path containing F:

[Tree: null → A:8 → C:8 → E:6 → B:2 → F:2]

After updating the counts along this path and truncating F:

New header: A:2, C:2, E:2, B:2

[Tree: the conditional FP-tree for F — null → A:2 → C:2 → E:2 → B:2]

Page 31: Association Analysis (3)

Recursion

• We continue recursively on the conditional FP-tree for F.

• However, when the tree is just a single path, it is the base case for the recursion.

• So, we just produce all the subsets of the items on this path, merged with F:

{F}, {A,F}, {C,F}, {E,F}, {B,F}, {A,C,F}, …, {A,C,E,F}

[Tree: the conditional FP-tree for F — null → A:2 → C:2 → E:2 → B:2, with new header A:2, C:2, E:2, B:2]
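Enumerating every subset of a single path and merging each with the suffix item is a one-liner with `itertools.combinations`. A minimal sketch for the path of F's conditional tree:

```python
from itertools import combinations

# Single path of the conditional FP-tree for F (base case of the recursion).
path = ['A', 'C', 'E', 'B']

# Emit every subset of the path (including the empty one), each merged
# with the suffix item F: 2**4 = 16 itemsets, from {F} to {A,C,E,B,F}.
itemsets = [frozenset(c) | {'F'}
            for r in range(len(path) + 1)
            for c in combinations(path, r)]
print(len(itemsets))  # 16
```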

Page 32: Association Analysis (3)

Conditional FP-Tree for D

[Figure: the paths containing D —
null → A:8 → C:8 → E:6 → G:4 → D:1
null → A:8 → C:8 → D:1]

After updating the counts along these paths, the new header is A:2, C:2; the other items are removed as infrequent.

[Tree: the conditional FP-tree for D — null → A:2 → C:2]

The tree is just a single path; it is the base case for the recursion. So, we just produce all the subsets of the items on this path, merged with D:

{D}, {A,D}, {C,D}, {A,C,D}

Exercise: Complete the example.