Data Mining Techniques
CS 6220 - Section 3 - Fall 2016
Lecture 16: Association Rules
Jan-Willem van de Meent
(credit: Yijun Zhao, Yi Wang, Tan et al., Leskovec et al.)
Apriori: Summary
[Diagram: C1 -> (filter) -> L1 -> (construct) -> C2 -> (filter) -> L2 -> (construct) -> C3.
C1: all items (count the items); C2: all pairs of items from L1 (count the pairs);
C3: all pairs of sets that differ by 1 element.]
1. Set k = 0
2. Define C1 as all size-1 item sets
3. While Ck+1 is not empty:
4.   Set k = k + 1
5.   Scan DB to determine the subset Lk ⊆ Ck with support ≥ s
6.   Construct candidates Ck+1 by combining sets in Lk that differ by 1 element
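A minimal Python sketch of this loop (illustrative only; the function name and the set-based candidate construction are my own, and a production implementation would prune candidates and count far more carefully):

```python
def apriori(baskets, s):
    """Return all itemsets with support >= s; baskets is a list of sets."""
    # C1 = all size-1 itemsets; L1 = those with support >= s
    counts = {}
    for basket in baskets:
        for item in basket:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    L = {iset for iset, c in counts.items() if c >= s}
    frequent = set(L)
    k = 1
    while L:  # while C_{k+1} is not empty
        # construct C_{k+1}: unions of sets in L_k that differ by 1 element
        C = {a | b for a in L for b in L if len(a | b) == k + 1}
        counts = dict.fromkeys(C, 0)
        for basket in baskets:  # scan DB to count the candidates
            for cand in C:
                if cand <= basket:
                    counts[cand] += 1
        L = {iset for iset, c in counts.items() if c >= s}  # filter to L_{k+1}
        frequent |= L
        k += 1
    return frequent

# e.g. apriori([{'a','b'}, {'b','c'}, {'a','b','c'}], s=2)
```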
Apriori: Bottlenecks
[Diagram: the same pipeline as above, C1 -> (filter) -> L1 -> (construct) -> C2 -> (filter) -> L2 -> (construct) -> C3. Each filter step scans the DB (I/O limited), while counting the candidate pairs must happen in main memory (memory limited).]
(The following slides are adapted from J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org)
Apriori: Main-Memory Bottleneck
For many frequent-itemset algorithms, main memory is the critical resource.
- As we read baskets, we need to count something, e.g., occurrences of pairs of items.
- The number of different things we can count is limited by main memory.
- For typical market baskets and a reasonable support threshold (e.g., 1%), k = 2 requires the most memory.
- Swapping counts in/out is a disaster (why?)
Counting Pairs in Memory
Two approaches:
Approach 1: Count all pairs using a matrix.
Approach 2: Keep a table of triples [i, j, c] = "the count of the pair of items {i, j} is c."
- If integers and item ids are 4 bytes, we need approximately 12 bytes for each pair with count > 0, plus some additional overhead for the hash table.
Note: Approach 1 requires only 4 bytes per pair; Approach 2 uses 12 bytes per pair, but only for pairs with count > 0.
Comparing the 2 Approaches
[Figure: the triangular matrix uses 4 bytes per pair; triples use 12 bytes per occurring pair.]
Comparing the Two Approaches
Approach 1: Triangular matrix
- n = total number of items
- Count pair of items {i, j} only if i < j
- Keep pair counts in lexicographic order: {1,2}, {1,3}, ..., {1,n}, {2,3}, {2,4}, ..., {2,n}, {3,4}, ...
- Pair {i, j} is at position (i - 1)(n - i/2) + j - i
- Total number of pairs is n(n - 1)/2; total bytes ≈ 2n²
- The triangular matrix requires 4 bytes per pair
Approach 2 uses 12 bytes per occurring pair (but only for pairs with count > 0)
- Beats Approach 1 if fewer than 1/3 of the possible pairs actually occur
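A sketch of the two storage layouts (the helper name is my own; item ids are 1-based, as on the slide):

```python
import array

def triangular_index(i, j, n):
    # position of pair {i, j}, 1 <= i < j <= n, in lexicographic order;
    # this is the slide's formula (i-1)(n - i/2) + j - i, 1-based
    return int((i - 1) * (n - i / 2)) + j - i

# Approach 1: one 4-byte count for every possible pair
n = 1000
pair_counts = array.array('i', [0] * (n * (n - 1) // 2))
pair_counts[triangular_index(2, 3, n) - 1] += 1   # count pair {2, 3}

# Approach 2: a table of triples [i, j, c], only for occurring pairs
triples = {}                                       # (i, j) -> c, with i < j
triples[(2, 3)] = triples.get((2, 3), 0) + 1
```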
Main-Memory: Picture of Apriori
[Figure: main-memory layout. Pass 1 holds the item counts; Pass 2 holds the frequent items plus the counts of pairs of frequent items (the candidate pairs).]
PCY (Park-Chen-Yu) Algorithm
Observation: In pass 1 of Apriori, most memory is idle; we store only individual item counts. Can we reduce the number of candidates C2 (and therefore the memory required) in pass 2?

Pass 1 of PCY: In addition to item counts, maintain a hash table with as many buckets as fit in memory.
- Keep a count for each bucket into which pairs of items are hashed.
- For each bucket just keep the count, not the actual pairs that hash to the bucket!
PCY Algorithm – First Pass
FOR (each basket):
    FOR (each item in the basket):
        add 1 to item's count;
    FOR (each pair of items):            // new in PCY
        hash the pair to a bucket;
        add 1 to the count for that bucket;

A few things to note:
- Pairs of items need to be generated from the input file; they are not present in the file.
- We are not just interested in the presence of a pair, but whether it is present at least s (support) times.
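A minimal Python sketch of this pass (the bucket count and the use of Python's built-in `hash` are illustrative choices, not prescribed by the slides):

```python
from itertools import combinations

def pcy_pass1(baskets, n_buckets):
    item_counts = {}
    bucket_counts = [0] * n_buckets
    for basket in baskets:
        for item in basket:
            item_counts[item] = item_counts.get(item, 0) + 1
        # new in PCY: hash every pair in the basket to a bucket
        for pair in combinations(sorted(basket), 2):
            bucket_counts[hash(pair) % n_buckets] += 1
    return item_counts, bucket_counts
```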
Eliminating Candidates Using Buckets
Observation: If a bucket contains a frequent pair, then the bucket is surely frequent.
- However, even without any frequent pair, a bucket can still be frequent. So we cannot use the hash to eliminate any member (pair) of a "frequent" bucket.
- But for a bucket with total count less than s, none of its pairs can be frequent. Pairs that hash to such a bucket can be eliminated as candidates (even if the pair consists of 2 frequent items).
Pass 2: Only count pairs that hash to frequent buckets.
PCY Algorithm – Between Passes
Replace the buckets by a bit-vector:
- 1 means the bucket count reached the support threshold s (call it a frequent bucket); 0 means it did not.
- 4-byte integer counts are replaced by bits, so the bit-vector requires 1/32 of the memory.
Also, decide which items are frequent and list them for the second pass.
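A sketch of this between-pass step, continuing the hypothetical names above (a real implementation would pack the bits 32 per word; a Python list is used here only for clarity):

```python
def pcy_between_passes(item_counts, bucket_counts, s):
    bitmap = [1 if c >= s else 0 for c in bucket_counts]  # 1 = frequent bucket
    frequent_items = {i for i, c in item_counts.items() if c >= s}
    return bitmap, frequent_items
```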
PCY Algorithm – Pass 2
Count all pairs {i, j} that meet the conditions for being a candidate pair:
1. Both i and j are frequent items.
2. The pair {i, j} hashes to a bucket whose bit in the bit-vector is 1 (i.e., a frequent bucket).
Both conditions are necessary for the pair to have a chance of being frequent.
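A sketch of pass 2 under the same assumptions (the hash function and bucket count must match pass 1, or the bitmap lookup is meaningless):

```python
from itertools import combinations

def pcy_pass2(baskets, frequent_items, bitmap, n_buckets, s):
    pair_counts = {}
    for basket in baskets:
        items = sorted(i for i in basket if i in frequent_items)  # condition 1
        for pair in combinations(items, 2):
            if bitmap[hash(pair) % n_buckets]:                    # condition 2
                pair_counts[pair] = pair_counts.get(pair, 0) + 1
    return {p: c for p, c in pair_counts.items() if c >= s}
```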
PCY Algorithm – Summary
1. Set k = 0
2. Define C1 as all size-1 item sets
3. Scan DB to construct L1 ⊆ C1 and a hash table of pair counts   (new in PCY)
4. Convert pair counts to a bit-vector and construct candidates C2   (new in PCY)
5. While Ck+1 is not empty:
6.   Set k = k + 1
7.   Scan DB to determine the subset Lk ⊆ Ck with support ≥ s
8.   Construct candidates Ck+1 by combining sets in Lk that differ by 1 element
Main-Memory: Picture of PCY
[Figure: main-memory layout. Pass 1 holds the item counts and the hash table for pairs; Pass 2 holds the frequent items, the bitmap, and the counts of candidate pairs.]
Main-Memory Details
Buckets require a few bytes each:
- Note: we do not have to count past s.
- #buckets is O(main-memory size).
On the second pass, a table of (item, item, count) triples is essential (we cannot use the triangular-matrix approach; why?).
- Thus, the hash table must eliminate approximately 2/3 of the candidate pairs for PCY to beat Apriori.
Refinement: Multistage Algorithm
Limit the number of candidates to be counted.
- Remember: memory is the bottleneck.
- We still need to generate all the itemsets, but we only want to count/keep track of the ones that are frequent.
Key idea: After pass 1 of PCY, rehash (into a second hash table) only those pairs that qualify for pass 2 of PCY:
- i and j are frequent, and
- {i, j} hashes to a frequent bucket from pass 1.
On the middle pass, fewer pairs contribute to buckets, so there are fewer false positives. Requires 3 passes over the data.
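A sketch of the middle pass, continuing the hypothetical names above (the second hash function here is improvised by salting the pair; any hash independent of the first will do):

```python
from itertools import combinations

def multistage_middle_pass(baskets, frequent_items, bitmap1, n1, n2):
    bucket2_counts = [0] * n2
    for basket in baskets:
        items = sorted(i for i in basket if i in frequent_items)
        for pair in combinations(items, 2):
            if bitmap1[hash(pair) % n1]:      # pair qualified for PCY pass 2
                # rehash with a second, independent hash function
                bucket2_counts[hash(('h2',) + pair) % n2] += 1
    return bucket2_counts
```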
Main-Memory: Multistage PCY
[Figure: main-memory layout across three passes.
Pass 1: item counts + first hash table; count items and hash pairs {i,j} into Hash1.
Pass 2: frequent items + Bitmap 1 + second hash table; hash pairs {i,j} into Hash2 iff i and j are frequent and {i,j} hashes to a frequent bucket in B1.
Pass 3: frequent items + Bitmap 1 + Bitmap 2 + counts of candidate pairs; count pairs {i,j} iff i and j are frequent, {i,j} hashes to a frequent bucket in B1, and {i,j} hashes to a frequent bucket in B2.]
Apriori: Bottlenecks (recap)
Same pipeline as before: the DB scans are I/O limited, and counting the candidate pairs in main memory is memory limited.
FP-Growth Algorithm – Overview
- Apriori requires one pass for each k (and the PCY variants spend 2+ passes before counting pairs).
- Can we find all frequent item sets in fewer passes over the data?

FP-Growth Algorithm:
- Pass 1: Count items with support ≥ s.
- Sort the frequent items in descending order according to count.
- Pass 2: Store all frequent itemsets in a frequent-pattern tree (FP-tree).
- Mine patterns from the FP-tree.
FP-Tree Construction
Scan the data once to obtain all frequent items, in descending order according to their support counts.

TID | Items Bought | Frequent Items
1   | {a,b,f}      | {a,b}
2   | {b,g,c,d}    | {b,c,d}
3   | {h,a,c,d,e}  | {a,c,d,e}
4   | {a,d,p,e}    | {a,d,e}
5   | {a,b,c}      | {a,b,c}
6   | {a,b,q,c,d}  | {a,b,c,d}
7   | {a}          | {a}
8   | {a,m,b,c}    | {a,b,c}
9   | {a,b,n,d}    | {a,b,d}
10  | {b,c,e}      | {b,c,e}
[Figure 6.24 (Tan et al., Chapter 6, Association Analysis): Construction of an FP-tree, showing the transaction data set and the tree (i) after reading TID=1, (ii) after reading TID=2, (iii) after reading TID=3, and (iv) after reading TID=10.]

Figure 6.24 shows a data set that contains ten transactions and five items. The structures of the FP-tree after reading the first three transactions are also depicted in the diagram. Each node in the tree contains the label of an item along with a counter that shows the number of transactions mapped onto the given path. Initially, the FP-tree contains only the root node, represented by the null symbol. The FP-tree is subsequently extended in the following way:

1. The data set is scanned once to determine the support count of each item. Infrequent items are discarded, while the frequent items are sorted in decreasing support counts. For the data set shown in Figure 6.24, a is the most frequent item, followed by b, c, d, and e.
Item counts: a: 8, b: 7, c: 6, d: 5, e: 3, f: 1, g: 1, h: 1, m: 1, n: 1
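A minimal Python sketch of the two construction passes (the class layout is my own; a full implementation would also maintain the header table of node-links that FP-growth uses to locate all nodes for an item):

```python
class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}          # item -> FPNode

def build_fptree(baskets, s):
    counts = {}
    for basket in baskets:          # pass 1: support count of each item
        for item in basket:
            counts[item] = counts.get(item, 0) + 1
    support = {i: c for i, c in counts.items() if c >= s}  # drop infrequent
    root = FPNode(None, None)       # the null root
    for basket in baskets:          # pass 2: insert sorted frequent items
        node = root
        for item in sorted((i for i in basket if i in support),
                           key=lambda i: (-support[i], i)):
            if item not in node.children:
                node.children[item] = FPNode(item, node)
            node = node.children[item]
            node.count += 1         # transactions mapped onto this path
    return root
```

On the ten-transaction example above with s = 2, this reproduces the final tree of Figure 6.24(iv), e.g., the leftmost path null -> a:8 -> b:5.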
Mining Patterns from the FP-Tree

Step 1: Extract subtrees ending in each item.

[Figure 6.26 (Tan et al.): Decomposing the frequent itemset generation problem into multiple subproblems, where each subproblem involves finding frequent itemsets ending in e, d, c, b, and a. Panels: (a) paths containing node e, (b) paths containing node d, (c) paths containing node c, (d) paths containing node b, (e) paths containing node a.]

Table 6.6. The list of frequent itemsets ordered by their corresponding suffixes.
Suffix | Frequent Itemsets
e      | {e}, {d,e}, {a,d,e}, {c,e}, {a,e}
d      | {d}, {c,d}, {b,c,d}, {a,c,d}, {b,d}, {a,b,d}, {a,d}
c      | {c}, {b,c}, {a,b,c}, {a,c}
b      | {b}, {a,b}
a      | {a}

After finding the frequent itemsets ending in e, the algorithm proceeds to look for frequent itemsets ending in d by processing the paths associated with node d. The corresponding paths are shown in Figure 6.26(b). This process continues until all the paths associated with nodes c, b, and finally a, are processed. The paths for these items are shown in Figures 6.26(c), (d), and (e), while their corresponding frequent itemsets are summarized in Table 6.6.

FP-growth finds all the frequent itemsets ending with a particular suffix by employing a divide-and-conquer strategy to split the problem into smaller subproblems.
Mining Patterns from the FP-Tree

Step 2: Construct a conditional FP-tree for each item.
[Figure 6.27 (Tan et al.): Example of applying the FP-growth algorithm to find frequent itemsets ending in e. Panels: (a) prefix paths ending in e, (b) conditional FP-tree for e, (c) prefix paths ending in de, (d) conditional FP-tree for de, (e) prefix paths ending in ce, (f) prefix paths ending in ae.]

Suppose we are interested in finding all frequent itemsets ending in e. To do this, we must first check whether the itemset {e} itself is frequent. If it is frequent, we consider the subproblem of finding frequent itemsets ending in de, followed by ce, be, and ae. In turn, each of these subproblems is further decomposed into smaller subproblems. By merging the solutions obtained from the subproblems, all the frequent itemsets ending in e can be found. This divide-and-conquer approach is the key strategy employed by the FP-growth algorithm.

For a more concrete example of how to solve the subproblems, consider the task of finding frequent itemsets ending with e:
1. The first step is to gather all the paths containing node e. These initial paths are called prefix paths and are shown in Figure 6.27(a).
2. From the prefix paths shown in Figure 6.27(a), the support count for e is obtained by adding the support counts associated with node e. Assuming that the minimum support count is 2, {e} is declared a frequent itemset because its support count is 3.
Constructing the conditional FP-tree for e:
- Calculate counts for the paths ending in e
- Remove the leaf (e) nodes
- Prune nodes whose count is below the minimum support s

Conditional Pattern Base for e: acd: 1, ad: 1, bc: 1
Conditional Node Counts: a: 2, b: 1, c: 2, d: 2
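A sketch of gathering the conditional pattern base, reusing the hypothetical FPNode class above (a real FP-tree would follow the header table's node-links to the e nodes instead of traversing the whole tree):

```python
def conditional_pattern_base(root, item):
    base = []                        # list of (prefix itemset, count)
    stack = [root]
    while stack:                     # visit every node once
        node = stack.pop()
        stack.extend(node.children.values())
        if node.item == item:
            prefix, p = [], node.parent
            while p is not None and p.item is not None:  # walk up to root
                prefix.append(p.item)
                p = p.parent
            if prefix:
                base.append((frozenset(prefix), node.count))
    return base

# On the example tree, conditional_pattern_base(tree, 'e') yields
# ({a,c,d}, 1), ({a,d}, 1), ({b,c}, 1), i.e. acd: 1, ad: 1, bc: 1
```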
Mining Patterns from the FP-Tree

Step 3: Recursively mine the conditional FP-tree for each item.
[Figure: the recursion applied to the conditional FP-tree for e, extracting subtree de and its conditional FP-tree, followed by subtrees ce and ae; see Figure 6.27(c)-(f).]
Mining Patterns from the FP-Tree

Conditional Pattern Base for each suffix:

Suffix | Conditional Pattern Base
e      | acd:1; ad:1; bc:1
d      | abc:1; ab:1; ac:1; a:1; bc:1
c      | ab:3; a:1; b:2
b      | a:5
a      | ∅

Step 3: For each conditional pattern base, construct an FP-tree and mine the subtree recursively (a sketch follows).
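A recursive sketch tying the pieces together, reusing the hypothetical FPNode and conditional_pattern_base sketches above (insertion order inside a conditional tree does not affect the counts, so a simple sorted order is used):

```python
def fp_growth(tree, suffix, s, results):
    counts = {}
    stack = [tree]
    while stack:                         # support of each item in this tree
        node = stack.pop()
        for child in node.children.values():
            counts[child.item] = counts.get(child.item, 0) + child.count
            stack.append(child)
    for item, count in counts.items():
        if count < s:
            continue                     # prune infrequent items
        itemset = frozenset({item}) | suffix
        results[itemset] = count         # e.g. {d, e} with count 2
        # build the conditional FP-tree for `item` and recurse
        cond = FPNode(None, None)
        for prefix, c in conditional_pattern_base(tree, item):
            node = cond
            for p in sorted(prefix):
                if p not in node.children:
                    node.children[p] = FPNode(p, node)
                node = node.children[p]
                node.count += c
        fp_growth(cond, itemset, s, results)

# results = {}; fp_growth(build_fptree(baskets, 2), frozenset(), 2, results)
# recovers the suffix table below, e.g. {e}, {d,e}, {a,d,e}, {c,e}, {a,e}
```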
The list of all frequent itemsets, by suffix:

Suffix | Frequent Itemsets
e      | {e}, {d,e}, {a,d,e}, {c,e}, {a,e}
d      | {d}, {c,d}, {b,c,d}, {a,c,d}, {b,d}, {a,b,d}, {a,d}
c      | {c}, {b,c}, {a,b,c}, {a,c}
b      | {b}, {a,b}
a      | {a}
Projecting Sub-trees
- We cannot really "cut" or "prune" the FP-tree in place while listing patterns.
- Instead, "cutting" and "pruning" means creating a copy/mirror of each subtree when we construct the conditional FP-trees.
- Mining patterns therefore requires more memory than storing the FP-tree alone.
FP-Growth vs Apriori
[Figure: run-time comparison on simulated data with 10k baskets and 25 items per basket on average (from Han, Kamber & Pei, Chapter 6).]
FP-Growth vs Apriori
[Figure: benchmark comparison from http://singularities.com/blog/2015/08/apriori-vs-fpgrowth-for-frequent-item-set-mining]
FP-Growth vs Apriori
Advantages of FP-Growth:
- Only 2 passes over the dataset
- Stores a "compact" version of the dataset
- No candidate generation
- Faster than Apriori

Disadvantages of FP-Growth:
- The FP-tree may not be "compact" enough to fit in memory
- Even more memory is required to construct subtrees in the mining phase