Slide 1

Frequent Itemset Mining Methods

Slide 2

The Apriori algorithm: finding frequent itemsets using candidate generation.
• Seminal algorithm proposed by R. Agrawal and R. Srikant in 1994.
• Uses an iterative approach known as a level-wise search, where k-itemsets are used to explore (k+1)-itemsets.
• Apriori property, used to reduce the search space: all nonempty subsets of a frequent itemset must also be frequent. If P(I) < min_sup, then I is not frequent; then P(I ∪ A) < min_sup as well, so I ∪ A is not frequent either.
• Antimonotone property: if a set cannot pass a test, all of its supersets will fail the same test as well.

Slide 3

Using the Apriori property in the algorithm: let us look at how Lk-1 is used to find Lk, for k >= 2. Two steps (both are sketched in code after the slides):
• Join: to find Lk, a set of candidate k-itemsets Ck is generated by joining Lk-1 with itself. The items within a transaction or itemset are sorted in lexicographic order, so for a (k-1)-itemset li, li[1] < li[2] < ... < li[k-1]. Two members l1 and l2 of Lk-1 are joinable if (l1[1] = l2[1]) and ... and (l1[k-2] = l2[k-2]) and (l1[k-1] < l2[k-1]).
• Prune: by the Apriori property, any candidate in Ck that has a (k-1)-subset not in Lk-1 cannot be frequent and is removed.

Generating association rules from frequent itemsets: confidence(A => B) = P(B|A) = support_count(A ∪ B) / support_count(A), where support_count(A ∪ B) is the number of transactions containing the itemset A ∪ B, and support_count(A) is the number of transactions containing the itemset A.

Slide 8

For every nonempty subset s of l, output the rule s => (l - s) if support_count(l) / support_count(s) >= min_conf.
Example: let l = {I1, I2, I5}. The nonempty subsets are {I1, I2}, {I1, I5}, {I2, I5}, {I1}, {I2}, {I5}. Generating association rules (sketched in code after the slides):
• I1 and I2 => I5, confidence = 2/4 = 50%
• I1 and I5 => I2, confidence = 2/2 = 100%
• I2 and I5 => I1, confidence = 2/2 = 100%
• I1 => I2 and I5, confidence = 2/6 = 33%
• I2 => I1 and I5, confidence = 2/7 = 29%
• I5 => I1 and I2, confidence = 2/2 = 100%
If min_conf is 70%, then only the second, third, and last rules above are output.

Slide 9

Improving the efficiency of Apriori:
• Hash-based technique, to reduce the size of the candidate k-itemsets Ck, for k > 1: generate all of the 2-itemsets for each transaction and hash them into the buckets of a hash table structure, for example H(x, y) = ((order of x) × 10 + (order of y)) mod 7 (sketched in code after the slides).
• Transaction reduction: a transaction that does not contain any frequent k-itemsets cannot contain any frequent (k+1)-itemsets, so it can be skipped in later scans.
• Partitioning: partition the data to find candidate itemsets.
• Sampling: mine a subset of the given data, searching for frequent itemsets in a sample S instead of D, with a lowered support threshold.
• Dynamic itemset counting: add candidate itemsets at different points during a scan.

Slide 10

Mining frequent itemsets without candidate generation.
• The candidate generate-and-test method reduces the size of candidate sets and gives good performance, but it may need to generate a huge number of candidate sets, and it may need to repeatedly scan the database and check a large set of candidates by pattern matching.
• The frequent-pattern growth method (FP-growth) avoids candidate generation by using a frequent-pattern tree (FP-tree).

Slide 11

Example: (figure: construction of the FP-tree for the example transaction database)

Slide 12

I5 occurs in the branches (I2, I1, I5: 1) and (I2, I1, I3, I5: 1). With I5 as the suffix, the two prefix paths, which form its conditional pattern base, are (I2, I1: 1) and (I2, I1, I3: 1). The resulting conditional FP-tree is (I2: 2, I1: 2); I3 is removed because its support count of 1 is below the minimum support count of 2 (this step is sketched in code after the slides).
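
The join and prune steps from Slide 3 can be made concrete. Below is a minimal Python sketch, assuming itemsets are represented as lexicographically sorted tuples; the function name apriori_gen and the sample L2 are illustrative choices, not taken from the slides.

```python
from itertools import combinations

def apriori_gen(L_prev, k):
    """Generate candidate k-itemsets C_k from the frequent (k-1)-itemsets
    L_prev, each given as a lexicographically sorted tuple of items."""
    frequent = set(L_prev)
    L_sorted = sorted(L_prev)
    candidates = []
    for i in range(len(L_sorted)):
        for j in range(i + 1, len(L_sorted)):
            l1, l2 = L_sorted[i], L_sorted[j]
            # Join step: the first k-2 items agree and the last item of l1
            # precedes the last item of l2, so each candidate is built once.
            if l1[:k - 2] == l2[:k - 2] and l1[k - 2] < l2[k - 2]:
                c = l1 + (l2[k - 2],)
                # Prune step (Apriori property): every (k-1)-subset of a
                # frequent k-itemset must itself be frequent.
                if all(s in frequent for s in combinations(c, k - 1)):
                    candidates.append(c)
    return candidates

# Illustrative L2 (assumed input, not from the slides):
L2 = [('I1', 'I2'), ('I1', 'I3'), ('I1', 'I5'),
      ('I2', 'I3'), ('I2', 'I4'), ('I2', 'I5')]
print(apriori_gen(L2, 3))  # [('I1', 'I2', 'I3'), ('I1', 'I2', 'I5')]
```

The sorted-tuple representation is what makes the join condition cheap: comparing only the last items of otherwise-identical prefixes guarantees no candidate is generated twice.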
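Slide 8's rule-generation loop can be sketched the same way. The support counts below are inferred from the confidences on the slide (for example, confidence 2/4 for I1 and I2 => I5 implies support_count({I1, I2}) = 4); the name generate_rules is illustrative.

```python
from itertools import combinations

def generate_rules(l, support_count, min_conf):
    """For every nonempty proper subset s of the frequent itemset l,
    output s => (l - s) if support_count(l)/support_count(s) >= min_conf."""
    l = frozenset(l)
    rules = []
    for r in range(1, len(l)):
        for subset in combinations(sorted(l), r):
            s = frozenset(subset)
            conf = support_count[l] / support_count[s]
            if conf >= min_conf:
                rules.append((set(s), set(l - s), conf))
    return rules

# Support counts inferred from the confidences given on Slide 8.
sc = {frozenset({'I1', 'I2', 'I5'}): 2,
      frozenset({'I1', 'I2'}): 4, frozenset({'I1', 'I5'}): 2,
      frozenset({'I2', 'I5'}): 2,
      frozenset({'I1'}): 6, frozenset({'I2'}): 7, frozenset({'I5'}): 2}
for lhs, rhs, conf in generate_rules({'I1', 'I2', 'I5'}, sc, 0.70):
    print(lhs, '=>', rhs, f'({conf:.0%})')
# Prints the three 100%-confidence rules, matching the slide.
```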
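Next, a sketch of the hash-based technique from Slide 9, assuming the order of item Ik is simply k (consistent with the hash function given) and using made-up transactions. A 2-itemset can be frequent only if its bucket's count reaches min_sup, so low-count buckets let candidates be discarded from C2 before they are ever counted exactly.

```python
from itertools import combinations

def hash_bucket_counts(transactions, order, num_buckets=7):
    """Hash every 2-itemset of every transaction into a bucket and count
    bucket hits, using H(x, y) = ((order of x)*10 + (order of y)) mod 7."""
    counts = [0] * num_buckets
    for t in transactions:
        # Sort each transaction by item order so pairs come out as (x, y)
        # with order(x) < order(y), matching the H(x, y) definition.
        for x, y in combinations(sorted(t, key=order.get), 2):
            counts[(order[x] * 10 + order[y]) % num_buckets] += 1
    return counts

# Assumed toy data, for illustration only.
order = {f'I{k}': k for k in range(1, 6)}
transactions = [{'I1', 'I2', 'I5'}, {'I2', 'I4'}, {'I2', 'I3'},
                {'I1', 'I2', 'I4'}, {'I1', 'I3'}]
print(hash_bucket_counts(transactions, order))
```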
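Finally, the Slide 12 step that derives I5's conditional FP-tree from its conditional pattern base, assuming a minimum support count of 2 (an assumption consistent with I3 being dropped at count 1).

```python
from collections import Counter

def conditional_fp_items(prefix_paths, min_sup):
    """Sum item counts over a suffix item's conditional pattern base and
    keep only items whose total reaches min_sup; these items and their
    counts form the conditional FP-tree."""
    counts = Counter()
    for path, count in prefix_paths:
        for item in path:
            counts[item] += count
    return {item: c for item, c in counts.items() if c >= min_sup}

# I5's conditional pattern base from Slide 12; min_sup = 2 is assumed.
paths_I5 = [(('I2', 'I1'), 1), (('I2', 'I1', 'I3'), 1)]
print(conditional_fp_items(paths_I5, 2))  # {'I2': 2, 'I1': 2}; I3 dropped
```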