CS246
Association Rule Mining
Junghoo "John" Cho (UCLA Computer Science)
Association Rule Mining
What is the problem? What is an association rule?
Motivating Problem
If a customer buys “Diet Coke,” is she likely to buy a nutrition bar?
Useful for arranging store shelves, etc.
Beer and diapers
Life as a parent is tough…
Word of Caution
Famous example: J. B. Rhine at Duke
Tested students for “extrasensory perception”
Asked them to guess 10 cards – red or black
About 1/1000 of them guessed all 10 correctly (pure chance gives 2^-10 = 1/1024).
If a test is repeated many times, some unlikely events happen for purely statistical reasons
No physical validity
Problem Definition
Input: transaction records (each a set of items)
T1: Bread, Milk, Apple
T2: Beer, Chips
T3: Pants, Brush, Toothpaste, Chopstick
…
Output: all “association rules”
Bread, Milk → Apple
If a customer buys bread and milk, he is likely to buy an apple.
Confidence
Bread → Apple:
If a customer buys bread, he is likely to buy an apple.
What does likely mean?
A large fraction of baskets with bread also have apple.
Formally, P{ I1 | I2 , I3 } > c
c: confidence threshold, say 0.95
Probability of buying an item given other items
If a customer buys I2 , I3 , she is likely to buy I1 with 95% probability
“Strength” of the rule
Identify all association rules satisfying confidence threshold c
Support
Do we really want to find all association rules?
If we sell only 5 units of a particular product, who cares what it is sold with?
Find association rules only for sets of items that appear often enough.
Formally, P{ I1 , I2 , I3 } > s
s: support threshold
Fraction of records containing the itemset
Statistical “significance”
{ I1 , I2 , I3 }: frequent itemset
Find association rules for frequent itemsets
Problem Definition
Input: transaction records (set of items)
Output: All association rules
I2 , I3 → I1
with support: P{ I1 , I2 , I3 } > s
and confidence: P{ I1 | I2 , I3 } > c
Is the difference between confidence and support clear?
Basic Algorithm?
Step 1: Find all frequent itemsets
P{ I1 , I2 , I3 } > s
Step 2: From the frequent itemsets, identify high confidence rules
P{ I1 | I2 , I3 } > c
Step 1: Frequent Itemsets
Find all { I1 , I2 , I3 , …, Ik } with P{ I1 , I2 , I3 , …, Ik } > s: frequent itemsets
More informally, find all sets of items appearing in more than a threshold number of transactions
Is it really difficult? How can we solve it?
Naïve Approach
Keep counters for all subsets of items
Items {A, B, C}: counters {A}, {B}, {C}, {AB}, {BC}, {AC}, {ABC}
Scan all transaction records and increase counters
Transaction {A, C}: {A}++, {C}++, {AC}++
What is difficult?
Main Challenge?
Problem: 2^n subsets for n items
1000 items: 2^1000 ≈ 10^301
Clearly not feasible
Lesson: When data size is large, even a simple problem can be very difficult.
What was their main idea?
Main Idea of Apriori Algorithm
If (A, B, C) is a frequent itemset, (A, B) is a frequent itemset
P{A, B, C} ≤ P{A, B}
If (A, B) is not a frequent itemset, (A, B, C) cannot be a frequent itemset
Consider (A, B, C) only if all its subsets are frequent itemsets
Apriori Algorithm
1. L1 = { frequent 1-itemsets }, k = 1
2. Candidate set generation
Candidate set Ck: potentially frequent k-itemsets
{A, B, C} is a candidate set iff all its subsets {A, B}, {B, C} and {A, C} are frequent itemsets
Generate candidate set Ck+1 using Lk
3. Scanning
Check whether candidate sets are actually frequent
4. Increase k by 1, and go to step 2
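The four steps above can be sketched in Python as follows. This is a minimal illustration, not the paper's implementation: the function name `apriori`, the `min_count` parameter (support expressed as a transaction count, with "at least" rather than "more than"), and the toy transactions are assumptions made for the example.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {itemset: count} for every itemset found in at least min_count transactions."""
    transactions = [frozenset(t) for t in transactions]
    # Step 1: L1 = frequent 1-itemsets.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(current)
    k = 1
    while current:
        # Step 2: build C(k+1) from Lk; keep a candidate only if
        # every one of its k-item subsets is frequent.
        candidates = set()
        for a in current:
            for b in current:
                union = a | b
                if len(union) == k + 1 and all(
                    frozenset(sub) in current for sub in combinations(union, k)
                ):
                    candidates.add(union)
        # Step 3: scan the transactions to count each candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_count}
        frequent.update(current)
        # Step 4: move on to the next itemset size.
        k += 1
    return frequent

example = [{"A", "B"}, {"A", "D"}, {"A", "B", "C"}, {"B"}]
result = apriori(example, min_count=2)
```

With these toy transactions only {A}, {B}, and {A, B} survive, so no 3-itemset is ever counted.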
Example
Items: {A, B, C, D}
Transactions:
{A, B}
{A, D}
{A, B, C}
{B}
Support: 0.5 = 2 transactions
Example
Candidate 1-itemsets: A, B, C, D
Transactions: {A, B}, {A, D}, {A, B, C}, {B}
Candidate 2-itemset: {A, B}
Why Does Apriori Work?
Typical grocery-store scenario:
100,000 different items
10M baskets with 10 items each (10^8 items)
support = 0.01
Q: How many items can Apriori eliminate?
A: At most 1000 items remain (at most 1%)
An item should appear at least 0.01 × 10^7 = 10^5 times
10^8 items in total, so 10^8 / 10^5 = 1000 items
Basic Algorithm
Step 1: Find all frequent itemsets
P{ I1 , I2 , I3 } > s
Apriori algorithm
Step 2: From the frequent itemsets, identify high confidence rules
P{ I1 | I2 , I3 } > c
Step 2: High Confidence Rules
In principle, the second step is straightforward:
P{ I1 | I2 , I3 , …, Ik } = P{ I1 , I2 , I3 , …, Ik } / P{ I2 , I3 , …, Ik }
We already estimated the values of P{ I1 , I2 , I3 , …, Ik } in the first step
Piece of cake. Simple division!
More On Step 2
Q: But given a frequent k-itemset, how many potential rules?
A: About 2^k (any subset of the items can serve as the left-hand side)
Any efficient algorithm?
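A brute-force sketch of this step, assuming the support counts from Step 1 are already available in a dictionary. The name `rules_from_itemset`, the `min_conf` threshold, and the counts below are hypothetical and chosen only for illustration; the loop visits every non-empty proper subset as a left-hand side, which is where the roughly 2^k cost comes from.

```python
from itertools import combinations

def rules_from_itemset(itemset, support_count, min_conf):
    """Return (lhs, rhs, confidence) for each rule lhs -> rhs drawn from one frequent itemset."""
    items = frozenset(itemset)
    found = []
    # Every non-empty proper subset of the itemset is a possible left-hand side.
    for size in range(1, len(items)):
        for lhs in combinations(sorted(items), size):
            lhs = frozenset(lhs)
            # Confidence is a simple division of two stored support counts.
            conf = support_count[items] / support_count[lhs]
            if conf >= min_conf:
                found.append((lhs, items - lhs, conf))
    return found

# Hypothetical counts, as Step 1 might have produced them.
support_count = {frozenset("A"): 3, frozenset("B"): 3, frozenset("AB"): 2}
rules = rules_from_itemset("AB", support_count, min_conf=0.6)
```

Because every subset of a frequent itemset is itself frequent, each denominator is guaranteed to be in the dictionary.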
Questions (1)
Is support pruning valid?
What about Castillo de Ygay ($5000 wine) → Caviar?
Even if we only sell 100 items, significant profit…
Technically very challenging
Finding all association rules without support pruning
Topic of the next paper
Questions (2)
Is P{Beer|Diaper} > 0.95 really meaningful?
What if beer appears in 95% of baskets?
Interest: P{Beer, Diaper} / ( P{Beer} P{Diaper} )
Implication strength: Beer → Diaper == ~(Beer, ~Diaper)
P{~Diaper} P{Beer} / P{~Diaper, Beer}
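A small numeric sketch of the two measures. The probabilities are made up for illustration: beer in 95% of baskets, diapers in 10%, and the two bought independently. Under that assumption the confidence P{Beer|Diaper} is 0.95, yet both measures come out at 1, which signals no real association.

```python
def interest(p_xy, p_x, p_y):
    """P{X, Y} / (P{X} P{Y}); equals 1 when X and Y are independent."""
    return p_xy / (p_x * p_y)

def implication_strength(p_x, p_y, p_x_not_y):
    """For the rule X -> Y: P{X} P{~Y} / P{X, ~Y}."""
    return p_x * (1 - p_y) / p_x_not_y

# Hypothetical, independent purchases.
p_beer, p_diaper = 0.95, 0.10
p_both = p_beer * p_diaper             # P{Beer, Diaper}
p_beer_no_diaper = p_beer - p_both     # P{Beer, ~Diaper}

confidence = p_both / p_diaper         # P{Beer | Diaper}
lift = interest(p_both, p_beer, p_diaper)
strength = implication_strength(p_beer, p_diaper, p_beer_no_diaper)
```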
Follow-up Works
Candidate set generation still costly
Iceberg queries
No candidate set generation stage
Minimizing number of passes
Mining without Support Pruning
What is the Problem?
How can we identify “Castillo de Ygay → Caviar”?
Apriori is efficient only for frequent items
Problem definition
Data mining: Low support, high correlation
Finding rare, but very similar items
Matrix Representation
Typical scenario
100,000 items
10M baskets with 10 items each
Matrix
Columns = items
Rows = baskets
(i, j) = 1 if item cj is in basket ri
Very sparse: almost all 0’s (about 0.01% 1’s)
Matrix Example
a b c d e f g   basket
1 1 0 0 0 0 0   {a, b}
1 0 0 0 0 1 0   {a, f}
0 1 0 1 0 0 1   {b, d, g}
0 1 0 0 1 0 0   {b, e}
0 0 1 1 1 0 0   {c, d, e}
1 0 0 0 0 0 0   {a}
0 0 0 0 1 1 0   {e, f}
Association Rule and Similarity
Think of column Ci as the set of rows with 1
Association Rule (confidence)
P{ C2 | C1 } = |C1 ∩ C2| / |C1|
Similarity
Sim(C1, C2) = |C1 ∩ C2| / |C1 ∪ C2|
Example
C1 C2
0 1
1 1
1 0
1 1
1 0
0 0
Sim(C1, C2) = 2/5
P(C2|C1) = 2/4
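The two numbers in this example can be checked with a few lines of Python. This is only a sketch; the variable names are illustrative, and each column is treated as the set of row indices that hold a 1.

```python
# The two columns from the example, top row first.
c1 = [0, 1, 1, 1, 1, 0]
c2 = [1, 1, 0, 1, 0, 0]

# Treat each column as the set of rows containing a 1.
rows1 = {i for i, bit in enumerate(c1) if bit}
rows2 = {i for i, bit in enumerate(c2) if bit}

sim = len(rows1 & rows2) / len(rows1 | rows2)   # |C1 ∩ C2| / |C1 ∪ C2|
conf = len(rows1 & rows2) / len(rows1)          # |C1 ∩ C2| / |C1|
```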
Problem Definition
Find all highly similar pairs
All Ci, Cj with Sim(Ci, Cj) > s*
s*: Similarity threshold
Why Similarity (not Confidence)?
A1: Techniques work only for similarity
A2: High similarity implies high confidence
|C1 ∩ C2| / |C1 ∪ C2| ≤ |C1 ∩ C2| / |C1|
All similar pairs are of high confidence
Assumption
Matrix does not fit into main memory
Number of columns is relatively small
Can store some information in main memory for each item
Number of rows can be very big
Sparse data: mostly 0 in the matrix
Key Idea?
“Compress” the matrix into a smaller one
Load the compressed matrix into main memory
Find high similarity pairs from the compressed matrix
Much easier than disk-based computation
Min-Hash? LSH? Hamming?
What are they for?
Min-Hash?: compression
LSH?: similarity pair computation
Hamming LSH?: compression+similarity
How To Compress? (1)
“Hash” each column C to a small signature Sig(C) such that
Sim(C1, C2) is the same as the “similarity” of Sig(C1) and Sig(C2)
Sig(C) is small enough, so that we can store the “compressed” matrix in main memory
How To Compress? (2)
Idea 1
Pick 100 random rows
Sig(C1) = the 100 bits of the selected rows
Would it work?
Idea 1 does not work
Matrix is sparse
Most of the signatures will be “0000…0”
But the columns are different because 1’s are in different rows
Min-Hashing
Imagine rows are permuted randomly
“Hash” function h(C)
The number of the first row (in the permuted order) with 1 in column C
Example
Row C1 C2 C3
1   1  0  1
2   0  1  1
3   1  0  0
4   0  1  0
5   1  0  0
Permutation = (45123)
Min-Hash values:
S1 S2 S3
5  4  1
Important Property
The probability that h(C1) = h(C2) is the same as Sim(C1, C2)
Why?
Row Types
Given C1 and C2, rows can be classified as
C1 C2
a 1 1
b 1 0
c 0 1
d 0 0
a = # of rows of type a (likewise b, c, d)
Sim(C1, C2) = a / (a + b + c)
Q: What’s P{ h(C1) = h(C2) }?
A: a / (a + b + c)
Look down C1 and C2 until we see a 1
If it’s type a, then h(C1) = h(C2)
If it’s type b or c, not.
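The claim can be checked exhaustively on a small case. The sketch below uses two hypothetical 5-row columns, enumerates all 120 row orders, and compares the fraction of orders on which the Min-Hash values agree with a / (a + b + c). The helper `min_hash` is illustrative and returns a position in the permuted order.

```python
from fractions import Fraction
from itertools import permutations

# Two hypothetical columns: rows 0 and 3 are type a, rows 1 and 2 are type b/c, row 4 is type d.
c1 = [1, 0, 1, 1, 0]
c2 = [1, 1, 0, 1, 0]

def min_hash(column, order):
    """Position in `order` of the first row holding a 1 (None for an all-zero column)."""
    for position, row in enumerate(order):
        if column[row]:
            return position
    return None

# Try every possible row order instead of sampling random ones.
orders = list(permutations(range(len(c1))))
agree = sum(1 for order in orders if min_hash(c1, order) == min_hash(c2, order))
p_equal = Fraction(agree, len(orders))

a = sum(1 for x, y in zip(c1, c2) if x and y)        # type a rows
b_plus_c = sum(1 for x, y in zip(c1, c2) if x != y)  # type b and type c rows
jaccard = Fraction(a, a + b_plus_c)
```

Type d rows never decide the outcome, which is why they drop out of the ratio.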
Min-Hash Signature
Pick (say) 100 random permutations of the rows
Get Min-Hash values from each permutation
Sig(C) = the list of 100 Min-Hash values
Sim( Sig(C1), Sig(C2) ) = fraction of permutations for which the Min-Hash values agree
Example
Row C1 C2 C3
1   1  0  1
2   0  1  1
3   1  0  0
4   1  0  1
5   0  1  0
Signatures:
Permutation       S1 S2 S3
Perm1 = (12345)   1  2  1
Perm2 = (54321)   4  5  4
Perm3 = (34512)   3  5  4
Similarities:
        1-2  1-3   2-3
Matrix  0    0.5   0.25
Sig     0    0.67  0
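The signatures in this example can be reproduced with a short sketch. The matrix and the three permutations come from the slide; the helper names `min_hash` and `sig_sim` are illustrative.

```python
# Rows 1..5 of the example matrix, as (C1, C2, C3) bits.
matrix = {1: (1, 0, 1), 2: (0, 1, 1), 3: (1, 0, 0), 4: (1, 0, 1), 5: (0, 1, 0)}
perms = [(1, 2, 3, 4, 5), (5, 4, 3, 2, 1), (3, 4, 5, 1, 2)]

def min_hash(column, perm):
    """Number of the first row, in the permuted order, with a 1 in the given column."""
    for row in perm:
        if matrix[row][column]:
            return row
    return None

# One signature per column: its Min-Hash value under each permutation.
signatures = [[min_hash(c, p) for p in perms] for c in range(3)]

def sig_sim(a, b):
    """Fraction of permutations on which two signatures agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

sim_12 = sig_sim(signatures[0], signatures[1])
sim_13 = sig_sim(signatures[0], signatures[2])
sim_23 = sig_sim(signatures[1], signatures[2])
```

With only three permutations the estimates are coarse, which is why the signature row differs from the matrix row for pairs 1-3 and 2-3.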
Basic Idea
“Compress” the matrix into a smaller one
Min-Hash signature
Find high similarity pairs from the compressed matrix
How?
Problem
From the signature matrix (which fits into main memory), identify all similar pairs
Assuming 100,000 items
Potentially 10^10 similar pairs?
One counter per pair? No way
How?
Locality Sensitive Hashing
A technique to limit the number of similar pairs to consider
Approach
Using LSH, identify “candidate similar pairs”
Scan the Min-Hash signature matrix for verification
Locality Sensitive Hashing
Partition the signature matrix into l bands of r rows each
[Figure: signature matrix with columns C1 to C7 and rows h1 to h6, divided into l bands (band 1, band 2, …) of r rows each]
Locality Sensitive Hashing
Hash each column in each band into buckets
[Figure: signature matrix with columns C1 to C7 and rows h1 to h6; within each band, every column is hashed into a bucket]
Locality Sensitive Hashing
Two columns are candidate pair if they hash to the same bucket in any band
[Figure: signature matrix with columns C1 to C7 and rows h1 to h6; two columns that fall into the same bucket in a band are marked “Candidate pair!”]
Locality Sensitive Hashing
Final verification
After identifying candidates, verify each candidate pair (Ci, Cj) by examining Sig(Ci) and Sig(Cj) for similarity
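The banding and bucketing steps can be sketched as below. This is a simplified illustration: the function name `lsh_candidates`, the `rows_per_band` parameter, and the tiny signatures are assumptions, and each band's tuple of values is used directly as the bucket key instead of being hashed into a fixed number of buckets.

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidates(signatures, rows_per_band):
    """signatures: {column name: list of Min-Hash values}, all of the same length."""
    candidates = set()
    length = len(next(iter(signatures.values())))
    for start in range(0, length, rows_per_band):
        buckets = defaultdict(list)
        for name, sig in signatures.items():
            # Columns with identical values in this band share a bucket.
            buckets[tuple(sig[start:start + rows_per_band])].append(name)
        for names in buckets.values():
            # Every pair inside a bucket becomes a candidate pair.
            candidates.update(combinations(sorted(names), 2))
    return candidates

# Hypothetical signatures of 4 Min-Hash values, split into 2 bands of 2 rows.
sigs = {"C1": [1, 4, 3, 7], "C2": [2, 5, 3, 7], "C3": [1, 4, 6, 8]}
cands = lsh_candidates(sigs, rows_per_band=2)
```

Only the returned pairs then need the full signature comparison described in the verification step.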
Example
100,000 columns
Signature of 100 Min-Hash integers per column
Total signature table size
4 bytes x 100 x 100,000 = 40 MB (not bad)
Potential similar pairs
100000 x 100000 / 2 = 5,000,000,000 (too many!)
20 bands of 5 integers per band
Compute false positive and false negative rates
False Negative: 80% Similar
Probability C1, C2 identical in one band
0.8^5 = 0.328
Probability C1, C2 not identical in any of the 20 bands
(1 – 0.328)^20 = 0.00035
We miss only about 1/3000 of 80% similar column pairs!
Very few false negatives
False Positive: 40% Similar
Probability C1, C2 identical in one band
0.4^5 = 0.01
Probability C1, C2 identical in at least one of the 20 bands
1 – (1 – 0.01)^20 = 0.18
Only about 20% of dissimilar (40% similar) pairs are identified as candidate pairs
False positives much lower when similarities << 40%
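The arithmetic on the last two slides follows one formula: with bands of r rows each, a pair of similarity s becomes a candidate with probability 1 – (1 – s^r)^bands. A minimal sketch, where the function name `candidate_probability` is illustrative:

```python
def candidate_probability(sim, rows_per_band, bands):
    """Probability that two columns agree on every row of at least one band."""
    return 1 - (1 - sim ** rows_per_band) ** bands

# 20 bands of 5 rows, as in the example.
miss_80 = 1 - candidate_probability(0.8, 5, 20)   # false negative rate at 80% similarity
hit_40 = candidate_probability(0.4, 5, 20)        # false positive rate at 40% similarity
hit_20 = candidate_probability(0.2, 5, 20)        # drops sharply for less similar pairs
```

Without the intermediate rounding used on the slides, the values come out close to 0.00035 and slightly above 0.18.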
LSH Summary
Similar signature column pair identification algorithm
Split the signature matrix into l bands of r rows each
Identify almost all similar pairs and a small number of dissimilar pairs
By adjusting r and l
Hamming LSH
Life is simpler if the matrix has about 50% 1’s
We can take a random collection of rows
Let us make the matrix denser!
How?
Construct a series of matrices by OR-ing together pairs of rows
0’s disappear over time…
Example
0 0 0 1 0 0 1 0
OR pairs of rows: 0 1 0 1
OR pairs of rows: 1 1
OR pairs of rows: 1
More 1’s
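The OR-ing in this example can be sketched for a single column as follows. The helper `or_fold` is illustrative, and the loop assumes the number of rows is a power of two so that every level pairs up evenly.

```python
def or_fold(column):
    """OR together adjacent pairs of rows; assumes an even number of rows."""
    return [a | b for a, b in zip(column[0::2], column[1::2])]

# The column from the example, top row first.
col = [0, 0, 0, 1, 0, 0, 1, 0]

# Build the whole series of progressively denser columns.
levels = [col]
while len(levels[-1]) > 1:
    levels.append(or_fold(levels[-1]))

densities = [sum(level) / len(level) for level in levels]
```

The density of 1's climbs from 25% to 50% and then 100% as the column shrinks.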
Hamming LSH
Construct all matrices
No more than log n matrices for n rows
Total number of rows in all matrices is at most 2n
At most twice as much work as the original matrix
Identify similar columns from each matrix
From each matrix, apply LSH to the columns with density between 30% and 70% 1’s
Report similar columns
Note that similar columns have similar densities, so they will be considered together in at least one matrix
No point ever comparing columns whose numbers of 1’s are very different
Summary
Apriori, Min-Hash, LSH, Hamming LSH
Finding frequent pairs?
Apriori
Finding similar pairs?
Min-Hash+LSH or Hamming LSH
Min-Hash: Sparse matrix compression
LSH: Similar signature identification
Hamming LSH: Amplification of 1’s
Questions
Can we extend the techniques to multiple column rules C1, C2 → C3?
Any Questions?
AprioriTid (1)
Q: What was the main idea?
A: Some transactions may not need to be checked
Candidate itemsets: {A, B}, {A, C}
Transaction: {A, D, E, F}?
We may eliminate many transactions
Q: How do we know {A, D, E, F} is not necessary?
A: When we check {A, B} and {A, C} we can tell that {A, D, E, F} does not contain any candidate sets
AprioriTid (2)
In each pass,
Substitute each transaction with a set of candidate itemsets
Candidate set: {A, B, C}, {A, C, D}, {A, C, M}
Transaction T1: {A, B, C, D, F, G} → T1: {{A, B, C}, {A, C, D}}
Candidate itemset {A, C, D} appears in T1 if {A, C} and {A, D} appear in T1
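The substitution can be illustrated with a simplified sketch. Note the simplification: the hypothetical helper `substitute` tests each candidate directly against the raw transaction, whereas AprioriTid as described above derives membership from the itemsets kept for the transaction in the previous pass.

```python
def substitute(transaction, candidates):
    """Replace a transaction by the list of candidate itemsets it contains."""
    return [c for c in candidates if c <= transaction]

candidates = [frozenset("ABC"), frozenset("ACD"), frozenset("ACM")]
t1 = frozenset("ABCDFG")
t1_next = substitute(t1, candidates)

# A transaction containing no candidate becomes empty and can be dropped from later passes.
t2_next = substitute(frozenset("ADEF"), candidates)
```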
AprioriTid (3)
Q: Advantage?
A: Many transactions/items may be eliminated
Especially in later passes
Q: Disadvantage?
A: A transaction may be blown up
T1: {A, B, C, D} → T1: {{A, B, C}, {A, B, D}}
Why not just eliminate “infrequent items”?
AprioriHybrid
In earlier passes, use Apriori
In later passes, use AprioriTid
Switching criteria
Does the generated set of transactions fit in main memory?
Estimated size: sum of support(c) over all c ∈ Ck
History of the paper
Earlier SIGMOD93 paper (AIS Algorithm)
Very difficult to read. Poor organization
Did not use the “obvious” pruning criteria
Very naïve and simple heuristics
Techniques in the paper may not be very important
Much more efficient algorithms proposed next year
Even great research starts with small ideas
As you can see from the history
Learn how a “simple” idea can change things…