zijian zheng, ron kohavi, and llew...
TRANSCRIPT
© Copyright 2000-2001, Blue Martini Software. San Mateo California, USA 1
1
Zijian Zheng, Ron Kohavi, and Llew Mason{zijian,ronnyk,lmason}@bluemartini.com
Blue Martini SoftwareSan Mateo, California
August 11, 2001
© Copyright 2000-2001, Blue Martini Software. San Mateo California, USA 2
2Performance Improvement IrrelevantPerformance Improvement Irrelevant
0%
Ø Apriori sufficientØ <1,000,000 rulesØ < 10 minutes for all alg.
Ø ImpossibleØ Super exponential
growth in # RulesØ >1,000,000,000 rules
• Interesting• Range as narrow as 0.02%
ò Very narrow min-sup range of interestò Super exponential growth in number of
rules on real world data
ò Fast, but incorrect results
© Copyright 2000-2001, Blue Martini Software. San Mateo California, USA 3
3
ò Performance improvements on artificial data did not generalize to real world data
1.0 x0.8 x1.1 x1.2 xReal-POS
3.1 x12.1 x2.8 x1.9 x Artificial
AvgFP-growthCharmCloset
Improvement over Apriori:
ò Are algorithms overfitting the artificial data?
0102030405060
0 10 20 30 40Transaction size
Fre
qu
ence
(%) Artificial
Real-Web-1Real-Web-2Real-POS
© Copyright 2000-2001, Blue Martini Software. San Mateo California, USA 4
4
ò Association rule discovery: Ø Active research areaØ Good application potential
ò Many promising new algorithms ò Each new algorithm has significant
performance improvements Ø mainly based on results on IBM Almaden artificial
datasets
ò How do these algorithms perform on real-world data?
© Copyright 2000-2001, Blue Martini Software. San Mateo California, USA 5
5
5.01613,34077,512BMS-WebView-2
2.526749759,602BMS-WebView-1
6.51641,657515,597BMS-POS
10.129870100,000IBM-Artificial
AverageTransaction
Size
MaximumTransaction
Size
DistinctItemsTransactions
© Copyright 2000-2001, Blue Martini Software. San Mateo California, USA 6
6
IBM artificial dataset is very different from the real-world datasets
0
10
20
30
40
50
60
0 10 20 30 40
Transaction size
Fre
qu
ence
(%
)
IBM-Artificial
BMS-WebView-1
BMS-WebView-2
BMS-POS
© Copyright 2000-2001, Blue Martini Software. San Mateo California, USA 7
7
BMS-POSIBM-Artificial
10
100
1,000
10,000
100,000
1,000,000
10,000,000
100,000,000
1,000,000,000
0.00 0.20 0.40 0.60 0.80 1.00
Minimum support (%)
Count
#Frequent Itemsets
#Association Rules
#Closed Frequent Itemsets
10
100
1,000
10,000
100,000
1,000,000
10,000,000
100,000,000
1,000,000,000
0.00 0.20 0.40 0.60 0.80 1.00
Minimum support (%)
Count
#Frequent Itemsets
#Association Rules
#Closed Frequent Itemsets
© Copyright 2000-2001, Blue Martini Software. San Mateo California, USA 8
8
BMS-WebView-1 BMS-WebView-2
10
100
1,000
10,000
100,000
1,000,000
10,000,000
100,000,000
0.00 0.20 0.40 0.60 0.80 1.00
Minimum support (%)
Count
#Frequent Itemsets
#Association Rules
#Closed Frequent Itemsets
10
100
1,000
10,000
100,000
1,000,000
10,000,000
100,000,000
0.00 0.20 0.40 0.60 0.80 1.00Minimum support (%)
Count
#Frequent Itemsets
#Association Rules
#Closed Frequent Itemsets
© Copyright 2000-2001, Blue Martini Software. San Mateo California, USA 9
9
BMS-WebView-2
Ratio of #Freq itemsets over #Closed freq itemsetsRatio of #Freq itemsets over #Closed freq itemsets
123456789
10
0.00 0.02 0.04 0.06 0.08 0.10
Minimum support (%)R
atio
1
1.1
1.2
1.3
1.4
1.5
1.6
0.00 0.02 0.04 0.06 0.08 0.10
Minimum support (%)
Rat
io
BMS-POS
© Copyright 2000-2001, Blue Martini Software. San Mateo California, USA 10
10
ò Dual 550MHz Pentium III with 1 GB of memoryò Windows NT 4.0ò Measures: time (seconds) for generating
frequent itemsets/association rulesò Minimum support: 1.00%, 0.80%, 0.60%,
0.40%, 0.20%, 0.10%, 0.08%, 0.06%, 0.04%, 0.02%, and 0.01%
ò Minimum confidence: 0%
© Copyright 2000-2001, Blue Martini Software. San Mateo California, USA 11
11
ò Apriori: C. Borgelt’s implementationò Charm: M. Zakiò FP-growth: J. Han’s research groupò Closet: J. Han’s research groupò MagnumOpus : G. Webb
© Copyright 2000-2001, Blue Martini Software. San Mateo California, USA 12
12
High Min-Support Low Min-Support
IBM-Artificial Ap > FP > Ch > Cl FP > Ch > Cl > Ap
BMS-POS Ap > Cl > FP > Ch Ch > FP > Ap > Cl
BMS-WebView-1 Ap > FP > Cl > Ch Ch > FP > Ap > Cl
BMS-WebView-2 Ap > FP > Ch > Cl Ch > FP > Ap > Cl
Rankings (left is better) of the algorithms for generating frequent itemsets on the four datasets with high minimum supports and low minimum supports (Ap: Apriori, FP: FP-growth, Ch: Charm, Cl: Closet):
Ranking
© Copyright 2000-2001, Blue Martini Software. San Mateo California, USA 13
13
TIM
E (
Sec
on
ds)
Minimum Support (%)
1.0
10.0
100.0
1,000.0
0.00 0.02 0.04 0.06 0.08 0.10
CharmFP-growthAprioriCloset
© Copyright 2000-2001, Blue Martini Software. San Mateo California, USA 14
14
TIM
E (
Sec
on
ds)
Minimum Support (%)
1.0
10.0
100.0
1,000.0
0.00 0.20 0.40 0.60 0.80 1.00
Closet
CharmFP-growth
Apriori
© Copyright 2000-2001, Blue Martini Software. San Mateo California, USA 15
15
100.0
1,000.0
10,000.0
100,000.0
0.00 0.02 0.04 0.06 0.08 0.10
CharmFP-growthApriori
TIM
E (
Sec
on
ds)
Minimum Support (%)
© Copyright 2000-2001, Blue Martini Software. San Mateo California, USA 16
16
10.0
100.0
1,000.0
10,000.0
100,000.0
0.00 0.20 0.40 0.60 0.80 1.00
ClosetCharmFP-growthApriori
TIM
E (
Sec
on
ds)
Minimum Support (%)
© Copyright 2000-2001, Blue Martini Software. San Mateo California, USA 17
17(frequent itemsets)(frequent itemsets)
1.0
10.0
100.0
1,000.0
0.00 0.02 0.04 0.06 0.08 0.10
CharmFP-growthAprioriCloset
TIM
E (
Sec
on
ds)
Minimum Support (%)
© Copyright 2000-2001, Blue Martini Software. San Mateo California, USA 18
18(frequent itemsets)(frequent itemsets)
0.1
1.0
10.0
100.0
1,000.0
0.00 0.20 0.40 0.60 0.80 1.00
ClosetCharmFP-growthApriori
TIM
E (
Sec
on
ds)
Minimum Support (%)
© Copyright 2000-2001, Blue Martini Software. San Mateo California, USA 19
19(frequent itemsets)(frequent itemsets)
1.0
10.0
100.0
1,000.0
10,000.0
100,000.0
0.00 0.02 0.04 0.06 0.08 0.10
CharmFP-growthAprioriCloset
TIM
E (
Sec
on
ds)
Minimum Support (%)
© Copyright 2000-2001, Blue Martini Software. San Mateo California, USA 20
20(frequent itemsets)(frequent itemsets)
0.1
1.0
10.0
100.0
1,000.0
10,000.0
100,000.0
0.00 0.20 0.40 0.60 0.80 1.00
ClosetCharmFP-growthApriori
TIM
E (
Sec
on
ds)
Minimum Support (%)
© Copyright 2000-2001, Blue Martini Software. San Mateo California, USA 21
21
On some real-world datasets, when the minimum support is small, the number of frequent itemsets increases super-exponentially, thus no algorithm can handle it.E.g. BMS-WebView-1:
× 45
× 26
× 10
Frequent itemsetsMinimum support(%)
*: estimated
© Copyright 2000-2001, Blue Martini Software. San Mateo California, USA 22
22
10
100
1,000
10,000
0.00 0.02 0.04 0.06 0.08 0.10
AprioriMOMO-1000
TIM
E (
Sec
on
ds)
Minimum Support (%)
© Copyright 2000-2001, Blue Martini Software. San Mateo California, USA 23
23
1
10
100
1,000
10,000
0.00 0.20 0.40 0.60 0.80 1.00
AprioriMOMO-1000
TIM
E (
Sec
on
ds)
Minimum Support (%)
© Copyright 2000-2001, Blue Martini Software. San Mateo California, USA 24
24
10
100
1,000
10,000
100,000
1,000,000
0.00 0.02 0.04 0.06 0.08 0.10
AprioriMOMO-1000
TIM
E (
Sec
on
ds)
Minimum Support (%)
© Copyright 2000-2001, Blue Martini Software. San Mateo California, USA 25
25
10
100
1,000
10,000
100,000
1,000,000
0.00 0.20 0.40 0.60 0.80 1.00
AprioriMO
MO-1000
TIM
E (
Sec
on
ds)
Minimum Support (%)
© Copyright 2000-2001, Blue Martini Software. San Mateo California, USA 26
26(association rules)(association rules)
1
10
100
1,000
0.00 0.02 0.04 0.06 0.08 0.10
AprioriMOMO-1000
TIM
E (
Sec
on
ds)
Minimum Support (%)
© Copyright 2000-2001, Blue Martini Software. San Mateo California, USA 27
27(association rules)(association rules)
0
1
10
100
1,000
0.00 0.20 0.40 0.60 0.80 1.00
AprioriMOMO-1000
TIM
E (
Sec
on
ds)
Minimum Support (%)
© Copyright 2000-2001, Blue Martini Software. San Mateo California, USA 28
28(association rules)(association rules)
1
10
100
1,000
10,000
0.00 0.02 0.04 0.06 0.08 0.10
AprioriMOMO-1000
TIM
E (
Sec
on
ds)
Minimum Support (%)
© Copyright 2000-2001, Blue Martini Software. San Mateo California, USA 29
29(association rules)(association rules)
0
1
10
100
1,000
10,000
0.00 0.20 0.40 0.60 0.80 1.00
AprioriMOMO-1000
TIM
E (
Sec
on
ds)
Minimum Support (%)
© Copyright 2000-2001, Blue Martini Software. San Mateo California, USA 30
30
ò MO could be a solution when Ø the number of association rules is very largeØ only the top-N rules are needed (based on
some criteria such as lift or confidence)
© Copyright 2000-2001, Blue Martini Software. San Mateo California, USA 31
31
ò First objective evaluation and comparison of association rule algorithms on real-world e-commerce and retail datasets.
ò Donated one e-commerce datasets for use in the research community.
© Copyright 2000-2001, Blue Martini Software. San Mateo California, USA 32
32
ò Artificial datasets have very different characteristics from the real-world datasets.
ò On real world data:ØVery narrow min-sup range of interest ØSuper exponential growth in number of rules
ò Performance improvements on artificial data did not generalize to real world data
ò Optimizing algorithms for these artificial datasets can mislead research effort