zijian zheng, ron kohavi, and llew...

32
© Copyright 2000-2001, Blue Martini Software. San Mateo California, USA 1 1 Zijian Zheng, Ron Kohavi, and Llew Mason {zijian,ronnyk,lmason}@bluemartini.com Blue Martini Software San Mateo, California August 11, 2001

Upload: others

Post on 02-Nov-2020

7 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Zijian Zheng, Ron Kohavi, and Llew Masonrobotics.stanford.edu/users/ronnyk/realWorldAssocSlides.pdf · Title: D:\ronny\ai\my\assoc\realWorldAssoc.prn.pdf Author: ronnyk Created Date:

© Copyright 2000-2001, Blue Martini Software. San Mateo California, USA 1

1

Zijian Zheng, Ron Kohavi, and Llew Mason{zijian,ronnyk,lmason}@bluemartini.com

Blue Martini SoftwareSan Mateo, California

August 11, 2001

Page 2: Zijian Zheng, Ron Kohavi, and Llew Masonrobotics.stanford.edu/users/ronnyk/realWorldAssocSlides.pdf · Title: D:\ronny\ai\my\assoc\realWorldAssoc.prn.pdf Author: ronnyk Created Date:

© Copyright 2000-2001, Blue Martini Software. San Mateo California, USA 2

2Performance Improvement IrrelevantPerformance Improvement Irrelevant

0%

Ø Apriori sufficientØ <1,000,000 rulesØ < 10 minutes for all alg.

Ø ImpossibleØ Super exponential

growth in # RulesØ >1,000,000,000 rules

• Interesting• Range as narrow as 0.02%

ò Very narrow min-sup range of interestò Super exponential growth in number of

rules on real world data

ò Fast, but incorrect results

Page 3: Zijian Zheng, Ron Kohavi, and Llew Masonrobotics.stanford.edu/users/ronnyk/realWorldAssocSlides.pdf · Title: D:\ronny\ai\my\assoc\realWorldAssoc.prn.pdf Author: ronnyk Created Date:

© Copyright 2000-2001, Blue Martini Software. San Mateo California, USA 3

3

ò Performance improvements on artificial data did not generalize to real world data

1.0 x0.8 x1.1 x1.2 xReal-POS

3.1 x12.1 x2.8 x1.9 x Artificial

AvgFP-growthCharmCloset

Improvement over Apriori:

ò Are algorithms overfitting the artificial data?

0102030405060

0 10 20 30 40Transaction size

Fre

qu

ence

(%) Artificial

Real-Web-1Real-Web-2Real-POS

Page 4: Zijian Zheng, Ron Kohavi, and Llew Masonrobotics.stanford.edu/users/ronnyk/realWorldAssocSlides.pdf · Title: D:\ronny\ai\my\assoc\realWorldAssoc.prn.pdf Author: ronnyk Created Date:

© Copyright 2000-2001, Blue Martini Software. San Mateo California, USA 4

4

ò Association rule discovery: Ø Active research areaØ Good application potential

ò Many promising new algorithms ò Each new algorithm has significant

performance improvements Ø mainly based on results on IBM Almaden artificial

datasets

ò How do these algorithms perform on real-world data?

Page 5: Zijian Zheng, Ron Kohavi, and Llew Masonrobotics.stanford.edu/users/ronnyk/realWorldAssocSlides.pdf · Title: D:\ronny\ai\my\assoc\realWorldAssoc.prn.pdf Author: ronnyk Created Date:

© Copyright 2000-2001, Blue Martini Software. San Mateo California, USA 5

5

5.01613,34077,512BMS-WebView-2

2.526749759,602BMS-WebView-1

6.51641,657515,597BMS-POS

10.129870100,000IBM-Artificial

AverageTransaction

Size

MaximumTransaction

Size

DistinctItemsTransactions

Page 6: Zijian Zheng, Ron Kohavi, and Llew Masonrobotics.stanford.edu/users/ronnyk/realWorldAssocSlides.pdf · Title: D:\ronny\ai\my\assoc\realWorldAssoc.prn.pdf Author: ronnyk Created Date:

© Copyright 2000-2001, Blue Martini Software. San Mateo California, USA 6

6

IBM artificial dataset is very different from the real-world datasets

0

10

20

30

40

50

60

0 10 20 30 40

Transaction size

Fre

qu

ence

(%

)

IBM-Artificial

BMS-WebView-1

BMS-WebView-2

BMS-POS

Page 7: Zijian Zheng, Ron Kohavi, and Llew Masonrobotics.stanford.edu/users/ronnyk/realWorldAssocSlides.pdf · Title: D:\ronny\ai\my\assoc\realWorldAssoc.prn.pdf Author: ronnyk Created Date:

© Copyright 2000-2001, Blue Martini Software. San Mateo California, USA 7

7

BMS-POSIBM-Artificial

10

100

1,000

10,000

100,000

1,000,000

10,000,000

100,000,000

1,000,000,000

0.00 0.20 0.40 0.60 0.80 1.00

Minimum support (%)

Count

#Frequent Itemsets

#Association Rules

#Closed Frequent Itemsets

10

100

1,000

10,000

100,000

1,000,000

10,000,000

100,000,000

1,000,000,000

0.00 0.20 0.40 0.60 0.80 1.00

Minimum support (%)

Count

#Frequent Itemsets

#Association Rules

#Closed Frequent Itemsets

Page 8: Zijian Zheng, Ron Kohavi, and Llew Masonrobotics.stanford.edu/users/ronnyk/realWorldAssocSlides.pdf · Title: D:\ronny\ai\my\assoc\realWorldAssoc.prn.pdf Author: ronnyk Created Date:

© Copyright 2000-2001, Blue Martini Software. San Mateo California, USA 8

8

BMS-WebView-1 BMS-WebView-2

10

100

1,000

10,000

100,000

1,000,000

10,000,000

100,000,000

0.00 0.20 0.40 0.60 0.80 1.00

Minimum support (%)

Count

#Frequent Itemsets

#Association Rules

#Closed Frequent Itemsets

10

100

1,000

10,000

100,000

1,000,000

10,000,000

100,000,000

0.00 0.20 0.40 0.60 0.80 1.00Minimum support (%)

Count

#Frequent Itemsets

#Association Rules

#Closed Frequent Itemsets

Page 9: Zijian Zheng, Ron Kohavi, and Llew Masonrobotics.stanford.edu/users/ronnyk/realWorldAssocSlides.pdf · Title: D:\ronny\ai\my\assoc\realWorldAssoc.prn.pdf Author: ronnyk Created Date:

© Copyright 2000-2001, Blue Martini Software. San Mateo California, USA 9

9

BMS-WebView-2

Ratio of #Freq itemsets over #Closed freq itemsetsRatio of #Freq itemsets over #Closed freq itemsets

123456789

10

0.00 0.02 0.04 0.06 0.08 0.10

Minimum support (%)R

atio

1

1.1

1.2

1.3

1.4

1.5

1.6

0.00 0.02 0.04 0.06 0.08 0.10

Minimum support (%)

Rat

io

BMS-POS

Page 10: Zijian Zheng, Ron Kohavi, and Llew Masonrobotics.stanford.edu/users/ronnyk/realWorldAssocSlides.pdf · Title: D:\ronny\ai\my\assoc\realWorldAssoc.prn.pdf Author: ronnyk Created Date:

© Copyright 2000-2001, Blue Martini Software. San Mateo California, USA 10

10

ò Dual 550MHz Pentium III with 1 GB of memoryò Windows NT 4.0ò Measures: time (seconds) for generating

frequent itemsets/association rulesò Minimum support: 1.00%, 0.80%, 0.60%,

0.40%, 0.20%, 0.10%, 0.08%, 0.06%, 0.04%, 0.02%, and 0.01%

ò Minimum confidence: 0%

Page 11: Zijian Zheng, Ron Kohavi, and Llew Masonrobotics.stanford.edu/users/ronnyk/realWorldAssocSlides.pdf · Title: D:\ronny\ai\my\assoc\realWorldAssoc.prn.pdf Author: ronnyk Created Date:

© Copyright 2000-2001, Blue Martini Software. San Mateo California, USA 11

11

ò Apriori: C. Borgelt’s implementationò Charm: M. Zakiò FP-growth: J. Han’s research groupò Closet: J. Han’s research groupò MagnumOpus : G. Webb

Page 12: Zijian Zheng, Ron Kohavi, and Llew Masonrobotics.stanford.edu/users/ronnyk/realWorldAssocSlides.pdf · Title: D:\ronny\ai\my\assoc\realWorldAssoc.prn.pdf Author: ronnyk Created Date:

© Copyright 2000-2001, Blue Martini Software. San Mateo California, USA 12

12

High Min-Support Low Min-Support

IBM-Artificial Ap > FP > Ch > Cl FP > Ch > Cl > Ap

BMS-POS Ap > Cl > FP > Ch Ch > FP > Ap > Cl

BMS-WebView-1 Ap > FP > Cl > Ch Ch > FP > Ap > Cl

BMS-WebView-2 Ap > FP > Ch > Cl Ch > FP > Ap > Cl

Rankings (left is better) of the algorithms for generating frequent itemsets on the four datasets with high minimum supports and low minimum supports (Ap: Apriori, FP: FP-growth, Ch: Charm, Cl: Closet):

Ranking

Page 13: Zijian Zheng, Ron Kohavi, and Llew Masonrobotics.stanford.edu/users/ronnyk/realWorldAssocSlides.pdf · Title: D:\ronny\ai\my\assoc\realWorldAssoc.prn.pdf Author: ronnyk Created Date:

© Copyright 2000-2001, Blue Martini Software. San Mateo California, USA 13

13

TIM

E (

Sec

on

ds)

Minimum Support (%)

1.0

10.0

100.0

1,000.0

0.00 0.02 0.04 0.06 0.08 0.10

CharmFP-growthAprioriCloset

Page 14: Zijian Zheng, Ron Kohavi, and Llew Masonrobotics.stanford.edu/users/ronnyk/realWorldAssocSlides.pdf · Title: D:\ronny\ai\my\assoc\realWorldAssoc.prn.pdf Author: ronnyk Created Date:

© Copyright 2000-2001, Blue Martini Software. San Mateo California, USA 14

14

TIM

E (

Sec

on

ds)

Minimum Support (%)

1.0

10.0

100.0

1,000.0

0.00 0.20 0.40 0.60 0.80 1.00

Closet

CharmFP-growth

Apriori

Page 15: Zijian Zheng, Ron Kohavi, and Llew Masonrobotics.stanford.edu/users/ronnyk/realWorldAssocSlides.pdf · Title: D:\ronny\ai\my\assoc\realWorldAssoc.prn.pdf Author: ronnyk Created Date:

© Copyright 2000-2001, Blue Martini Software. San Mateo California, USA 15

15

100.0

1,000.0

10,000.0

100,000.0

0.00 0.02 0.04 0.06 0.08 0.10

CharmFP-growthApriori

TIM

E (

Sec

on

ds)

Minimum Support (%)

Page 16: Zijian Zheng, Ron Kohavi, and Llew Masonrobotics.stanford.edu/users/ronnyk/realWorldAssocSlides.pdf · Title: D:\ronny\ai\my\assoc\realWorldAssoc.prn.pdf Author: ronnyk Created Date:

© Copyright 2000-2001, Blue Martini Software. San Mateo California, USA 16

16

10.0

100.0

1,000.0

10,000.0

100,000.0

0.00 0.20 0.40 0.60 0.80 1.00

ClosetCharmFP-growthApriori

TIM

E (

Sec

on

ds)

Minimum Support (%)

Page 17: Zijian Zheng, Ron Kohavi, and Llew Masonrobotics.stanford.edu/users/ronnyk/realWorldAssocSlides.pdf · Title: D:\ronny\ai\my\assoc\realWorldAssoc.prn.pdf Author: ronnyk Created Date:

© Copyright 2000-2001, Blue Martini Software. San Mateo California, USA 17

17(frequent itemsets)(frequent itemsets)

1.0

10.0

100.0

1,000.0

0.00 0.02 0.04 0.06 0.08 0.10

CharmFP-growthAprioriCloset

TIM

E (

Sec

on

ds)

Minimum Support (%)

Page 18: Zijian Zheng, Ron Kohavi, and Llew Masonrobotics.stanford.edu/users/ronnyk/realWorldAssocSlides.pdf · Title: D:\ronny\ai\my\assoc\realWorldAssoc.prn.pdf Author: ronnyk Created Date:

© Copyright 2000-2001, Blue Martini Software. San Mateo California, USA 18

18(frequent itemsets)(frequent itemsets)

0.1

1.0

10.0

100.0

1,000.0

0.00 0.20 0.40 0.60 0.80 1.00

ClosetCharmFP-growthApriori

TIM

E (

Sec

on

ds)

Minimum Support (%)

Page 19: Zijian Zheng, Ron Kohavi, and Llew Masonrobotics.stanford.edu/users/ronnyk/realWorldAssocSlides.pdf · Title: D:\ronny\ai\my\assoc\realWorldAssoc.prn.pdf Author: ronnyk Created Date:

© Copyright 2000-2001, Blue Martini Software. San Mateo California, USA 19

19(frequent itemsets)(frequent itemsets)

1.0

10.0

100.0

1,000.0

10,000.0

100,000.0

0.00 0.02 0.04 0.06 0.08 0.10

CharmFP-growthAprioriCloset

TIM

E (

Sec

on

ds)

Minimum Support (%)

Page 20: Zijian Zheng, Ron Kohavi, and Llew Masonrobotics.stanford.edu/users/ronnyk/realWorldAssocSlides.pdf · Title: D:\ronny\ai\my\assoc\realWorldAssoc.prn.pdf Author: ronnyk Created Date:

© Copyright 2000-2001, Blue Martini Software. San Mateo California, USA 20

20(frequent itemsets)(frequent itemsets)

0.1

1.0

10.0

100.0

1,000.0

10,000.0

100,000.0

0.00 0.20 0.40 0.60 0.80 1.00

ClosetCharmFP-growthApriori

TIM

E (

Sec

on

ds)

Minimum Support (%)

Page 21: Zijian Zheng, Ron Kohavi, and Llew Masonrobotics.stanford.edu/users/ronnyk/realWorldAssocSlides.pdf · Title: D:\ronny\ai\my\assoc\realWorldAssoc.prn.pdf Author: ronnyk Created Date:

© Copyright 2000-2001, Blue Martini Software. San Mateo California, USA 21

21

On some real-world datasets, when the minimum support is small, the number of frequent itemsets increases super-exponentially, thus no algorithm can handle it.E.g. BMS-WebView-1:

× 45

× 26

× 10

Frequent itemsetsMinimum support(%)

*: estimated

Page 22: Zijian Zheng, Ron Kohavi, and Llew Masonrobotics.stanford.edu/users/ronnyk/realWorldAssocSlides.pdf · Title: D:\ronny\ai\my\assoc\realWorldAssoc.prn.pdf Author: ronnyk Created Date:

© Copyright 2000-2001, Blue Martini Software. San Mateo California, USA 22

22

10

100

1,000

10,000

0.00 0.02 0.04 0.06 0.08 0.10

AprioriMOMO-1000

TIM

E (

Sec

on

ds)

Minimum Support (%)

Page 23: Zijian Zheng, Ron Kohavi, and Llew Masonrobotics.stanford.edu/users/ronnyk/realWorldAssocSlides.pdf · Title: D:\ronny\ai\my\assoc\realWorldAssoc.prn.pdf Author: ronnyk Created Date:

© Copyright 2000-2001, Blue Martini Software. San Mateo California, USA 23

23

1

10

100

1,000

10,000

0.00 0.20 0.40 0.60 0.80 1.00

AprioriMOMO-1000

TIM

E (

Sec

on

ds)

Minimum Support (%)

Page 24: Zijian Zheng, Ron Kohavi, and Llew Masonrobotics.stanford.edu/users/ronnyk/realWorldAssocSlides.pdf · Title: D:\ronny\ai\my\assoc\realWorldAssoc.prn.pdf Author: ronnyk Created Date:

© Copyright 2000-2001, Blue Martini Software. San Mateo California, USA 24

24

10

100

1,000

10,000

100,000

1,000,000

0.00 0.02 0.04 0.06 0.08 0.10

AprioriMOMO-1000

TIM

E (

Sec

on

ds)

Minimum Support (%)

Page 25: Zijian Zheng, Ron Kohavi, and Llew Masonrobotics.stanford.edu/users/ronnyk/realWorldAssocSlides.pdf · Title: D:\ronny\ai\my\assoc\realWorldAssoc.prn.pdf Author: ronnyk Created Date:

© Copyright 2000-2001, Blue Martini Software. San Mateo California, USA 25

25

10

100

1,000

10,000

100,000

1,000,000

0.00 0.20 0.40 0.60 0.80 1.00

AprioriMO

MO-1000

TIM

E (

Sec

on

ds)

Minimum Support (%)

Page 26: Zijian Zheng, Ron Kohavi, and Llew Masonrobotics.stanford.edu/users/ronnyk/realWorldAssocSlides.pdf · Title: D:\ronny\ai\my\assoc\realWorldAssoc.prn.pdf Author: ronnyk Created Date:

© Copyright 2000-2001, Blue Martini Software. San Mateo California, USA 26

26(association rules)(association rules)

1

10

100

1,000

0.00 0.02 0.04 0.06 0.08 0.10

AprioriMOMO-1000

TIM

E (

Sec

on

ds)

Minimum Support (%)

Page 27: Zijian Zheng, Ron Kohavi, and Llew Masonrobotics.stanford.edu/users/ronnyk/realWorldAssocSlides.pdf · Title: D:\ronny\ai\my\assoc\realWorldAssoc.prn.pdf Author: ronnyk Created Date:

© Copyright 2000-2001, Blue Martini Software. San Mateo California, USA 27

27(association rules)(association rules)

0

1

10

100

1,000

0.00 0.20 0.40 0.60 0.80 1.00

AprioriMOMO-1000

TIM

E (

Sec

on

ds)

Minimum Support (%)

Page 28: Zijian Zheng, Ron Kohavi, and Llew Masonrobotics.stanford.edu/users/ronnyk/realWorldAssocSlides.pdf · Title: D:\ronny\ai\my\assoc\realWorldAssoc.prn.pdf Author: ronnyk Created Date:

© Copyright 2000-2001, Blue Martini Software. San Mateo California, USA 28

28(association rules)(association rules)

1

10

100

1,000

10,000

0.00 0.02 0.04 0.06 0.08 0.10

AprioriMOMO-1000

TIM

E (

Sec

on

ds)

Minimum Support (%)

Page 29: Zijian Zheng, Ron Kohavi, and Llew Masonrobotics.stanford.edu/users/ronnyk/realWorldAssocSlides.pdf · Title: D:\ronny\ai\my\assoc\realWorldAssoc.prn.pdf Author: ronnyk Created Date:

© Copyright 2000-2001, Blue Martini Software. San Mateo California, USA 29

29(association rules)(association rules)

0

1

10

100

1,000

10,000

0.00 0.20 0.40 0.60 0.80 1.00

AprioriMOMO-1000

TIM

E (

Sec

on

ds)

Minimum Support (%)

Page 30: Zijian Zheng, Ron Kohavi, and Llew Masonrobotics.stanford.edu/users/ronnyk/realWorldAssocSlides.pdf · Title: D:\ronny\ai\my\assoc\realWorldAssoc.prn.pdf Author: ronnyk Created Date:

© Copyright 2000-2001, Blue Martini Software. San Mateo California, USA 30

30

ò MO could be a solution when Ø the number of association rules is very largeØ only the top-N rules are needed (based on

some criteria such as lift or confidence)

Page 31: Zijian Zheng, Ron Kohavi, and Llew Masonrobotics.stanford.edu/users/ronnyk/realWorldAssocSlides.pdf · Title: D:\ronny\ai\my\assoc\realWorldAssoc.prn.pdf Author: ronnyk Created Date:

© Copyright 2000-2001, Blue Martini Software. San Mateo California, USA 31

31

ò First objective evaluation and comparison of association rule algorithms on real-world e-commerce and retail datasets.

ò Donated one e-commerce datasets for use in the research community.

Page 32: Zijian Zheng, Ron Kohavi, and Llew Masonrobotics.stanford.edu/users/ronnyk/realWorldAssocSlides.pdf · Title: D:\ronny\ai\my\assoc\realWorldAssoc.prn.pdf Author: ronnyk Created Date:

© Copyright 2000-2001, Blue Martini Software. San Mateo California, USA 32

32

ò Artificial datasets have very different characteristics from the real-world datasets.

ò On real world data:ØVery narrow min-sup range of interest ØSuper exponential growth in number of rules

ò Performance improvements on artificial data did not generalize to real world data

ò Optimizing algorithms for these artificial datasets can mislead research effort