
Page 1: Mining Frequent Item Sets by Opportunistic Projection

Mining Frequent Item Sets by Opportunistic Projection

Junqiang Liu 1,4, Yunhe Pan 1, Ke Wang 2, Jiawei Han 3

1 Institute of Artificial Intelligence, Zhejiang University, China
2 School of Computing Science, Simon Fraser University, Canada
3 Department of Computer Science, UIUC, USA
4 Dept. of CS, Hangzhou University of Commerce, China

Page 2: Mining Frequent Item Sets by Opportunistic Projection

2

Outline

How to discover frequent item sets

Previous works

Our approach: Mining Frequent Item Sets by Opportunistic Projection

Performance evaluations

Conclusions

Page 3: Mining Frequent Item Sets by Opportunistic Projection

3

What Are Frequent Item Sets

What is a frequent item set? A set of items, X, that occur together frequently in a database, i.e., support(X) ≥ a given threshold.

Example

tid items

01 a c d f g i m p

02 a b c f l m o

03 b f h j o

04 b c k p s

05 a c e f l m n p

Given support threshold 3, frequent item sets are as follows:

a:3, b:3, c:4, f:4, m:3, p:3,

ac:3, af:3, am:3, cf:3, cm:3, cp:3, fm:3,

acf:3, acm:3, afm:3, cfm:3,

acfm:3
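To make the example concrete, the sets above can be reproduced with a brute-force enumeration (illustration only, not the paper's algorithm):

```python
from itertools import combinations

# The example database and threshold from this slide.
transactions = {
    1: set("acdfgimp"),
    2: set("abcflmo"),
    3: set("bfhjo"),
    4: set("bckps"),
    5: set("aceflmnp"),
}
min_support = 3

items = sorted(set().union(*transactions.values()))
frequent = {}
for k in range(1, len(items) + 1):
    found_at_k = False
    for candidate in combinations(items, k):
        support = sum(1 for t in transactions.values() if set(candidate) <= t)
        if support >= min_support:
            frequent["".join(candidate)] = support
            found_at_k = True
    if not found_at_k:  # anti-monotonicity: no frequent k-set means no frequent (k+1)-set
        break

print(frequent)  # a:3, b:3, c:4, f:4, m:3, p:3, ac:3, ..., acfm:3 (18 sets)
```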

Page 4: Mining Frequent Item Sets by Opportunistic Projection

4

How To Discover Frequent Item Sets

Frequent item sets can be represented by a tree, which is not necessarily materialized.

Mining process: a process of tree construction, accompanied by a process of projecting transaction subsets.

[Figure: the frequent item sets of the example arranged as a tree.
Root: ( , )
Level 1: (a,3) (b,3) (c,4) (f,4) (m,3) (p,3)
Level 2: (c,3) (f,3) (m,3) (f,3) (m,3) (p,3) (m,3)
Level 3: (f,3) (m,3) (m,3) (m,3)
Level 4: (m,3)]

Page 5: Mining Frequent Item Sets by Opportunistic Projection

5

Frequent Item Set Tree - FIST

FIST is an ordered tree. Each node is a pair (item, weight), and two orderings are imposed: items are ordered on each path (top-down), and items are ordered among children (left to right).

A frequent item set is a path starting from the FIST root; its support is the ending node's weight.

PTS - projected transaction subset. Each FIST node has its own PTS, filtered or unfiltered: all transactions that support the frequent item set represented by the node.
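A minimal Python sketch of the node structure this slide describes; the field names and the pts placeholder are assumptions rather than the paper's code:

```python
from dataclasses import dataclass, field
from typing import Any, List

@dataclass
class FISTNode:
    item: str                                   # '' for the null root
    weight: int                                 # support of the item set on the root-to-node path
    children: List["FISTNode"] = field(default_factory=list)  # ordered left to right
    pts: Any = None                             # this node's PTS (TVLA or TTF), filtered or unfiltered

    def itemset(self, prefix: str = "") -> str:
        # the frequent item set represented by this node is the path from the root down to it
        return prefix + self.item
```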

Page 6: Mining Frequent Item Sets by Opportunistic Projection

6

Frequent Item Set Tree (example)

[Figure: the FIST of the example with each node's PTS attached. (i, w) denotes a FIST node; the attached transaction list is that node's PTS. The root's PTS is the original DB; for example, the PTS of node (a,3) is {01: c f m p; 02: b c f m; 05: c f m p}, and the PTS of node (c,3) under a is {01: f m; 02: f m; 05: f m}.]

Page 7: Mining Frequent Item Sets by Opportunistic Projection

7

Factors Relating to Mining Efficiency and Scalability

The FIST construction strategy: breadth first vs. depth first

The PTS representation: memory-based (array-based, tree-based, vertical bitmap, horizontal bitstring, etc.) or disk-based

PTS projecting method and item counting method

Page 8: Mining Frequent Item Sets by Opportunistic Projection

8

Previous Works

Research / Strategy / PTS representation / Projecting method / Remarks

Apriori: breadth first; original DB; projected on the fly. Remarks: repetitive DB scans; huge FIST for dense data; expensive pattern matching.

TreeProjection: breadth first; original DB; projected on the fly.

FPGrowth: depth first; FP-tree; recursively materializes conditional DBs/FP-trees. Remarks: the number of conditional FP-trees is of the same order of magnitude as the number of frequent item sets.

H-Mine: depth first; H-struct; partially materializes sub H-structs. Remarks: not the most efficient for sparse data; calls FP-Growth for dense data; partitions large data.

DepthProject: depth first; horizontal bitstring; selective projection. Remarks: mines maximal frequent item sets.

MAFIA: depth first; vertical bitmap; recursively materializes, with compressions. Remarks: mines maximal frequent item sets.

Remark on the bitstring/bitmap representations (DepthProject, MAFIA): less efficient than array-based for sparse and large data; less efficient than tree-based for dense data.

Page 9: Mining Frequent Item Sets by Opportunistic Projection

9

Our Approach: Mining Frequent Item Sets by Opportunistic Projection

Philosophy: the algorithm must adapt the construction strategy of the FIST, the representation of PTSs, and the methods of item counting in and projection of PTSs to the features of the PTSs.

Main points: mining sparse data by projecting array-based PTSs; intelligently projecting tree-based PTSs for dense data; heuristics for opportunistic projection.

Page 10: Mining Frequent Item Sets by Opportunistic Projection

10

Mining sparse data by projecting array-based PTS

TVLA – threaded varied-length arrays for a sparse PTS, consisting of a FIL (local frequent item list), LQs (linked queues), and arrays.

Each local frequent item has a FIL entry that consists of an item, a count, and a pointer.

Each transaction is stored in an array that is threaded to the FIL by an LQ according to its heading item in the imposed order.

[Figure: the filtered TVLA of the original DB in the example. FIL: a 3, b 3, c 4, f 4, m 3, p 3. Arrays: 01 acfmp, 02 abcfm, 05 acfmp, threaded by the LQ of entry a; 03 bf, 04 bcp, threaded by the LQ of entry b.]
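A minimal Python sketch of the TVLA structure described above; class and field names are assumptions, not the paper's implementation:

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class FILEntry:
    item: str
    count: int = 0
    queue: List[List[str]] = field(default_factory=list)  # LQ: arrays threaded to this entry

class TVLA:
    """Threaded varied-length arrays: one array per filtered transaction,
    threaded to the FIL entry of its heading item (in the imposed order)."""

    def __init__(self, transactions: List[List[str]], min_support: int, item_order: List[str]):
        counts: Dict[str, int] = {}
        for t in transactions:
            for i in t:
                counts[i] = counts.get(i, 0) + 1
        # FIL: one entry per local frequent item, kept in the imposed order
        self.fil: Dict[str, FILEntry] = {
            i: FILEntry(i, counts[i]) for i in item_order if counts.get(i, 0) >= min_support
        }
        # filter each transaction and thread it by its heading (first) item
        for t in transactions:
            arr = [i for i in item_order if i in t and i in self.fil]
            if arr:
                self.fil[arr[0]].queue.append(arr)
```

Building a TVLA over the five example transactions with the order a, b, c, f, m, p and threshold 3 threads transactions 01, 02, 05 to entry a and 03, 04 to entry b, as in the figure.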

Page 11: Mining Frequent Item Sets by Opportunistic Projection

11

How to project TVLA for PTS

Arrays (transactions) that support a node’s first child are threaded by the LQ attached to the first entry of FIL. (see previous figure)

TVLA for a child node’s PTS has its own FIL and LQ.

A child TVLA is unfiltered if it shares arrays with its parent, filtered otherwise.

[Figure: projecting the PTS of the root's first child, node a, from the parent TVLA of slide 10. The unfiltered child TVLA has its own FIL(a) (c 3, f 3, m 3) but shares the parent's arrays 01, 02, 05; the filtered child TVLA has its own FIL(a) and its own filtered arrays (01 cfm, 02 cfm, 05 cfm).]
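Reusing the TVLA sketch from slide 10, projecting a child node's PTS might look as follows; the helper name and the filtered/unfiltered switch are assumptions:

```python
def project_child_tvla(parent: "TVLA", child_item: str, min_support: int,
                       item_order: list, filtered: bool = True):
    """Project the PTS of the child node for `child_item` from a parent TVLA.
    The arrays threaded in the parent's LQ for `child_item` form the child's PTS."""
    suffixes = []
    for arr in parent.fil[child_item].queue:
        # only items after `child_item` in the imposed order remain relevant
        suffixes.append(arr[arr.index(child_item) + 1:])
    if filtered:
        # filtered: the child gets its own FIL/LQ and its own filtered copies of the arrays
        return TVLA(suffixes, min_support, item_order)
    # unfiltered: the child keeps its own FIL/LQ but shares the parent's arrays
    return suffixes
```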

Page 12: Mining Frequent Item Sets by Opportunistic Projection

12

How to project TVLA for PTS (cont.)

Get the next child's PTS by shifting the transactions threaded in the LQ currently being explored (the current child's PTS).

[Figure: shifting the TVLA of slide 10 from one child to the next. After child a is mined, each array in a's LQ is re-threaded to the FIL entry of its next item (02 joins b's LQ; 01 and 05 join c's LQ), and so on, until every LQ has been exhausted (NULL).]
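The shifting step can be sketched as below, again on the assumed TVLA structure from slide 10 (not the paper's code):

```python
def shift_to_next_child(parent: "TVLA", current_item: str):
    """After the subtree under `current_item` has been mined, re-thread each
    array in its LQ to the FIL entry of the array's next local frequent item,
    so that it becomes part of the next child's PTS."""
    queue, parent.fil[current_item].queue = parent.fil[current_item].queue, []
    for arr in queue:
        pos = arr.index(current_item)
        # the next item in this array that still has a FIL entry, if any
        nxt = next((i for i in arr[pos + 1:] if i in parent.fil), None)
        if nxt is not None:
            parent.fil[nxt].queue.append(arr)
```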

Page 13: Mining Frequent Item Sets by Opportunistic Projection

13

Intelligently projecting tree-based PTS for dense data

Tree-based Representation of dense PTS, inspired by FP-Growth

Novel projecting methods that differ completely from FP-Growth: bottom-up pseudo projection and top-down pseudo projection.

Page 14: Mining Frequent Item Sets by Opportunistic Projection

14

Tree-based Representation of dense PTS

TTF - threaded transaction forest, consisting of an IL (item list) and a forest. Each IL entry consists of an item, a count, and a pointer; each forest node is labeled by an item and associated with a weight.

Each local item in the PTS has an entry in the IL.

Each transaction in the PTS is one path starting from a root in the forest.

The count (weight) of a node is the number of transactions represented by the path ending at that node.

All nodes of the same item are threaded together from that item's IL entry.

A TTF is filtered if only local frequent items appear in it, and unfiltered otherwise.

[Figure: the filtered TTF of the original DB in the example. IL: a 3, b 3, c 4, f 4, m 3, p 3. Forest: root (a,3) with child (c,2) on the path c,2 - f,2 - m,2 - p,2 and child (b,1) on the path b,1 - c,1 - f,1 - m,1; root (b,2) with children (f,1) and (c,1) - (p,1).]
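A minimal Python sketch of the TTF structure; the class and field names are assumptions. Inserting the five example transactions in the order a, b, c, f, m, p reproduces the forest in the figure above.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class TTFNode:
    item: str
    count: int = 0                                    # transactions represented by the path ending here
    parent: Optional["TTFNode"] = None
    children: Dict[str, "TTFNode"] = field(default_factory=dict)
    next_same_item: Optional["TTFNode"] = None        # thread to the next node with the same item

class TTF:
    def __init__(self, item_order):
        self.roots: Dict[str, TTFNode] = {}           # forest roots, keyed by item
        # IL: item -> [count, head of the thread of nodes labeled by this item]
        self.il = {i: [0, None] for i in item_order}
        self.item_order = item_order

    def insert(self, transaction):
        level, node = self.roots, None
        for i in (x for x in self.item_order if x in transaction):
            if i not in level:
                new = TTFNode(i, parent=node)
                new.next_same_item = self.il[i][1]    # thread the new node into the IL
                self.il[i][1] = new
                level[i] = new
            node = level[i]
            node.count += 1
            self.il[i][0] += 1
            level = node.children
```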

Page 15: Mining Frequent Item Sets by Opportunistic Projection

15

Bottom up pseudo projection of TTF (example)

[Figure: bottom-up pseudo projection on the TTF of slide 14 — a sequence of item-list snapshots and the corresponding threaded node traversals giving each item's pseudo-projected PTS; no new trees are materialized.]

Page 16: Mining Frequent Item Sets by Opportunistic Projection

16

Top down pseudo projection of TTF (example)

[Figure: top-down pseudo projection on the TTF of slide 14 — a sequence of item-list snapshots and the corresponding threaded node traversals giving each item's pseudo-projected PTS; no new trees are materialized.]

Page 17: Mining Frequent Item Sets by Opportunistic Projection

17

Opportunistic Projection: Observations and Heuristics

Observation 1: the upper portion of a FIST can fit in memory; the number of transactions that support length-k item sets decreases sharply when k is greater than 2.

Heuristic 1: grow the upper portion of the FIST breadth first; grow the lower portion under level k depth first, whenever the reduced transaction set can be represented by a memory-based structure, either TVLA or TTF.

Page 18: Mining Frequent Item Sets by Opportunistic Projection

18

Opportunistic Projection: Observations and Heuristics(2)

Observation 2: TTF compresses well at lower levels or on denser branches, where there are fewer local frequent items in the PTSs and the relative support is larger. TTF is space-expensive relative to TVLA if its compression ratio is less than 6 - t/n (t: number of transactions, n: number of items in a PTS).

Heuristic 2: represent PTSs by TVLA at high levels of the FIST, unless the estimated compression ratio of TTF is sufficiently high.

Page 19: Mining Frequent Item Sets by Opportunistic Projection

19

Opportunistic Projection: Observations and Heuristics(3)

Observation 3: PTSs shrink very quickly at high levels or on sparse branches of the FIST, where filtered PTSs are usually in the form of TVLA. PTSs at lower levels or on dense branches shrink slowly, where PTSs are represented by TTF; the creation of a filtered TTF involves expensive pattern matching.

Heuristic 3: when projecting a parent TVLA, make a filtered copy for the child TVLA as long as there is free memory. When projecting a parent TTF, delimit the pseudo child TTF first, and then make a filtered copy only if it shrinks sufficiently sharply.

Page 20: Mining Frequent Item Sets by Opportunistic Projection

20

Algorithm OpportuneProject

OpportuneProject(Database: D)
begin
  create a null root for frequent item set tree T;
  D' = BreadthFirst(T, D);
  GuidedDepthFirst(root_of_T, D');
end
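For illustration, the sketch below is a plain depth-first, filtered array-projection miner; it is not OpportuneProject (it omits the breadth-first upper phase, the TTF representation, and the heuristic switching), but it shows the FIST-growth and PTS-projection process that BreadthFirst and GuidedDepthFirst drive.

```python
def mine_depth_first(transactions, min_support, prefix=""):
    """Depth-first mining with filtered array projection (simplified sketch)."""
    counts = {}
    for t in transactions:
        for i in t:
            counts[i] = counts.get(i, 0) + 1
    results = {}
    local_frequent = sorted(i for i, c in counts.items() if c >= min_support)
    for pos, item in enumerate(local_frequent):
        results[prefix + item] = counts[item]
        # filtered PTS of the child node: only items after `item` in the order survive
        tail = set(local_frequent[pos + 1:])
        pts = [[i for i in t if i in tail] for t in transactions if item in t]
        results.update(mine_depth_first(pts, min_support, prefix + item))
    return results

# Example: reproduces the 18 frequent item sets of slide 3
db = [list("acdfgimp"), list("abcflmo"), list("bfhjo"), list("bckps"), list("aceflmnp")]
print(mine_depth_first(db, 3))
```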

Page 21: Mining Frequent Item Sets by Opportunistic Projection

21

Performance Evaluation: Efficiency on BMS-POS (sparse)

Page 22: Mining Frequent Item Sets by Opportunistic Projection

22

Performance Evaluation: Efficiency on BMS-WebView1 (sparse)

Page 23: Mining Frequent Item Sets by Opportunistic Projection

23

Performance Evaluation: Efficiency on BMS-WebView2 (sparse)

Page 24: Mining Frequent Item Sets by Opportunistic Projection

24

Performance Evaluation: Efficiency on Connect4 (dense)

Page 25: Mining Frequent Item Sets by Opportunistic Projection

25

Performance Evaluation: Efficiency on T25I20D100kN20kL5k

Page 26: Mining Frequent Item Sets by Opportunistic Projection

26

Performance Evaluation: Scalability on T25I20D1mN20kL5k

Page 27: Mining Frequent Item Sets by Opportunistic Projection

27

Performance Evaluation: Scalability on T25I20D10mN20kL5k

Page 28: Mining Frequent Item Sets by Opportunistic Projection

28

Performance Evaluation: Scalability on T25I20D100k~15mN20kL5k

Page 29: Mining Frequent Item Sets by Opportunistic Projection

29

Conclusions

OpportuneProject maximizes efficiency and scalability for all data features by combining:

depth first with breadth first search strategies

array-based and tree-based representations for projected transaction subsets

unfiltered and filtered projections

Page 30: Mining Frequent Item Sets by Opportunistic Projection

30

Acknowledgement

We would like to thank Blue Martini Software, Inc. for providing us the BMS datasets!

Page 31: Mining Frequent Item Sets by Opportunistic Projection

31

References

[1] R. Agarwal, C. Aggarwal, and V. V. V. Prasad. A tree projection algorithm for generation of frequent itemsets. In Journal of Parallel and Distributed Computing (Special Issue on High Performance Data Mining), 2000.

[2] R. Agarwal, C. Aggarwal, and V. V. V. Prasad. Depth first generation of long patterns, in Proceedings of SIGKDD Conference, 2000.

[3] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In SIGMOD’93, Washington, D.C., May 1993.

[4] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In VLDB'94, pp. 487-499, Santiago, Chile, Sept. 1994.

[5] R. J. Bayardo. Efficiently mining long patterns from databases. In SIGMOD'98, pp. 85-93, Seattle, Washington, June 1998.

[6] D. Burdick, M. Calimlim, J. Gehrke. MAFIA: A maximal frequent itemset algorithm for transactional databases. In Proceedings of the 17th International Conference on Data Engineering, Heidelberg, Germany, April 2001.

[7] Sergey Brin, Rajeev Motwani, Jeffrey D. Ullman, Shalom Tsur. Dynamic Itemset Counting and Implication Rules for Market Basket Analysis. In SIGMOD’97, 255-264. Tucson, AZ, May 1997.

Page 32: Mining Frequent Item Sets by Opportunistic Projection

32

References (2)

[8] J. Han and Y. Fu. Discovery of multiple-level association rules from large databases. In VLDB'95, Zurich, Switzerland, Sept. 1995.

[9] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In SIGMOD’2000, Dallas, TX, May 2000.

[10] D-I. Lin and Z. M. Kedem. Pincer-search: A new algorithm for discovering the maximum frequent set. In 6th Intl. Conf. Extending Database Technology, March 1998.

[11] J. S. Park, M. S. Chen, and P. S. Yu. An effective hash based algorithm for mining association rules. In Proc. 1995 ACM-SIGMOD, 175-186, San Jose, CA, Feb. 1995.

[12] J. Pei, J. Han, H. Lu, S. Nishio, S. Tang, and D. Yang. H-Mine: Hyper-Structure Mining of Frequent Patterns in Large Databases. In Proc. 2001 Int. Conf. on Data Mining (ICDM'01), San Jose, CA, Nov. 2001.

[13] Ashok Savasere, Edward Omiecinski, and Shamkant Navathe. An efficient algorithm for mining association rules in large databases. In 21st Int'l Conf. on Very Large Databases (VLDB), Zurich, Switzerland, Sept. 1995.

Page 33: Mining Frequent Item Sets by Opportunistic Projection

33

References (3)

[14] H. Toivonen. Sampling large databases for association rules. In Proc. 1996 Int. Conf. Very Large Data Bases (VLDB'96), 134-145, Bombay, India, Sept. 1996.

[15] Zijian Zheng, Ron Kohavi and Llew Mason. Real World Performance of Association Rule Algorithms. In Proc. 2001 Int. Conf. on Knowledge Discovery in Databases (KDD'01), San Francisco, California, Aug. 2001.

[16] http://fuzzy.cs.uni-magdeburg.de/~borgelt/src/apriori.exe

[17] http://www.almaden.ibm.com/cs/quest/syndata.html

[18] http://www.ics.uci.edu/~mlearn/MLRepository.html

Page 34: Mining Frequent Item Sets by Opportunistic Projection

34

Thank you!!!