the university of north carolina at chapel hill association rule mining comp 790-90 seminar gnet 713...

30
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Association Rule Mining COMP 790-90 Seminar GNET 713 BCB Module Spring 2007

Upload: nigel-bailey

Post on 25-Dec-2015

215 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Association Rule Mining COMP 790-90 Seminar GNET 713 BCB Module Spring 2007

The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL

Association Rule Mining

COMP 790-90 Seminar

GNET 713 BCB Module

Spring 2007

Page 2: The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Association Rule Mining COMP 790-90 Seminar GNET 713 BCB Module Spring 2007

COMP 790-090 Data Mining: Concepts, Algorithms, and Applications

2

Approximate match

When you observe d1Spread count as

d1: 90%, d2: 5%, d3: 5%

Compatibility Matrix

Page 3: The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Association Rule Mining COMP 790-90 Seminar GNET 713 BCB Module Spring 2007

COMP 790-090 Data Mining: Concepts, Algorithms, and Applications

3

Match

The degree to which pattern P is retained/reflected in S

M(P,S) = P(P|S)= C(p,s) when when lS=lP

M(P,S) = max over all possible when lS>lP

ExampleP S M

d1d1 d1d3 0.9*0

d1d2 d1d2 0.9*0.8

d1d2 d1d3 0.9*0.05

d1d2 d2d3 0.1*0.05

d1d2 d1d2d3 0.9*0.8

Page 4: The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Association Rule Mining COMP 790-90 Seminar GNET 713 BCB Module Spring 2007

COMP 790-090 Data Mining: Concepts, Algorithms, and Applications

4

Calculate Max over all

Dynamic ProgrammingM(p1p2..pi, s1s2…sj)= Max of

M(p1p2..pi-1, s1s2…sj-1) * C(pi,sj)

M(p1p2..pi, s1s2…sj-1)

O(lP*lS)

When compatibility Matrix is sparse O(lS)

Page 5: The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Association Rule Mining COMP 790-90 Seminar GNET 713 BCB Module Spring 2007

COMP 790-090 Data Mining: Concepts, Algorithms, and Applications

5

Match in D

Average over all sequences in D

Page 6: The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Association Rule Mining COMP 790-90 Seminar GNET 713 BCB Module Spring 2007

COMP 790-090 Data Mining: Concepts, Algorithms, and Applications

6

Spread of match

If compatibility matrix is identity matrixMatch = support

Page 7: The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Association Rule Mining COMP 790-90 Seminar GNET 713 BCB Module Spring 2007

COMP 790-090 Data Mining: Concepts, Algorithms, and Applications

7

Anti-Monotone

The match of a pattern P in a symbol sequence S is less than or equal to the match of any subpattern of P in SThe match of a pattern P in a sequence database D is less than or equal to the match of any subpattern of P in DCan use any support based algorithm

More patterns match so require efficient solutionSample based algorithms

Border collapsing of ambiguous patterns

Page 8: The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Association Rule Mining COMP 790-90 Seminar GNET 713 BCB Module Spring 2007

COMP 790-090 Data Mining: Concepts, Algorithms, and Applications

8

Chernoff Bound

Given sample size=n, range R,with probability 1-true value: = sqrt([R2ln(1/)]/2n)

Distribution freeMore conservativeSample size : fit in memoryRestricted spread :

For pattern P= p1p2..pL

R=min (match[pi]) for all 1 i L

Frequent Patterns

min_match +

min_match -

Infrequent patterns

Page 9: The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Association Rule Mining COMP 790-90 Seminar GNET 713 BCB Module Spring 2007

COMP 790-090 Data Mining: Concepts, Algorithms, and Applications

9

Algorithm

Scan DB: O(N*min (Ls*m, Ls+m2))

Find the match of each individual symbol

Take a random sample of sequences

Identify borders that embrace the set of ambiguous patterns O(mLp * |S| * Lp * n)

Min_match existing methods for association rule mining

Locate the border of frequent patterns in the entire DB

via border collapsing

Page 10: The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Association Rule Mining COMP 790-90 Seminar GNET 713 BCB Module Spring 2007

COMP 790-090 Data Mining: Concepts, Algorithms, and Applications

10

Border Collapsing

If memory can not hold the counters for all ambiguous countersProbe-and-collapse : binary searchProbe patterns with highest collapsing power until memory is filledIf memory can hold all patterns up to the 1/x layer

the space of of ambiguous patterns can be narrowed to at least 1/x of the original onewhere x is a power of 2

If it takes a level-wise search y scans of the DB, only O(logxy) scans are necessary when the border collapsing technique is employed

Page 11: The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Association Rule Mining COMP 790-90 Seminar GNET 713 BCB Module Spring 2007
Page 12: The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Association Rule Mining COMP 790-90 Seminar GNET 713 BCB Module Spring 2007

COMP 790-090 Data Mining: Concepts, Algorithms, and Applications

12

Periodic Pattern

Full periodic patternABC ABC ABC

Partial periodic patternABC ADC ACC ABC

Pattern hierarchyABC ABC ABC DE DE DE DE ABC ABC ABC DE DE DE DE ABC ABC ABC DE DE DE DE

Page 13: The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Association Rule Mining COMP 790-90 Seminar GNET 713 BCB Module Spring 2007

COMP 790-090 Data Mining: Concepts, Algorithms, and Applications

13

Periodic Pattern

Recent AchievementsPartial Periodic Pattern

Asynchronous Periodic Pattern

Meta Pattern

InfoMiner/InfoMiner+/STAMP

Page 14: The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Association Rule Mining COMP 790-90 Seminar GNET 713 BCB Module Spring 2007

COMP 790-090 Data Mining: Concepts, Algorithms, and Applications

14

Clustering Sequential Data

CLUSEQ

ApproxMAP

Page 15: The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Association Rule Mining COMP 790-90 Seminar GNET 713 BCB Module Spring 2007

COMP 790-090 Data Mining: Concepts, Algorithms, and Applications

15

References: Frequent-pattern Mining Methods

R. Agarwal, C. Aggarwal, and V. V. V. Prasad. A tree projection algorithm for generation of frequent itemsets. In Journal of Parallel and Distributed Computing (Special Issue on High Performance Data Mining), 2000.R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. SIGMOD'93, 207-216, Washington, D.C.R. Agrawal and R. Srikant. Fast algorithms for mining association rules. VLDB'94 487-499, Santiago, Chile.J. Han, J. Pei, and Y. Yin: “Mining frequent patterns without candidate generation”. In Proc. ACM-SIGMOD’2000, pp. 1-12, Dallas, TX, May 2000. H. Mannila, H. Toivonen, and A. I. Verkamo. Efficient algorithms for discovering association rules. KDD'94, 181-192, Seattle, WA, July 1994.

Page 16: The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Association Rule Mining COMP 790-90 Seminar GNET 713 BCB Module Spring 2007

COMP 790-090 Data Mining: Concepts, Algorithms, and Applications

16

References: Frequent-pattern Mining Methods

A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. VLDB'95, 432-443, Zurich, Switzerland.C. Silverstein, S. Brin, R. Motwani, and J. Ullman. Scalable techniques for mining causal structures. VLDB'98, 594-605, New York, NY.R. Srikant and R. Agrawal. Mining generalized association rules. VLDB'95, 407-419, Zurich, Switzerland, Sept. 1995.R. Srikant and R. Agrawal. Mining quantitative association rules in large relational tables. SIGMOD'96, 1-12, Montreal, Canada.H. Toivonen. Sampling large databases for association rules. VLDB'96, 134-145, Bombay, India, Sept. 1996.M.J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. New algorithms for fast discovery of association rules. KDD’97. August 1997.

Page 17: The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Association Rule Mining COMP 790-90 Seminar GNET 713 BCB Module Spring 2007

COMP 790-090 Data Mining: Concepts, Algorithms, and Applications

17

References: Performance Improvements

S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket analysis. SIGMOD'97, Tucson, Arizona, May 1997.D.W. Cheung, J. Han, V. Ng, and C.Y. Wong. Maintenance of discovered association rules in large databases: An incremental updating technique. ICDE'96, New Orleans, LA.T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Data mining using two-dimensional optimized association rules: Scheme, algorithms, and visualization. SIGMOD'96, Montreal, Canada.E.-H. Han, G. Karypis, and V. Kumar. Scalable parallel data mining for association rules. SIGMOD'97, Tucson, Arizona.J.S. Park, M.S. Chen, and P.S. Yu. An effective hash-based algorithm for mining association rules. SIGMOD'95, San Jose, CA, May 1995.

Page 18: The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Association Rule Mining COMP 790-90 Seminar GNET 713 BCB Module Spring 2007

COMP 790-090 Data Mining: Concepts, Algorithms, and Applications

18

References: Performance Improvements

G. Piatetsky-Shapiro. Discovery, analysis, and presentation of strong rules. In G. Piatetsky-Shapiro and W. J. Frawley, Knowledge Discovery in Databases,. AAAI/MIT Press, 1991.J.S. Park, M.S. Chen, and P.S. Yu. An effective hash-based algorithm for mining association rules. SIGMOD'95, San Jose, CA.S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: Alternatives and implications. SIGMOD'98, Seattle, WA.K. Yoda, T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Computing optimized rectilinear regions for association rules. KDD'97, Newport Beach, CA, Aug. 1997.M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. Parallel algorithm for discovery of association rules. Data Mining and Knowledge Discovery, 1:343-374, 1997.

Page 19: The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Association Rule Mining COMP 790-90 Seminar GNET 713 BCB Module Spring 2007

COMP 790-090 Data Mining: Concepts, Algorithms, and Applications

19

References. Extensions of Scopes

S. Brin, R. Motwani, and C. Silverstein. Beyond market basket: Generalizing association rules to correlations. SIGMOD'97, 265-276, Tucson, Arizona.J. Han and Y. Fu. Discovery of multiple-level association rules from large databases. VLDB'95, 420-431, Zurich, Switzerland.M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A.I. Verkamo. Finding interesting rules from large sets of discovered association rules. CIKM'94, 401-408, Gaithersburg, Maryland.F. Korn, A. Labrinidis, Y. Kotidis, and C. Faloutsos. Ratio rules: A new paradigm for fast, quantifiable data mining. VLDB'98, 582-593, New York, NY.Wei Wang, Jiong Yang, Philip S. Yu: Efficient mining of weighted association rules (WAR). KDD 2000: 270-274 Wei Wang, Jiong Yang, Richard R. Muntz: TAR: Temporal Association Rules on Evolving Numerical Attributes. ICDE 2001: 283-292

Page 20: The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Association Rule Mining COMP 790-90 Seminar GNET 713 BCB Module Spring 2007

COMP 790-090 Data Mining: Concepts, Algorithms, and Applications

20

References: Extensions of Scopes

B. Lent, A. Swami, and J. Widom. Clustering association rules. ICDE'97, 220-231, Birmingham, England.R. Meo, G. Psaila, and S. Ceri. A new SQL-like operator for mining association rules. VLDB'96, 122-133, Bombay, India.R.J. Miller and Y. Yang. Association rules over interval data. SIGMOD'97, 452-461, Tucson, Arizona.A. Savasere, E. Omiecinski, and S. Navathe. Mining for strong negative associations in a large database of customer transactions. ICDE'98, 494-502, Orlando, FL, Feb. 1998.D. Tsur, J. D. Ullman, S. Abitboul, C. Clifton, R. Motwani, and S. Nestorov. Query flocks: A generalization of association-rule mining. SIGMOD'98, 1-12, Seattle, Washington.J. Pei, A.K.H. Tung, J. Han. Fault-Tolerant Frequent Pattern Mining: Problems and Challenges. SIGMOD DMKD’01, Santa Barbara, CA.

Page 21: The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Association Rule Mining COMP 790-90 Seminar GNET 713 BCB Module Spring 2007

COMP 790-090 Data Mining: Concepts, Algorithms, and Applications

21

References: Mining Max-patterns and Closed ItemsetsR. J. Bayardo. Efficiently mining long patterns from databases. SIGMOD'98, 85-93, Seattle, Washington.J. Pei, J. Han, and R. Mao, "CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets", Proc. 2000 ACM-SIGMOD Int. Workshop on Data Mining and Knowledge Discovery (DMKD'00), Dallas, TX, May 2000.N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Discovering frequent closed itemsets for association rules. ICDT'99, 398-416, Jerusalem, Israel, Jan. 1999.M. Zaki. Generating Non-Redundant Association Rules. KDD'00. Boston, MA. Aug. 2000.M. Zaki. CHARM: An Efficient Algorithm for Closed Association Rule Mining, TR99-10, Department of Computer Science, Rensselaer Polytechnic Institute.M. Zaki, Fast Vertical Mining Using Diffsets, TR01-1, Department of Computer Science, Rensselaer Polytechnic Institute.

Page 22: The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Association Rule Mining COMP 790-90 Seminar GNET 713 BCB Module Spring 2007

COMP 790-090 Data Mining: Concepts, Algorithms, and Applications

22

References: Constraint-base Frequent-pattern Mining

G. Grahne, L. Lakshmanan, and X. Wang. Efficient mining of constrained correlated sets. ICDE'00, 512-521, San Diego, CA, Feb. 2000.

Y. Fu and J. Han. Meta-rule-guided mining of association rules in relational databases. KDOOD'95, 39-46, Singapore, Dec. 1995.

J. Han, L. V. S. Lakshmanan, and R. T. Ng, "Constraint-Based, Multidimensional Data Mining", COMPUTER (special issues on Data Mining), 32(8): 46-50, 1999.

L. V. S. Lakshmanan, R. Ng, J. Han and A. Pang, "Optimization of Constrained Frequent Set Queries with 2-Variable Constraints", SIGMOD’99

Page 23: The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Association Rule Mining COMP 790-90 Seminar GNET 713 BCB Module Spring 2007

COMP 790-090 Data Mining: Concepts, Algorithms, and Applications

23

References: Constraint-base Frequent-pattern Mining

R. Ng, L.V.S. Lakshmanan, J. Han & A. Pang. “Exploratory mining and pruning optimizations of constrained association rules.” SIGMOD’98

J. Pei, J. Han, and L. V. S. Lakshmanan, "Mining Frequent Itemsets with Convertible Constraints", Proc. 2001 Int. Conf. on Data Engineering (ICDE'01), April 2001.

J. Pei and J. Han "Can We Push More Constraints into Frequent Pattern Mining?", Proc. 2000 Int. Conf. on Knowledge Discovery and Data Mining (KDD'00), Boston, MA, August 2000.

R. Srikant, Q. Vu, and R. Agrawal. Mining association rules with item constraints. KDD'97, 67-73, Newport Beach, California.

Page 24: The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Association Rule Mining COMP 790-90 Seminar GNET 713 BCB Module Spring 2007

COMP 790-090 Data Mining: Concepts, Algorithms, and Applications

24

References: Sequential Pattern Mining Methods

R. Agrawal and R. Srikant. Mining sequential patterns. ICDE'95, 3-14, Taipei, Taiwan. R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. EDBT’96.J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, M.-C. Hsu, "FreeSpan: Frequent Pattern-Projected Sequential Pattern Mining", Proc. 2000 Int. Conf. on Knowledge Discovery and Data Mining (KDD'00), Boston, MA, August 2000.H. Mannila, H Toivonen, and A. I. Verkamo. Discovery of frequent episodes in event sequences. Data Mining and Knowledge Discovery, 1:259-289, 1997.J. Pei, J. Han, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu, "PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth", Proc. 2001 Int. Conf. on Data Engineering (ICDE'01), Heidelberg, Germany, April 2001.

Page 25: The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Association Rule Mining COMP 790-90 Seminar GNET 713 BCB Module Spring 2007

COMP 790-090 Data Mining: Concepts, Algorithms, and Applications

25

References: Sequential Pattern Mining Methods

B. Ozden, S. Ramaswamy, and A. Silberschatz. Cyclic association rules. ICDE'98, 412-421, Orlando, FL.S. Ramaswamy, S. Mahajan, and A. Silberschatz. On the discovery of interesting patterns in association rules. VLDB'98, 368-379, New York, NY.M.J. Zaki. Efficient enumeration of frequent sequences. CIKM’98. Novermber 1998.M.N. Garofalakis, R. Rastogi, K. Shim: SPIRIT: Sequential Pattern Mining with Regular Expression Constraints. VLDB 1999: 223-234, Edinburgh, Scotland.Wei Wang, Jiong Yang, Philip S. Yu: Mining Patterns in Long Sequential Data with Noise. SIGKDD Explorations 2(2): 28-33 (2000) Jiong Yang, Wei Wang, Philip S. Yu, Jiawei Han: Mining Long Sequential Patterns in a Noisy Environment. SIGMOD Conference 2002

Page 26: The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Association Rule Mining COMP 790-90 Seminar GNET 713 BCB Module Spring 2007

COMP 790-090 Data Mining: Concepts, Algorithms, and Applications

26

References: Periodic Pattern Mining Methods

Jiawei Han, Wan Gong, Yiwen Yin: Mining Segment-Wise Periodic Patterns in Time-Related Databases. KDD 1998: 214-218

Jiawei Han, Guozhu Dong, Yiwen Yin: Efficient Mining of Partial Periodic Patterns in Time Series Database. ICDE 1999: 106-115

Jiong Yang, Wei Wang, Philip S. Yu: Mining asynchronous periodic patterns in time series data. KDD 2000: 275-279

Wei Wang, Jiong Yang, Philip S. Yu: Meta-patterns: Revealing Hidden Periodic Patterns. ICDM 2001: 550-557

Jiong Yang, Wei Wang, Philip S. Yu: Infominer: mining surprising periodic patterns. KDD 2001: 395-400

Page 27: The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Association Rule Mining COMP 790-90 Seminar GNET 713 BCB Module Spring 2007

COMP 790-090 Data Mining: Concepts, Algorithms, and Applications

27

References: Mining Various Databases

K. Koperski, J. Han, and G. B. Marchisio, "Mining Spatial and Image Data through Progressive Refinement Methods", Revue internationale de gomatique (European Journal of GIS and Spatial Analysis), 9(4):425-440, 1999.

A. K. H. Tung, H. Lu, J. Han, and L. Feng, "Breaking the Barrier of Transactions: Mining Inter-Transaction Association Rules", Proc. 1999 Int. Conf. on Knowledge Discovery and Data Mining (KDD'99), San Diego, CA, Aug. 1999, pp. 297-301.

Page 28: The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Association Rule Mining COMP 790-90 Seminar GNET 713 BCB Module Spring 2007

COMP 790-090 Data Mining: Concepts, Algorithms, and Applications

28

References: Mining Various Databases

H. Lu, L. Feng, and J. Han, "Beyond Intra-Transaction Association Analysis:Mining Multi-Dimensional Inter-Transaction Association Rules", ACM Transactions on Information Systems (TOIS’00), 18(4): 423-454, 2000.

O. R. Zaiane, M. Xin, J. Han, "Discovering Web Access Patterns and Trends by Applying OLAP and Data Mining Technology on Web Logs," Proc. Advances in Digital Librar ies Conf. (ADL'98), Santa Barbara, CA, April 1998, pp. 19-29

O. R. Zaiane, J. Han, and H. Zhu, "Mining Recurrent Items in Multimedia with Progressive Resolution Refinement", Proc. 2000 Int. Conf. on Data Engineering (ICDE'00), San Diego, CA, Feb. 2000, pp. 461-470.

Page 29: The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Association Rule Mining COMP 790-90 Seminar GNET 713 BCB Module Spring 2007

COMP 790-090 Data Mining: Concepts, Algorithms, and Applications

29

References: Frequent-pattern Mining for Classification and Data Cube

K. Beyer and R. Ramakrishnan. Bottom-up computation of sparse and iceberg cubes. SIGMOD'99, 359-370, Philadelphia, PA, June 1999.M. Fang, N. Shivakumar, H. Garcia-Molina, R. Motwani, and J. D. Ullman. Computing iceberg queries efficiently. VLDB'98, 299-310, New York, NY, Aug. 1998.J. Han, J. Pei, G. Dong, and K. Wang, “Computing Iceberg Data Cubes with Complex Measures”, Proc. ACM-SIGMOD’2001, Santa Barbara, CA, May 2001.M. Kamber, J. Han, and J. Y. Chiang. Metarule-guided mining of multi-dimensional association rules using data cubes. KDD'97, 207-210, Newport Beach, California.K. Beyer and R. Ramakrishnan. Bottom-up computation of sparse and iceberg cubes. SIGMOD’99T. Imielinski, L. Khachiyan, and A. Abdulghani. Cubegrades: Generalizing association rules. Technical Report, Aug. 2000

Page 30: The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Association Rule Mining COMP 790-90 Seminar GNET 713 BCB Module Spring 2007

The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL

THANK YOU!