UNIT 4
Data Mining: Concepts and Techniques

Page 1: UNIT 4

Page 2: Chapter 5: Mining Frequent Patterns, Associations and Correlations

- Basic concepts and a road map
- Efficient and scalable frequent itemset mining methods
- Mining various kinds of association rules
- From association mining to correlation analysis

Page 3: What Is Frequent Pattern Analysis?

- Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
- First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining
- Applications: basket data analysis, cross-marketing, catalog design, sales campaign analysis, Web log (click stream) analysis, and DNA sequence analysis

Page 4: Why Is Frequent Pattern Mining Important?

- Discloses an intrinsic and important property of data sets
- Forms the foundation for many essential data mining tasks:
  - association, correlation, and causality analysis
  - sequential and structural (e.g., sub-graph) patterns
  - pattern analysis in spatiotemporal, multimedia, time-series, and stream data
  - classification: associative classification
  - cluster analysis: frequent pattern-based clustering
  - data warehousing: iceberg cube and cube-gradient
  - semantic data compression: fascicles
- Broad applications

Page 5: Basic Concepts: Frequent Patterns and Association Rules

- Itemset X = {x1, ..., xk}
- Find all the rules X => Y with minimum support and confidence:
  - support, s: probability that a transaction contains X ∪ Y
  - confidence, c: conditional probability that a transaction having X also contains Y

Transaction-id | Items bought
10 | A, B, D
20 | A, C, D
30 | A, D, E
40 | B, E, F
50 | B, C, D, E, F

Let sup_min = 50%, conf_min = 50%.
Frequent patterns: {A:3, B:3, D:4, E:3, AD:3}
Association rules: A => D (60%, 100%), D => A (60%, 75%)

(Venn diagram: customers who buy beer, customers who buy diapers, and customers who buy both.)
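The support and confidence figures above can be checked mechanically. A minimal sketch (illustrative, not part of the original slides), using the transaction table just shown:

```python
transactions = {
    10: {"A", "B", "D"},
    20: {"A", "C", "D"},
    30: {"A", "D", "E"},
    40: {"B", "E", "F"},
    50: {"B", "C", "D", "E", "F"},
}

def support(itemset):
    """Fraction of transactions containing every item of `itemset`."""
    return sum(1 for t in transactions.values() if itemset <= t) / len(transactions)

def confidence(x, y):
    """conf(X => Y) = sup(X u Y) / sup(X)."""
    return support(x | y) / support(x)

print(support({"A", "D"}))       # 0.6  -> A => D has 60% support
print(confidence({"A"}, {"D"}))  # 1.0  -> 100% confidence
print(confidence({"D"}, {"A"}))  # 0.75 -> 75% confidence
```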

Page 6: Closed Patterns and Max-Patterns

- A long pattern contains a combinatorial number of sub-patterns; e.g., {a1, ..., a100} contains $\binom{100}{1} + \binom{100}{2} + \cdots + \binom{100}{100} = 2^{100} - 1 \approx 1.27 \times 10^{30}$ sub-patterns!
- Solution: mine closed patterns and max-patterns instead
- An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X (proposed by Pasquier, et al. @ ICDT'99)
- An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X (proposed by Bayardo @ SIGMOD'98)
- Closed patterns are a lossless compression of frequent patterns, reducing the number of patterns and rules
Page 7: Closed Patterns and Max-Patterns

Exercise. DB = {<a1, ..., a100>, <a1, ..., a50>}, min_sup = 1.

- What is the set of closed itemsets?
  - <a1, ..., a100>: 1
  - <a1, ..., a50>: 2
- What is the set of max-patterns?
  - <a1, ..., a100>: 1
- What is the set of all patterns?
  - Far too many to enumerate: every nonempty subset of {a1, ..., a100} is frequent, i.e., 2^100 - 1 patterns!

Page 8: Chapter 5: Mining Frequent Patterns, Associations and Correlations

- Basic concepts and a road map
- Efficient and scalable frequent itemset mining methods
- Mining various kinds of association rules
- From association mining to correlation analysis

Page 9: Scalable Methods for Mining Frequent Patterns

- The downward closure property of frequent patterns: any subset of a frequent itemset must be frequent
  - If {beer, diaper, nuts} is frequent, so is {beer, diaper}; i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper}
- Scalable mining methods: three major approaches
  - Apriori (Agrawal & Srikant @ VLDB'94)
  - Frequent pattern growth (FP-growth: Han, Pei & Yin @ SIGMOD'00)
  - Vertical data format approach (CHARM: Zaki & Hsiao @ SDM'02)

Page 10: Frequent itemsets

(Figure: illustration of frequent itemsets; not reproduced in this transcript.)

Page 11:

Association rule mining can be viewed as a two-step process:
1) Find all frequent itemsets
2) Generate strong association rules from the frequent itemsets

Page 12: Apriori: A Candidate Generation-and-Test Approach

- Apriori pruning principle: if any itemset is infrequent, its supersets should not be generated or tested! (Agrawal & Srikant @ VLDB'94; Mannila, et al. @ KDD'94)
- Method:
  - Initially, scan the DB once to get the frequent 1-itemsets
  - Generate length-(k+1) candidate itemsets from the length-k frequent itemsets
  - Test the candidates against the DB
  - Terminate when no frequent or candidate set can be generated

Page 13: The Apriori Algorithm: An Example (sup_min = 2)

Database TDB:
Tid | Items
10 | A, C, D
20 | B, C, E
30 | A, B, C, E
40 | B, E

1st scan -> C1: {A}: 2, {B}: 3, {C}: 3, {D}: 1, {E}: 3

L1 (drop {D}, support < 2): {A}: 2, {B}: 3, {C}: 3, {E}: 3

C2 (self-join L1): {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}

2nd scan -> C2 with counts: {A,B}: 1, {A,C}: 2, {A,E}: 1, {B,C}: 2, {B,E}: 3, {C,E}: 2

L2: {A,C}: 2, {B,C}: 2, {B,E}: 3, {C,E}: 2

C3: {B,C,E}

3rd scan -> L3: {B,C,E}: 2

Page 14: The Apriori Algorithm

Pseudo-code (Ck: candidate itemsets of size k; Lk: frequent itemsets of size k); a runnable sketch follows below:

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
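The pseudo-code translates almost line for line into Python. A compact sketch (an illustration under the obvious data model, not the book's reference implementation):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset (frozenset): absolute support} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    # L1: frequent 1-itemsets
    counts = {}
    for t in transactions:
        for item in t:
            s = frozenset([item])
            counts[s] = counts.get(s, 0) + 1
    L = {s: c for s, c in counts.items() if c >= min_support}
    frequent, k = dict(L), 1
    while L:
        # C_{k+1}: self-join L_k, keeping only candidates all of whose
        # k-subsets are frequent (the Apriori pruning step)
        prev = set(L)
        candidates = {a | b for a in prev for b in prev
                      if len(a | b) == k + 1
                      and all(frozenset(s) in prev for s in combinations(a | b, k))}
        # one scan of the database counts all candidates
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        L = {s: n for s, n in counts.items() if n >= min_support}
        frequent.update(L)
        k += 1
    return frequent

# The TDB of the Page 13 example, minimum support count = 2:
tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
for s, n in sorted(apriori(tdb, 2).items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(s), n)
```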

Page 15: Important Details of Apriori

- How to generate candidates?
  - Step 1: self-joining Lk
  - Step 2: pruning
- How to count supports of candidates?
- Example of candidate generation:
  - L3 = {abc, abd, acd, ace, bcd}
  - Self-joining L3 * L3:
    - abcd from abc and abd
    - acde from acd and ace
  - Pruning: acde is removed because ade is not in L3
  - C4 = {abcd}

Page 16: How to Generate Candidates?

Suppose the items in each itemset of Lk-1 are listed in an order.

Step 1: self-joining Lk-1

insert into Ck
select p.item_1, p.item_2, ..., p.item_(k-1), q.item_(k-1)
from Lk-1 p, Lk-1 q
where p.item_1 = q.item_1, ..., p.item_(k-2) = q.item_(k-2), p.item_(k-1) < q.item_(k-1)

Step 2: pruning

forall itemsets c in Ck do
    forall (k-1)-subsets s of c do
        if (s is not in Lk-1) then delete c from Ck
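A direct transcription of the join-and-prune steps, assuming each itemset in L_{k-1} is kept as a lexicographically sorted tuple (a sketch, not the book's code):

```python
from itertools import combinations

def gen_candidates(L_prev, k):
    """Build C_k from the sorted-tuple itemsets in L_{k-1}."""
    L_set = set(L_prev)
    C = []
    for p in L_prev:
        for q in L_prev:
            # join: first k-2 items equal, last item of p < last item of q
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                c = p + (q[-1],)
                # prune: every (k-1)-subset of c must be in L_{k-1}
                if all(s in L_set for s in combinations(c, k - 1)):
                    C.append(c)
    return C

L3 = [("a","b","c"), ("a","b","d"), ("a","c","d"), ("a","c","e"), ("b","c","d")]
print(gen_candidates(L3, 4))   # [('a', 'b', 'c', 'd')] -- acde is pruned
```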

Page 17: Challenges of Frequent Pattern Mining

- Multiple scans of the transaction database
- Huge number of candidates
- Tedious workload of support counting for candidates

Page 18: Improving the Efficiency of Apriori

Variations of Apriori to improve efficiency:
1) Hash-based technique
2) Transaction reduction
3) Partitioning
4) Sampling
5) Dynamic itemset counting

Page 19: Method 1) Hash-based technique (hashing itemsets into corresponding buckets)

A hash-based technique can be used to reduce the size of the candidate k-itemsets, Ck, for k > 1: while scanning for the frequent 1-itemsets, each transaction's 2-itemsets are hashed into buckets and the bucket counts accumulated; any 2-itemset whose bucket count is below the minimum support count cannot be frequent and can be removed from C2.
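A minimal sketch of the bucket-counting idea (the bucket count, hash function, and helper names here are illustrative choices, not from the slides):

```python
from itertools import combinations

def bucket_counts(transactions, n_buckets=7):
    """While scanning for L1, hash every 2-itemset of each transaction
    into a bucket and accumulate the bucket counts."""
    buckets = [0] * n_buckets
    for t in transactions:
        for pair in combinations(sorted(t), 2):
            buckets[hash(pair) % n_buckets] += 1
    return buckets

def prune_C2(pairs, buckets, min_support):
    """A 2-itemset whose bucket count is already below min_support
    cannot be frequent, so drop it from C2."""
    n = len(buckets)
    return [p for p in pairs if buckets[hash(tuple(sorted(p))) % n] >= min_support]
```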

Page 20: Method 2) Transaction reduction (reducing the number of transactions scanned in future iterations)

A transaction that does not contain any frequent k-itemsets cannot contain any frequent (k+1)-itemsets. Therefore, such a transaction can be marked or removed from further consideration, because subsequent scans of the database for j-itemsets, where j > k, will not require it.
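This rule is a one-line filter between scans. A sketch (L_k is assumed to be a collection of frozensets):

```python
def reduce_transactions(transactions, L_k):
    """Keep only transactions that still contain at least one frequent
    k-itemset; the rest cannot contribute any frequent (k+1)-itemset."""
    return [t for t in transactions if any(s <= t for s in L_k)]
```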

Page 21: Method 3) Partitioning (partitioning the data to find candidate itemsets)

- Scan the database only twice (see the sketch below)
- Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
- Scan 1: partition the database and find the local frequent patterns
  - minimum support count for each partition = min_support * number of transactions in that partition
- Scan 2: consolidate the global frequent patterns
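The two scans can be sketched as follows, reusing the `apriori` sketch from Page 14 as the local miner (the partition count and helper names are illustrative):

```python
def partitioned_frequent(transactions, min_support_frac, n_parts=4):
    parts = [transactions[i::n_parts] for i in range(n_parts)]
    # Scan 1: local frequent itemsets per partition; the local minimum
    # support count scales with the partition size.
    candidates = set()
    for p in parts:
        local_min = max(1, int(min_support_frac * len(p)))
        candidates |= set(apriori(p, local_min))
    # Scan 2: count every global candidate once over the full database.
    need = min_support_frac * len(transactions)
    counts = {c: sum(c <= frozenset(t) for t in transactions) for c in candidates}
    return {c: n for c, n in counts.items() if n >= need}
```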

Page 22: (Figure slide; not reproduced in this transcript.)

Page 23: Method 4) Sampling (mining on a subset of the given data)

- The basic idea of the sampling approach is to pick a random sample S of the given data D, and then search for frequent itemsets in S instead of D.
- The sample size of S is chosen so that the search for frequent itemsets in S can be done in main memory, so only one scan of the transactions in S is required overall.
- Because we are searching for frequent itemsets in S rather than in D, it is possible that we will miss some of the global frequent itemsets.
- To lessen this possibility, we use a lower support threshold than the minimum support to find the itemsets frequent locally in S (denoted L^S). The rest of the database is then used to compute the actual frequencies of each itemset in L^S.
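A sketch of this pick-sample, mine-low, verify-globally scheme; the sample fraction and the lowering factor are illustrative choices, and it reuses the `apriori` sketch from Page 14:

```python
import random

def sample_then_verify(transactions, min_support_frac, sample_frac=0.1):
    k = max(1, int(sample_frac * len(transactions)))
    S = random.sample(transactions, k)
    lowered = max(1, int(0.75 * min_support_frac * len(S)))  # below min support
    L_S = apriori(S, lowered)             # itemsets frequent locally in S
    # one extra pass over D computes the actual frequencies of L^S
    need = min_support_frac * len(transactions)
    counts = {c: sum(c <= frozenset(t) for t in transactions) for c in L_S}
    return {c: n for c, n in counts.items() if n >= need}
```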

Page 24: Method 5) Dynamic itemset counting (adding candidate itemsets at different points during a scan)

- A dynamic itemset counting technique partitions the database into blocks marked by start points.
- In this variation, new candidate itemsets can be added at any start point, unlike in Apriori, which determines new candidate itemsets only immediately before each complete database scan.

Page 25: Bottleneck of Frequent-Pattern Mining

- Multiple database scans are costly
- Mining long patterns needs many passes of scanning and generates lots of candidates
  - To find the frequent itemset i1 i2 ... i100:
    - number of scans: 100
    - number of candidates: $\binom{100}{1} + \binom{100}{2} + \cdots + \binom{100}{100} = 2^{100} - 1 \approx 1.27 \times 10^{30}$!
- Bottleneck: candidate generation-and-test
- Can we avoid candidate generation?

Page 26: Mining Frequent Itemsets Without Generating Candidates

1) Vertical data format
2) FP-growth

Page 27: Mining Frequent Itemsets Using the Vertical Data Format

In the vertical data format, the data are represented as item: TID_set pairs, where each item is associated with the set of transaction IDs containing it; the support of an itemset is then obtained by intersecting the TID sets of its items.

Page 28: (Example table for vertical-format mining, minimum support = 2; not reproduced in this transcript.)

Page 29: Question) Find the frequent itemsets using the vertical data format.
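Since the example table is not reproduced in this transcript, the sketch below runs on illustrative toy data; the technique is the TID-set intersection (in the spirit of ECLAT/CHARM) that the slides describe:

```python
def vertical_mine(tidsets, min_support):
    """tidsets: {item: set of TIDs}. Support of an itemset = size of the
    intersection of its items' TID sets."""
    frequent = {}
    items = sorted(i for i, t in tidsets.items() if len(t) >= min_support)

    def grow(prefix, prefix_tids, rest):
        for idx, item in enumerate(rest):
            tids = prefix_tids & tidsets[item]
            if len(tids) >= min_support:
                frequent[prefix + (item,)] = len(tids)
                grow(prefix + (item,), tids, rest[idx + 1:])

    universe = set().union(*tidsets.values())
    grow((), universe, items)
    return frequent

data = {"A": {10, 20, 30}, "B": {10, 40, 50}, "D": {10, 20, 30, 50}, "E": {30, 40, 50}}
print(vertical_mine(data, 2))  # e.g. ('A', 'D') has support |{10, 20, 30}| = 3
```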

Page 30: Mining Various Kinds of Association Rules

- Mining multilevel associations
- Mining multidimensional associations
- Mining quantitative associations

Page 31: Mining Multiple-Level Association Rules

- Items often form concept hierarchies
- Flexible support settings: items at a lower level are expected to have lower support

Page 32: (Figure slide; not reproduced in this transcript.)

Page 33:

Association rules generated from mining data at multiple levels of abstraction are called multiple-level or multilevel association rules.

Various ways of setting the minimum support:
1) Using uniform support for all levels
2) Using reduced support at lower levels
3) Using item- or group-based minimum support (referred to as group-based support), because users or experts often have insight as to which groups are more important than others

Page 34: (Figure slide; not reproduced in this transcript.)

Page 35: Mining Multi-Dimensional Associations

- Single-dimensional rules: buys(X, "milk") => buys(X, "bread")
- Multi-dimensional rules: two or more dimensions or predicates
  - Inter-dimension association rules (no repeated predicates):
    age(X, "19-25") ∧ occupation(X, "student") => buys(X, "coke")
  - Hybrid-dimension association rules (repeated predicates):
    age(X, "19-25") ∧ buys(X, "popcorn") => buys(X, "coke")
- Categorical attributes: finite number of possible values, no ordering among values (occupation, color)
- Quantitative attributes: numeric, implicit ordering among values (income, age, price); handled by discretization, clustering, and gradient approaches

Page 36: Mining Quantitative Associations

Techniques can be categorized by how numerical attributes, such as age or salary, are treated:

1. Static discretization, based on predefined concept hierarchies (data cube methods); this discretization occurs before mining.
2. Dynamic discretization, where values are discretized or clustered into bins based on the distribution of the data.

Searching for frequent itemsets becomes searching for frequent k-predicate sets. A k-predicate set is a set containing k conjunctive predicates; {age, occupation, buys} is a 3-predicate set.

Page 37: Static Discretization of Quantitative Attributes

- Discretized prior to mining using a concept hierarchy: numeric values are replaced by ranges
- In a relational database, finding all frequent k-predicate sets requires k or k+1 table scans
- A data cube is well suited for mining: the cells of an n-dimensional cuboid correspond to the predicate sets, and mining from data cubes can be much faster

(Figure: lattice of cuboids () at the apex; (age), (income), (buys); (age, income), (age, buys), (income, buys); (age, income, buys) at the base.)

Page 38: Mining Quantitative Association Rules

- Numeric attributes are dynamically discretized during the mining process so as to satisfy some mining criterion, such as maximizing the confidence or compactness of the rules
- 2-D quantitative association rules: A_quan1 ∧ A_quan2 => A_cat
- Cluster adjacent association rules to form general rules using a 2-D grid
- Example: age(X, "34-35") ∧ income(X, "30-50K") => buys(X, "high resolution TV")

Page 39: Discretization

Finding frequent predicate sets: once the 2-D array containing the count distribution for each pair of categories is set up, it can be scanned to find the frequent predicate sets (those satisfying minimum support). Strong association rules (those also satisfying minimum confidence) can then be generated from these predicate sets using a rule-generation algorithm.

Page 40: Clustering the Association Rules

"Can we find a simpler rule to replace the above four rules?" Notice that the four rules (shown in a figure not reproduced in this transcript) are quite "close" to one another, forming a rule cluster on the grid. Indeed, they can be combined, or "clustered", to form a single simpler rule that subsumes and replaces all four.

Page 41: (Figure slide; not reproduced in this transcript.)

Page 42: (Figure slide; not reproduced in this transcript.)

Page 43: Mining Frequent Patterns Without Candidate Generation

Grow long patterns from short ones using local frequent items:
- "abc" is a frequent pattern
- Get all transactions having "abc": DB|abc
- "d" is a local frequent item in DB|abc, so "abcd" is a frequent pattern

Page 44: Construct an FP-tree from a Transaction Database (min_support = 3)

TID | Items bought | (ordered) frequent items
100 | {f, a, c, d, g, i, m, p} | {f, c, a, m, p}
200 | {a, b, c, f, l, m, o} | {f, c, a, b, m}
300 | {b, f, h, j, o, w} | {f, b}
400 | {b, c, k, s, p} | {c, b, p}
500 | {a, f, c, e, l, p, m, n} | {f, c, a, m, p}

1. Scan the DB once and find the frequent 1-itemsets (single-item patterns)
2. Sort the frequent items in frequency-descending order into the f-list: f-c-a-b-m-p
3. Scan the DB again and construct the FP-tree (a construction sketch follows below)

Header table (item: frequency): f: 4, c: 4, a: 3, b: 3, m: 3, p: 3

(FP-tree: root {}; branch f:4 -> c:3 -> a:3, where a:3 has children m:2 -> p:2 and b:1 -> m:1; f:4 also has a child b:1; a second branch c:1 -> b:1 -> p:1.)
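A minimal construction sketch following steps 1-3 above (node-links and the FP-growth recursion are omitted; see the sketch after Page 51 for the mining side):

```python
from collections import Counter

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 1, {}

def build_fp_tree(transactions, min_support):
    # step 1: one scan for frequent single items
    freq = Counter(i for t in transactions for i in t)
    flist = [i for i, n in freq.most_common() if n >= min_support]
    rank = {item: r for r, item in enumerate(flist)}
    # step 3: second scan, inserting each transaction's frequent items
    root = Node(None, None)
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in rank), key=rank.get):  # step 2
            child = node.children.get(item)
            if child:
                child.count += 1
            else:
                child = node.children[item] = Node(item, node)
            node = child
    return root, flist

tdb = [set("facdgimp"), set("abcflmo"), set("bfhjow"), set("bcksp"), set("afcelpmn")]
root, flist = build_fp_tree(tdb, 3)
print(flist)  # ['f', 'c', 'a', 'b', 'm', 'p'] (equal-frequency ties may reorder)
```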

Page 45: Benefits of the FP-tree Structure

- Completeness
  - Preserves complete information for frequent pattern mining
  - Never breaks a long pattern of any transaction
- Compactness
  - Reduces irrelevant information: infrequent items are gone
  - Items are in frequency-descending order: the more frequently an item occurs, the more likely its node is shared
  - Never larger than the original database (not counting node-links and the count fields)
  - For the Connect-4 DB, the compression ratio can be over 100

Page 46: Partition Patterns and Databases

- Frequent patterns can be partitioned into subsets according to the f-list (f-c-a-b-m-p):
  - patterns containing p
  - patterns having m but no p
  - ...
  - patterns having c but none of a, b, m, p
  - pattern f
- This partitioning is complete and non-redundant

Page 47: Find Patterns Having p From p's Conditional Database

- Start at the frequent-item header table of the FP-tree
- Traverse the FP-tree by following the node-links of each frequent item p
- Accumulate all of the transformed prefix paths of item p to form p's conditional pattern base

Conditional pattern bases:
item | conditional pattern base
c | f:3
a | fc:3
b | fca:1, f:1, c:1
m | fca:2, fcab:1
p | fcam:2, cb:1

(FP-tree and header table as on Page 44.)

Page 48: From Conditional Pattern Bases to Conditional FP-trees

For each pattern base:
- Accumulate the count for each item in the base
- Construct the FP-tree for the frequent items of the pattern base

m's conditional pattern base: fca:2, fcab:1
m-conditional FP-tree: {} -> f:3 -> c:3 -> a:3 (b is dropped: its count 1 is below min_support = 3)

All frequent patterns relating to m: m, fm, cm, am, fcm, fam, cam, fcam

(Global FP-tree and header table as on Page 44.)

Page 49: Recursion: Mining Each Conditional FP-tree

Starting from the m-conditional FP-tree ({} -> f:3 -> c:3 -> a:3):
- conditional pattern base of "am": (fc:3); am-conditional FP-tree: {} -> f:3 -> c:3
- conditional pattern base of "cm": (f:3); cm-conditional FP-tree: {} -> f:3
- conditional pattern base of "cam": (f:3); cam-conditional FP-tree: {} -> f:3

Page 50: A Special Case: A Single Prefix Path in the FP-tree

- Suppose a (conditional) FP-tree T has a shared single prefix path P
- Mining can be decomposed into two parts:
  - reduction of the single prefix path into one node
  - concatenation of the mining results of the two parts

(Figure: a tree with single prefix path {} -> a1:n1 -> a2:n2 -> a3:n3 above a multipath part with nodes b1:m1, C1:k1, C2:k2, C3:k3; it is split into the prefix-path part, reduced to a single node r1, plus the multipath part rooted at r1.)

Page 51: Mining Frequent Patterns With FP-trees

- Idea: frequent pattern growth
  - recursively grow frequent patterns by pattern and database partition
- Method (a sketch of the recursion follows below):
  - For each frequent item, construct its conditional pattern base, and then its conditional FP-tree
  - Repeat the process on each newly created conditional FP-tree
  - Until the resulting FP-tree is empty, or it contains only one path; a single path generates all the combinations of its sub-paths, each of which is a frequent pattern
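The recursion can be sketched compactly by representing each conditional database as a list of (f-list-ordered path, count) pairs, mirroring the conditional pattern bases of Pages 47-48 (an illustration, not the paper's implementation):

```python
from collections import Counter

def fp_growth(pattern_base, min_support, suffix=()):
    """pattern_base: list of (path_tuple, count). Returns {pattern: support}."""
    counts = Counter()
    for path, n in pattern_base:
        for item in path:
            counts[item] += n
    frequent = {}
    for item, n in counts.items():
        if n < min_support:
            continue
        pattern = suffix + (item,)
        frequent[pattern] = n
        # item's conditional pattern base: the prefix of each path before item
        cond = [(path[:path.index(item)], c)
                for path, c in pattern_base if item in path]
        frequent.update(fp_growth(cond, min_support, pattern))
    return frequent

# the five transactions of Page 44, already reduced to ordered frequent items:
base = [(("f","c","a","m","p"), 1), (("f","c","a","b","m"), 1),
        (("f","b"), 1), (("c","b","p"), 1), (("f","c","a","m","p"), 1)]
print(fp_growth(base, 3))  # includes ('m',): 3, ('m', 'f'): 3, ..., ('p', 'c'): 3
```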

Page 52: Scaling FP-growth by Database Projection

- What if the FP-tree cannot fit in memory? Use database projection:
  - first partition the database into a set of projected databases
  - then construct and mine an FP-tree for each projected database
- Parallel projection vs. partition projection techniques: parallel projection is space-costly

Page 53: Partition-based Projection

- Parallel projection needs a lot of disk space; partition projection saves it

Trans. DB: fcamp, fcabm, fb, cbp, fcamp
- p-proj DB: fcam, cb, fcam
- m-proj DB: fcab, fca, fca
- b-proj DB: fcb ...
- a-proj DB: fc ...
- c-proj DB: f ...
- f-proj DB: ...
- am-proj DB: fc, fc, fc
- cm-proj DB: f, f, f

Page 54: FP-Growth vs. Apriori: Scalability With the Support Threshold

(Figure: run time (sec., 0-100) vs. support threshold (%, 0-3); series: D1 FP-growth runtime and D1 Apriori runtime; data set T25I20D10K.)

Page 55: FP-Growth vs. Tree-Projection: Scalability With the Support Threshold

(Figure: run time (sec., 0-140) vs. support threshold (%, 0-2); series: D2 FP-growth and D2 TreeProjection; data set T25I20D100K.)

Page 56: Why Is FP-Growth the Winner?

- Divide-and-conquer:
  - decomposes both the mining task and the DB according to the frequent patterns obtained so far
  - leads to focused search of smaller databases
- Other factors:
  - no candidate generation, no candidate test
  - compressed database: the FP-tree structure
  - no repeated scan of the entire database
  - basic operations are counting local frequent items and building sub-FP-trees: no pattern search and matching

Page 57: Implications of the Methodology

- Mining closed frequent itemsets and max-patterns: CLOSET (DMKD'00)
- Mining sequential patterns: FreeSpan (KDD'00), PrefixSpan (ICDE'01)
- Constraint-based mining of frequent patterns: convertible constraints (KDD'00, ICDE'01)
- Computing iceberg data cubes with complex measures: H-tree and H-cubing algorithm (SIGMOD'01)

Page 58: MaxMiner: Mining Max-Patterns

Tid | Items
10 | A, B, C, D, E
20 | B, C, D, E
30 | A, C, D, F

- 1st scan: find the frequent items A, B, C, D, E
- 2nd scan: find the support of the potential max-patterns AB, AC, AD, AE, ABCDE; BC, BD, BE, BCDE; CD, CE, CDE; DE
- Since BCDE is a max-pattern, there is no need to check BCD, BDE, CDE in a later scan
- R. Bayardo. Efficiently mining long patterns from databases. In SIGMOD'98

Page 59: Mining Frequent Closed Patterns: CLOSET (min_sup = 2)

TID | Items
10 | a, c, d, e, f
20 | a, b, e
30 | c, e, f
40 | a, c, d, f
50 | c, e, f

- F-list: list of all frequent items in support-ascending order; here F-list = d-a-f-e-c
- Divide the search space:
  - patterns having d
  - patterns having a but no d, etc.
- Find frequent closed patterns recursively:
  - every transaction having d also has c, f, a, so cfad is a frequent closed pattern
- J. Pei, J. Han & R. Mao. "CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets", DMKD'00

Page 60: CLOSET+: Mining Closed Itemsets by Pattern-Growth

- Itemset merging: if Y appears in every occurrence of X, then Y is merged with X
- Sub-itemset pruning: if Y ⊃ X and sup(X) = sup(Y), then X and all of X's descendants in the set-enumeration tree can be pruned
- Hybrid tree projection:
  - bottom-up physical tree projection
  - top-down pseudo tree projection
- Item skipping: if a local frequent item has the same support in several header tables at different levels, it can be pruned from the header tables at the higher levels
- Efficient subset checking

Page 61: Chapter 5: Mining Frequent Patterns, Associations and Correlations

- Basic concepts and a road map
- Efficient and scalable frequent itemset mining methods
- Mining various kinds of association rules
- From association mining to correlation analysis


Page 63: Interestingness Measure: Correlations (Lift)

- play basketball => eat cereal [40%, 66.7%] is misleading: the overall percentage of students eating cereal is 75% > 66.7%
- play basketball => not eat cereal [20%, 33.3%] is more accurate, although it has lower support and confidence
- Measure of dependent/correlated events: lift

$$\mathrm{lift}(A,B) = \frac{P(A \cup B)}{P(A)P(B)}$$

             Basketball   Not basketball   Sum (row)
Cereal       2000         1750             3750
Not cereal   1000         250              1250
Sum (col.)   3000         2000             5000

$$\mathrm{lift}(B,C) = \frac{2000/5000}{(3000/5000) \times (3750/5000)} = 0.89, \qquad \mathrm{lift}(B,\neg C) = \frac{1000/5000}{(3000/5000) \times (1250/5000)} = 1.33$$
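Both lift values follow directly from the contingency table. A small sketch:

```python
def lift(n_ab, n_a, n_b, n_total):
    """lift(A, B) = P(A and B) / (P(A) * P(B)), from raw counts."""
    return (n_ab / n_total) / ((n_a / n_total) * (n_b / n_total))

print(round(lift(2000, 3000, 3750, 5000), 2))  # basketball & cereal    -> 0.89
print(round(lift(1000, 3000, 1250, 5000), 2))  # basketball & no cereal -> 1.33
```

A lift below 1 indicates negative correlation between the two events; above 1, positive correlation.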

Page 64: Are lift and χ² Good Measures of Correlation?

- "buy walnuts => buy milk [1%, 80%]" is misleading if 85% of customers buy milk
- Support and confidence are not good at representing correlations
- Many interestingness measures have been proposed (Tan, Kumar, Srivastava @ KDD'02)

$$\mathrm{lift}(A,B) = \frac{P(A \cup B)}{P(A)P(B)}, \qquad \mathrm{all\_conf}(X) = \frac{\mathrm{sup}(X)}{\mathrm{max\_item\_sup}(X)}, \qquad \mathrm{coh}(X) = \frac{\mathrm{sup}(X)}{|\mathrm{universe}(X)|}$$

             Milk     No Milk    Sum (row)
Coffee       m, c     ~m, c      c
No Coffee    m, ~c    ~m, ~c     ~c
Sum (col.)   m        ~m

DB   m,c    ~m,c   m,~c    ~m,~c     lift   all-conf   coh    χ²
A1   1000   100    100     10,000    9.26   0.91       0.83   9055
A2   100    1000   1000    100,000   8.44   0.09       0.05   670
A3   1000   100    10000   100,000   9.18   0.09       0.09   8172
A4   1000   1000   1000    1000      1      0.5        0.33   0

Page 65: Which Measures Should Be Used?

- lift and χ² are not good measures of correlation in large transactional DBs
- all-confidence or coherence could be good measures (Omiecinski @ TKDE'03)
- Both all-confidence and coherence have the downward closure property
- Efficient algorithms can be derived for mining them (Lee et al. @ ICDM'03sub)
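A sketch that reproduces the lift, all-confidence, and coherence columns of the Page 64 table from the four raw counts; coherence, sup(X) / |universe(X)|, is computed via inclusion-exclusion over the transactions containing m or c:

```python
def measures(mc, nm_c, m_nc, nm_nc):
    """Counts in order: (m,c), (~m,c), (m,~c), (~m,~c)."""
    total = mc + nm_c + m_nc + nm_nc
    sup_m, sup_c, sup_mc = (mc + m_nc) / total, (mc + nm_c) / total, mc / total
    lift = sup_mc / (sup_m * sup_c)
    all_conf = sup_mc / max(sup_m, sup_c)
    coh = sup_mc / (sup_m + sup_c - sup_mc)  # |universe| = trans. with m or c
    return round(lift, 2), round(all_conf, 2), round(coh, 2)

print(measures(1000, 100, 100, 10000))   # A1 -> (9.26, 0.91, 0.83)
print(measures(1000, 1000, 1000, 1000))  # A4 -> (1.0, 0.5, 0.33)
```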