Frequent Pattern Mining
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 2
Transaction Data Analysis
• Transactions: customers’ purchases of commodities – {bread, milk, cheese} if they are bought together
• Frequent patterns: product combinations that are frequently purchased together by customers
• Frequent patterns: patterns (set of items, sequence, etc.) that occur frequently in a database [AIS93]
Why Frequent Patterns?
• What products were often purchased together?
• What are the frequent subsequent purchases after buying an iPod?
• What kinds of genes are sensitive to this new drug?
• What key-word combinations are frequently associated with web pages about game-evaluation?
Why Frequent Pattern Mining?
• Foundation for many data mining tasks – Association rules, correlation, causality,
sequential patterns, spatial and multimedia patterns, associative classification, cluster analysis, iceberg cube, …
• Broad applications – Basket data analysis, cross-marketing, catalog design, sales campaign analysis, web log (click stream) analysis, …
Frequent Itemsets
• Itemset: a set of items – E.g., acm = {a, c, m}
• Support of itemsets – Sup(acm) = 3
• Given min_sup = 3, acm is a frequent pattern
• Frequent pattern mining: finding all frequent patterns in a database
TID | Items bought
100 | f, a, c, d, g, i, m, p
200 | a, b, c, f, l, m, o
300 | b, f, h, j, o
400 | b, c, k, s, p
500 | a, f, c, e, l, p, m, n

Transaction database TDB
A Naïve Attempt
• Generate all possible itemsets, test their supports against the database
• How to hold a large number of itemsets in main memory? – 100 items → 2^100 − 1 possible itemsets
• How to test the supports of a huge number of itemsets against a large database, say containing 100 million transactions? – A transaction of length 20 needs to update the support of 2^20 − 1 = 1,048,575 itemsets
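The arithmetic above can be checked directly; as a quick illustration (not from the slides), Python's arbitrary-precision integers handle 2^100 exactly:

```python
# Number of nonempty itemsets over 100 items: far too many to enumerate
n_itemsets = 2**100 - 1
print(n_itemsets > 1.2e30)   # an astronomically large search space

# Itemsets supported by a single transaction of length 20
per_transaction = 2**20 - 1
print(per_transaction)       # 1048575
```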
Transactions in Real Applications
• A large department store often carries more than 100 thousand different kinds of items – Amazon.com carries more than 17,000 books relevant to data mining
• Walmart has more than 20 million transactions per day; AT&T produces more than 275 million calls per day
• Mining large transaction databases of many items is a real demand
How to Get an Efficient Method?
• Reducing the number of itemsets that need to be checked
• Checking the supports of selected itemsets efficiently
Candidate Generation & Test
• Any subset of a frequent itemset must also be frequent – an anti-monotonic property
  – A transaction containing {beer, diaper, nuts} also contains {beer, diaper}
  – {beer, diaper, nuts} is frequent → {beer, diaper} must also be frequent
• In other words, any superset of an infrequent itemset must also be infrequent
  – No superset of any infrequent itemset should be generated or tested
  – Many item combinations can be pruned!
Apriori-Based Mining
• Generate length (k+1) candidate itemsets from length k frequent itemsets, and
• Test the candidates against DB
The Apriori Algorithm [AgSr94]
Database D (Min_sup = 2):
TID | Items
10 | a, c, d
20 | b, c, e
30 | a, b, c, e
40 | b, e

Scan D → 1-candidates: a:2, b:3, c:3, d:1, e:3
Frequent 1-itemsets: a:2, b:3, c:3, e:3
2-candidates: ab, ac, ae, bc, be, ce
Scan D → counting: ab:1, ac:2, ae:1, bc:2, be:3, ce:2
Frequent 2-itemsets: ac:2, bc:2, be:3, ce:2
3-candidates: bce
Scan D → counting: bce:2
Frequent 3-itemsets: bce:2
The Apriori Algorithm
Level-wise, candidate generation and test
• Ck: candidate itemsets of size k
• Lk: frequent itemsets of size k
• L1 = {frequent items};
• for (k = 1; Lk ≠ ∅; k++) do
  – Ck+1 = candidates generated from Lk   // candidate generation
  – for each transaction t in database do: increment the count of all candidates in Ck+1 that are contained in t   // test
  – Lk+1 = candidates in Ck+1 with min_support
• return ∪k Lk;
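As a rough sketch of the loop above (not the authors' implementation; itemsets are frozensets, and the join is the simple "union of two frequent k-itemsets" variant rather than the ordered self-join shown later):

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Level-wise Apriori: generate C_{k+1} from L_k, then count against the DB."""
    transactions = [frozenset(t) for t in transactions]
    # L1: frequent 1-itemsets
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Lk = {s for s, c in counts.items() if c >= min_sup}
    result = {s: c for s, c in counts.items() if c >= min_sup}
    k = 1
    while Lk:
        # candidate generation: union pairs of frequent k-itemsets into (k+1)-sets
        candidates = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        # pruning: drop any candidate with an infrequent k-subset
        candidates = {c for c in candidates
                      if all(frozenset(s) in Lk for s in combinations(c, k))}
        # test: one scan of the database counts all surviving candidates
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        Lk = {c for c, n in counts.items() if n >= min_sup}
        result.update((c, n) for c, n in counts.items() if n >= min_sup)
        k += 1
    return result
```

Running it on the slide's database D with min_sup = 2 yields exactly the frequent itemsets of the walkthrough, ending with bce:2.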
Important Steps in Apriori
• How to find frequent 1- and 2-itemsets?
• How to generate candidates?
  – Step 1: self-joining Lk
  – Step 2: pruning
• How to count supports of candidates?
Finding Frequent 1- & 2-itemsets
• Finding frequent 1-itemsets (i.e., frequent items) using a one-dimensional array
  – Initialize c[item] = 0 for each item
  – For each transaction T, for each item in T, c[item]++
  – If c[item] >= min_sup, item is frequent
• Finding frequent 2-itemsets using a 2-dimensional triangle matrix
  – For items i, j (i < j), c[i, j] is the count of itemset ij
Counting Array
• A 2-dimensional triangle matrix can be implemented using a 1-dimensional array
The triangle matrix for n = 5 items (entry (i, j), i < j, holds the 1-D array cell number):

    j=2 j=3 j=4 j=5
i=1   1   2   3   4
i=2       5   6   7
i=3           8   9
i=4              10

There are n items. For items i, j (i < j): c[i, j] = c[(i−1)(2n−i)/2 + (j−i)]
Example: c[3, 5] = c[(3−1)(2·5−3)/2 + (5−3)] = c[9]
The array cells are numbered 1, 2, …, 10.
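The index formula can be sketched in Python (the function names `tri_index` and `count_pairs` are illustrative, not from the slides); the assertion-checked example reproduces c[3, 5] = c[9]:

```python
def tri_index(i, j, n):
    """1-based index into a length-n(n-1)/2 array for the pair (i, j), i < j,
    using the slide's formula c[i,j] = c[(i-1)(2n-i)/2 + (j-i)]."""
    assert 1 <= i < j <= n
    return (i - 1) * (2 * n - i) // 2 + (j - i)

def count_pairs(transactions, items):
    """Count all 2-itemsets in one database pass using the 1-D triangle array."""
    n = len(items)
    pos = {item: idx + 1 for idx, item in enumerate(items)}  # 1-based item ids
    c = [0] * (n * (n - 1) // 2 + 1)                         # slot 0 unused
    for t in transactions:
        ids = sorted(pos[x] for x in t if x in pos)
        for a in range(len(ids)):
            for b in range(a + 1, len(ids)):
                c[tri_index(ids[a], ids[b], n)] += 1
    return c
```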
Example of Candidate-generation
• L3 = {abc, abd, acd, ace, bcd}
• Self-joining: L3 * L3
  – abcd ← abc * abd
  – acde ← acd * ace
• Pruning:
  – acde is removed because ade is not in L3
• C4 = {abcd}
How to Generate Candidates?
• Suppose the items in Lk-1 are listed in an order
• Step 1: self-join Lk-1
  INSERT INTO Ck
  SELECT p.item1, p.item2, …, p.itemk-1, q.itemk-1
  FROM Lk-1 p, Lk-1 q
  WHERE p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
• Step 2: pruning
  – For each itemset c in Ck do
    • For each (k−1)-subset s of c do: if (s is not in Lk-1) then delete c from Ck
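The join-then-prune step can be sketched in Python (a sketch with sorted tuples as itemsets; `gen_candidates` is an illustrative name):

```python
from itertools import combinations

def gen_candidates(Lk_minus_1, k):
    """Apriori-gen: self-join L_{k-1} on the first k-2 items, then prune any
    candidate with a (k-1)-subset outside L_{k-1}."""
    Lset = set(Lk_minus_1)
    Ck = set()
    for p in Lk_minus_1:
        for q in Lk_minus_1:
            # join condition: share the first k-2 items, p's last item precedes q's
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                Ck.add(p + (q[-1],))
    # pruning: every (k-1)-subset must itself be frequent
    return {c for c in Ck
            if all(s in Lset for s in combinations(c, k - 1))}
```

On the slide's example L3 = {abc, abd, acd, ace, bcd}, the join produces abcd and acde; pruning removes acde (ade ∉ L3), leaving C4 = {abcd}.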
How to Count Supports?
• Why is counting supports of candidates a problem?
  – The total number of candidates can be very huge
  – One transaction may contain many candidates
• Method
  – Candidate itemsets are stored in a hash-tree
  – A leaf node of the hash-tree contains a list of itemsets and counts
  – An interior node contains a hash table
  – Subset function: finds all the candidates contained in a transaction
Example: Counting Supports
[Figure: a hash-tree storing candidate 3-itemsets. The hash function maps items 1,4,7 / 2,5,8 / 3,6,9 to the three branches; leaves hold candidates such as {1 4 5}, {1 2 4}, {3 5 6}. For transaction 1 2 3 5 6, the subset function recursively splits it as 1 + (2 3 5 6), 1 2 + (3 5 6), 1 3 + (5 6), …, visiting only the subtrees that may contain candidates occurring in the transaction]
Apriori in SQL
• Impossible to get good performance out of pure SQL (SQL-92) based approaches alone – Support counting is costly
• Make use of object-relational extensions like UDFs, BLOBs, Table functions etc. – Get orders of magnitude improvement
• S. Sarawagi, S. Thomas, and R. Agrawal, 1998
Challenges of Freq Pat Mining
• Multiple scans of the transaction database
• Huge number of candidates
• Tedious workload of support counting for candidates
Improving Apriori: Ideas
• Reducing the number of transaction database scans
• Shrinking the number of candidates • Facilitating support counting of candidates
DIC: Reducing Number of Scans

[Itemset lattice: {} — A, B, C, D — AB, AC, AD, BC, BD, CD — ABC, ABD, ACD, BCD — ABCD]

• Once both A and D are determined frequent, the counting of AD can begin
• Once all length-2 subsets of BCD are determined frequent, the counting of BCD can begin

[Figure: Apriori finishes counting all 1-itemsets before starting 2-itemsets; DIC starts counting 2- and 3-itemsets as soon as their subsets are known to be frequent]

• S. Brin, R. Motwani, J. Ullman, and S. Tsur, SIGMOD'97. DIC: Dynamic Itemset Counting
DHP: Reducing # of Candidates
• A hashing bucket count < min_sup → every candidate in the bucket is infrequent
  – Candidates: a, b, c, d, e
  – Hash entries: {ab, ad, ae}, {bd, be, de}, …
  – Large 1-itemsets: a, b, d, e
  – The sum of counts of {ab, ad, ae} < min_sup → ab should not be a candidate 2-itemset
• J. Park, M. Chen, and P. Yu, SIGMOD'95
  – DHP: Direct Hashing and Pruning
A 2-Scan Method by Partitioning
• Partition the database into n partitions, such that each partition can be held in main memory
• Itemset X is frequent → X must be frequent in at least one partition
  – Scan 1: partition the database and find local frequent patterns
  – Scan 2: consolidate global frequent patterns
• Can all local frequent itemsets be held in main memory? A sometimes too strong assumption
• A. Savasere, E. Omiecinski, and S. Navathe, VLDB'95
Sampling for Frequent Patterns
• Select a sample of the original database, mine frequent patterns in the sample using Apriori
• Scan database once more to verify frequent itemsets found in the sample, only borders of closure of frequent patterns are checked – Example: check abcd instead of ab, ac, …, etc.
• Scan database again to find missed frequent patterns
• H. Toivonen, VLDB’96
Eclat/MaxEclat and VIPER
• Tid-list: the list of transaction-ids containing an itemset
  – Vertical data format
• Major operation: intersections of tid-lists
• Compression of tid-lists
  – Itemset A: t1, t2, t3; sup(A) = 3
  – Itemset B: t2, t3, t4; sup(B) = 3
  – Itemset AB: t2, t3; sup(AB) = 2
• M. Zaki et al., 1997
• P. Shenoy et al., 2000
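The vertical-format operation above amounts to one set intersection per candidate; a minimal sketch (illustrative function name):

```python
def support_by_tidlist(tid_x, tid_y):
    """sup(XY) = |t(X) ∩ t(Y)|: one intersection, no database rescan."""
    return len(set(tid_x) & set(tid_y))

# The slide's example: t(A) = {t1, t2, t3}, t(B) = {t2, t3, t4}
t_A = ['t1', 't2', 't3']   # sup(A) = 3
t_B = ['t2', 't3', 't4']   # sup(B) = 3
```

Here `support_by_tidlist(t_A, t_B)` gives sup(AB) = 2, matching t(AB) = {t2, t3}.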
Bottleneck of Freq Pattern Mining
• Multiple database scans are costly
• Mining long patterns needs many scans and generates many candidates
  – To find frequent itemset i1i2…i100
    • # of scans: 100
    • # of candidates: C(100,1) + C(100,2) + … + C(100,100) = 2^100 − 1 ≈ 1.27 × 10^30
• Bottleneck: candidate-generation-and-test
• Can we avoid candidate generation?
Search Space of Freq. Pat. Mining
• Itemsets form a lattice

[Itemset lattice: {} — A, B, C, D — AB, AC, AD, BC, BD, CD — ABC, ABD, ACD, BCD — ABCD]
Set Enumeration Tree
• Use an order on items, enumerate itemsets in lexicographic order – a, ab, abc, abcd, abd, ac, acd, ad, b, bc, bcd, bd, c, cd, d
• Reduce a lattice to a tree

[Set enumeration tree: ∅ — a, b, c, d — ab, ac, ad, bc, bd, cd — abc, abd, acd, bcd — abcd]
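A depth-first walk of the set-enumeration tree visits each itemset exactly once, extending a prefix only with items that come after its last element; as a sketch (illustrative function name):

```python
def enumerate_itemsets(items):
    """DFS over the set-enumeration tree of `items` (assumed ordered);
    returns all nonempty itemsets in lexicographic order, each once."""
    out = []
    def dfs(prefix, start):
        for i in range(start, len(items)):
            s = prefix + (items[i],)
            out.append(s)
            dfs(s, i + 1)   # only extend with later items: no duplicates
    dfs((), 0)
    return out
```

For items a, b, c, d this yields the 15 itemsets a, ab, abc, abcd, abd, ac, acd, ad, b, bc, bcd, bd, c, cd, d.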
Borders of Frequent Itemsets
• Frequent itemsets are connected
  – ∅ is trivially frequent
  – X on the border → every subset of X is frequent

[Set enumeration tree: ∅ — a, b, c, d — ab, ac, ad, bc, bd, cd — abc, abd, acd, bcd — abcd]
Projected Databases
• To test whether Xy is frequent, we can use the X-projected database
  – The sub-database of transactions containing X
  – Check whether item y is frequent in the X-projected database

[Set enumeration tree: ∅ — a, b, c, d — ab, ac, ad, bc, bd, cd — abc, abd, acd, bcd — abcd]
Compress Database by FP-tree
• The 1st scan: find frequent items
  – Only record frequent items in the FP-tree
  – F-list: f-c-a-b-m-p
• The 2nd scan: construct the tree
  – Order the frequent items in each transaction w.r.t. the f-list
  – Explore sharing among transactions

[FP-tree figure: root → f:4 → c:3 → a:3 → m:2 → p:2, with side branches a:3 → b:1 → m:1 and f:4 → b:1, plus root → c:1 → b:1 → p:1; a header table links all nodes of each item f, c, a, b, m, p]

TID | Items bought | (ordered) frequent items
100 | f, a, c, d, g, i, m, p | f, c, a, m, p
200 | a, b, c, f, l, m, o | f, c, a, b, m
300 | b, f, h, j, o | f, b
400 | b, c, k, s, p | c, b, p
500 | a, f, c, e, l, p, m, n | f, c, a, m, p
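The two-scan construction can be sketched in Python. This is an illustration, not the paper's implementation; in particular, ties in support are broken alphabetically here, so the computed f-list may order equally frequent items differently than the slide's f-c-a-b-m-p:

```python
from collections import Counter, defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent, self.count = item, parent, 1
        self.children = {}

def build_fptree(transactions, min_sup):
    """Scan 1: find frequent items and order them by descending support (f-list).
    Scan 2: insert each transaction's frequent items in f-list order, sharing
    common prefixes; header lists play the role of the node side-links."""
    freq = Counter(item for t in transactions for item in t)
    flist = [i for i, c in sorted(freq.items(), key=lambda x: (-x[1], x[0]))
             if c >= min_sup]
    rank = {item: r for r, item in enumerate(flist)}
    root = FPNode(None, None)
    header = defaultdict(list)            # item -> all of its nodes in the tree
    for t in transactions:
        node = root
        for item in sorted((x for x in set(t) if x in rank), key=rank.get):
            if item in node.children:
                node.children[item].count += 1
            else:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
            node = node.children[item]
    return root, header, flist
```

On the slide's five transactions with min_sup = 3, the counts along each item's header list sum to that item's support (f: 4, p: 3, etc.), as in the figure.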
Benefits of FP-tree
• Completeness
  – Never breaks a long pattern in any transaction
  – Preserves complete information for frequent pattern mining
  – No need to scan the database anymore
• Compactness
  – Reduces irrelevant info: infrequent items are removed
  – Items in frequency descending order (f-list): the more frequently occurring, the more likely to be shared
  – Never larger than the original database (not counting node-links and the count fields)
Partitioning Frequent Patterns
• Frequent patterns can be partitioned into subsets according to the f-list: f-c-a-b-m-p
  – Patterns containing p
  – Patterns having m but no p
  – …
  – Patterns having c but none of a, b, m, or p
  – Pattern f
• Depth-first search of a set enumeration tree
  – The partitioning is complete and does not have any overlap
Find Patterns Having Item "p"
• Only transactions containing p are needed
• Form the p-projected database
  – Starting at entry p of the header table
  – Follow the side-link of frequent item p
  – Accumulate all transformed prefix paths of p

[Same FP-tree and header table as above]

p-projected database TDB|p: fcam:2, cb:1
Local frequent item: c:3
Frequent patterns containing p: p:3, pc:3
Find Pat Having Item m But No p
• Form the m-projected database TDB|m
  – Item p is excluded (why?)
  – Contains fca:2, fcab:1
  – Local frequent items: f, c, a
• Build an FP-tree for TDB|m

[m-projected FP-tree: root → f:3 → c:3 → a:3, with header table f, c, a]
Recursive Mining
• Patterns having m but no p can be mined recursively
• Optimization: enumerate patterns from a single-branch FP-tree
  – Enumerate all combinations
  – Support = that of the last item
    • m, fm, cm, am
    • fcm, fam, cam
    • fcam

[m-projected FP-tree: root → f:3 → c:3 → a:3]
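The single-branch enumeration above can be sketched with `itertools.combinations` (an illustration; the function name is hypothetical). Each combination of branch items, appended to the current suffix, is a frequent pattern whose support is the count of its deepest item:

```python
from itertools import combinations

def single_path_patterns(path, suffix):
    """All patterns from a single-path FP-tree. `path` is a list of
    (item, count) pairs from the root downward; `suffix` is the item(s)
    already fixed by the projection (here 'm')."""
    items = [i for i, _ in path]
    counts = dict(path)
    patterns = {}
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            # support = count of the deepest chosen item; for the empty
            # combination, the count of the top item (size of projected DB)
            last = combo[-1] if combo else items[0]
            patterns[combo + (suffix,)] = counts[last]
    return patterns
```

For the m-projected tree f:3 → c:3 → a:3 this yields the eight patterns m, fm, cm, am, fcm, fam, cam, fcam, each with support 3.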
Enumerate Patterns From a Single Prefix of FP-tree
• A (projected) FP-tree may have a single prefix
  – Reduce the single prefix into one node
  – Join the mining results of the two parts

[Figure: a tree whose single prefix path a1:n1 → a2:n2 → a3:n3 is followed by branches b1:m1 and c1:k1, c2:k2, c3:k3 is split into two parts — the prefix part (root → a1:n1 → a2:n2 → a3:n3, reduced to a single node r1) and the branching part (root r with the subtrees b1:m1, c1:k1, c2:k2, c3:k3) — whose mining results are then joined]
FP-growth
• Pattern-growth: recursively grow frequent patterns by pattern and database partitioning
• Algorithm
  – For each frequent item, construct its projected database, and then its projected FP-tree
  – Repeat the process on each newly created projected FP-tree
  – Until the resulting FP-tree is empty, or contains only one path – a single path generates all the combinations of its items, each of which is a frequent pattern
Scaling up by DB Projection
• What if an FP-tree cannot fit into memory? • Database projection
– Partition a database into a set of projected databases
– Construct and mine FP-tree once the projected database can fit into main memory
• Heuristic: Projected database shrinks quickly in many applications
Parallel vs. Partition Projection
• Parallel projection: form all projected databases at a time
• Partition projection: propagate projections

Tran. DB: fcamp, fcabm, fb, cbp, fcamp
  p-proj DB: fcam, cb, fcam
  m-proj DB: fcab, fca, fca
  b-proj DB: f, cb, …
  a-proj DB: fc, …
  c-proj DB: f, …
  f-proj DB: …
    am-proj DB: fc, fc, fc
    cm-proj DB: f, f, f
    …
Why Is FP-growth Efficient?
• Divide-and-conquer strategy
  – Decompose both the mining task and the DB
  – Lead to focused search of smaller databases
• Other factors
  – No candidate generation nor candidate test
  – Database compression using FP-tree
  – No repeated scan of the entire database
  – Basic operations: counting local frequent items and building FP-trees; no pattern search nor pattern matching
Major Costs in FP-growth
• Poor locality of FP-trees – Low hit rate of cache
• Building FP-trees – A stack of FP-trees
• Redundant information – Transaction abcd appears in a-, ab-, abc-, ac-, …, c- projected databases and FP-trees
Improving Locality
• Store FP-trees in a pre-order depth-first traversal list
• Ghoting et al., VLDB'05
H-Mine
• Goal: efficient in various settings
  – Dense vs. sparse, huge vs. memory-based data sets
• Moderate in space requirement
• Highlights
  – Effective and efficient memory-based structure and mining algorithm
  – Scalable algorithm for mining large databases by proper partitioning
  – Integration of H-mine and FP-growth
H-Structure
• Store frequent-item projections in main memory
  – Items in a transaction are sorted according to the f-list
  – Each frequent item in a transaction is stored with two fields: item-id and hyper-link
  – Header table H links transactions with the same first item
• Scan the database once

Tid | Items | Freq-item projection
100 | c, d, e, f, g, i | c, d, e, g
200 | a, c, d, e, m | a, c, d, e
300 | a, b, d, e, g, k | a, d, e, g
400 | a, c, d, h | a, c, d

F-list = a-c-d-e-g

[Figure: header table H (a:3, c:3, d:4, e:3, g:2) with hyper-links chaining the frequent-item projections of transactions 100–400]
Find Patterns Containing Item “a”
• Only search the a-projected database: transactions containing "a"
• The a-queue links all transactions in the a-projected database
  – Can be traversed efficiently

[Same H-struct figure as above; the a-queue chains the projections of transactions 200, 300, and 400]
Mining the a-Projected Database
• Build the a-header table Ha
• Traverse the a-queue once, find all local frequent items within the a-projected database
  – Local frequent items: c, d, and e
  – Patterns: ac, ad, and ae
• Link transactions having the same next frequent item

[Figure: header table Ha (c:2, d:3, e:2) built over the a-projected database, alongside the original header table H]
Why Is H-Mine(Mem) Efficient?
• No candidate generation
  – It is a pattern-growth method
• Search confined in a dedicated space
  – Does not physically construct memory structures for projected databases
  – The H-struct is used for all the mining
  – Information about projected databases is collected in header tables
• No frequent patterns stored in main memory
Mining in Large Databases
• What if the H-struct is too big for memory?
• Find global frequent items
• Partition the database into n parts
  – The H-struct for each part can be held in memory
  – Mine local patterns in each part using H-mine(Mem)
    • Use a relative minimum support threshold
• Consolidate global patterns in the third scan
How to Partition in H-mine?
• Partitioning in H-mine is straightforward
  – Overhead of header tables in H-mine(Mem) is small and predictable
  – Partitioning with Apriori is not easy
    • Hard to predict the space requirement of Apriori
• Global frequent items prune many local patterns in skewed partitions
Mining Dense Projected DB’s
• Challenges in dense datasets
  – Long patterns
  – Some patterns appearing in many transactions
• After projection, projected databases become denser
• Advantages of FP-tree
  – Compresses dense databases many times
  – No candidate generation
  – Sub-patterns can be enumerated from long patterns
• Build FP-trees for dense projected databases
  – Empirical switching point: 1%
Advantages of H-Mine
• Very small space overhead
• Absorbs nice features of FP-growth
• Creates no physical projected database
• Watches the density of projected databases, turns to FP-growth when necessary
• Proposes space-preserving mining
  – Scalable in very large databases
  – Feasible even with very small memory
  – Goes beyond frequent pattern mining
Further Developments
• OP – opportunistic projection (LPWH02) – Opportunistically choose between array-based
and tree-based representations of projected databases
• Diffsets for vertical mining (ZaGo03) – Only record the differences in the tids of a
candidate pattern from its generating frequent patterns
Effectiveness of Freq Pat Mining
• Too many patterns!
  – A pattern a1a2…an contains 2^n − 1 subpatterns
  – Understanding many patterns is difficult or even impossible for human users
• Non-focused mining
  – A manager may only be interested in patterns involving some items (s)he manages
  – A user is often interested in patterns satisfying some constraints
Itemset Lattice

[Itemset lattice: {} — A, B, C, D — AB, AC, AD, BC, BD, CD — ABC, ABD, ACD, BCD — ABCD]

Tid | transaction
10 | ABD
20 | ABC
30 | AD
40 | ABCD
50 | CD

Min_sup = 2

Length | Frequent itemsets
1 | A, B, C, D
2 | AB, AC, AD, BC, BD, CD
3 | ABC, ABD
Max-Patterns

[Itemset lattice: {} — A, B, C, D — AB, AC, AD, BC, BD, CD — ABC, ABD, ACD, BCD — ABCD]

Tid | transaction
10 | ABD
20 | ABC
30 | AD
40 | ABCD
50 | CD

Min_sup = 2

Length | Frequent itemsets
1 | A, B, C, D
2 | AB, AC, AD, BC, BD, CD
3 | ABC, ABD

Max-patterns (maximal frequent itemsets) here: ABC, ABD, and CD
Borders and Max-patterns
• Max-patterns: borders of frequent patterns
  – Any subset of a max-pattern is frequent
  – Any superset of a max-pattern is infrequent
  – Cannot generate rules

[Itemset lattice: {} — A, B, C, D — AB, AC, AD, BC, BD, CD — ABC, ABD, ACD, BCD — ABCD]
MaxMiner: Mining Max-patterns
• 1st scan: find frequent items
  – A, B, C, D, E
• 2nd scan: find supports for potential max-patterns
  – AB, AC, AD, AE, ABCDE
  – BC, BD, BE, BCDE
  – CD, CE, CDE
  – DE
• Since BCDE is a max-pattern, there is no need to check BCD, BDE, CDE in a later scan
• Bayardo, SIGMOD'98

Tid | Items
10 | A, B, C, D, E
20 | B, C, D, E
30 | A, C, D, F

Min_sup = 2
Patterns and Support Counts

[Itemset lattice with support counts: {} — A:4, B:4, C:3, D:4 — AB:3, AC:2, AD:3, BC:2, BD:2, CD:2 — ABC:2, ABD:2, ACD, BCD — ABCD]

Tid | transaction
10 | ABD
20 | ABC
30 | AD
40 | ABCD
50 | CD

Min_sup = 2

Len | Frequent itemsets
1 | A:4, B:4, C:3, D:4
2 | AB:3, AC:2, AD:3, BC:2, BD:2, CD:2
3 | ABC:2, ABD:2
Frequent Closed Patterns
• For a frequent itemset X, if there exists no item y ∉ X s.t. every transaction containing X also contains y, then X is a frequent closed pattern
  – "acdf" is a frequent closed pattern
• Concise representation of frequent patterns
  – Can generate non-redundant rules
• Reduces # of patterns and rules
• N. Pasquier et al., ICDT'99

TID | Items
10 | a, c, d, e, f
20 | a, b, e
30 | c, e, f
40 | a, c, d, f
50 | c, e, f

Min_sup = 2
CLOSET for Frequent Closed Patterns
• F-list: list of all frequent items in support-ascending order
  – F-list: d-a-f-e-c
• Divide the search space
  – Patterns having d
  – Patterns having a but no d, etc.
• Find frequent closed patterns recursively
  – Every transaction having d also has c, f, a → cfad is a frequent closed pattern
• PHM'00

TID | Items
10 | a, c, d, e, f
20 | a, b, e
30 | c, e, f
40 | a, c, d, f
50 | c, e, f

Min_sup = 2
The CHARM Method
• Use vertical data format: t(AB) = {T1, T12, …}
• Derive closed patterns based on vertical intersections
  – t(X) = t(Y): X and Y always happen together
  – t(X) ⊂ t(Y): a transaction having X always has Y
• Use diffsets to accelerate mining
  – Only keep track of differences of tids
  – t(X) = {T1, T2, T3}, t(Xy) = {T1, T3}
  – Diffset(Xy, X) = {T2}
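The diffset idea above is a single set difference; a minimal sketch (illustrative function name):

```python
def diffset(tid_x, tid_xy):
    """Diffset(Xy, X) = t(X) − t(Xy); then sup(Xy) = sup(X) − |Diffset(Xy, X)|.
    Diffsets are often much smaller than the tid-lists themselves."""
    return set(tid_x) - set(tid_xy)
```

On the slide's example, `diffset({'T1','T2','T3'}, {'T1','T3'})` is {T2}, so sup(Xy) = 3 − 1 = 2.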
Closed and Max-patterns
• Closed pattern mining algorithms can be adapted to mine max-patterns – A max-pattern must be closed
• Depth-first search methods have advantages over breadth-first search ones – Why?
Condensed Freq Pattern Base
• Practical observation: in many applications, a good approximation of the support count could be good enough
  – Support = 10000 → support in range 10000 ± 1%
• Making frequent pattern mining more realistic
  – A small deviation has a minor effect on analysis
  – A condensed FP-base leads to more effective mining
  – Computing a condensed FP-base may lead to more efficient mining
Condensed FP-base Mining
• Compute a condensed FP-base with a guaranteed maximal error bound.
• Given: a transaction database, a user-specified support threshold, and a user-specified error bound
• Find a subset of frequent patterns & a function – Determine whether a pattern is frequent – Determine the support range
• Pei et al. ICDM’02
An Example

[Figure: a sample transaction database and a condensed FP-base computed with support threshold min_sup = 1 and error bound k = 2]

Another Base

[Figure: an alternative condensed FP-base for the same database, with the same min_sup = 1 and error bound k = 2]
Approximation Functions
• NOT unique
  – Different condensed FP-bases have different approximation functions
• Optimization on space requirement
  – The less space required, the better the compression effect
  – Compression ratio
Constraint-based Data Mining
• Find all the patterns in a database autonomously?
  – The patterns could be too many but not focused!
• Data mining should be interactive
  – User directs what to be mined
• Constraint-based mining
  – User flexibility: provides constraints on what to be mined
  – System optimization: push constraints for efficient mining
Constraints in Data Mining
• Knowledge type constraint – classification, association, etc.
• Data constraint — using SQL-like queries – find product pairs sold together in stores in New York
• Dimension/level constraint – in relevance to region, price, brand, customer category
• Rule (or pattern) constraint – small sales (price < $10) triggers big sales (sum >$200)
• Interestingness constraint – strong rules: support and confidence
Constrained Mining vs. Search
• Constrained mining vs. constraint-based search
  – Both aim at reducing the search space
  – Finding all patterns vs. some (or one) answers satisfying the constraints
  – Constraint-pushing vs. heuristic search
  – An interesting research problem on integrating both
• Constrained mining vs. DBMS query processing
  – Database query processing requires finding all answers
  – Constrained pattern mining shares a similar philosophy as pushing selections deeply into query processing
Optimization
• Mining frequent patterns with constraint C
  – Sound: only find patterns satisfying constraint C
  – Complete: find all patterns satisfying constraint C
• A naïve solution
  – Constraint test as a post-processing step
• More efficient approaches
  – Analyze the properties of constraints
  – Push constraints as deeply as possible into frequent pattern mining
Anti-Monotonicity
• Anti-monotonicity
  – If an itemset S violates the constraint, so does any of its supersets
  – sum(S.Price) ≤ v is anti-monotone
  – sum(S.Price) ≥ v is not anti-monotone
• Example
  – C: range(S.profit) ≤ 15
  – Itemset ab violates C
  – So does every superset of ab

TID | Transaction
10 | a, b, c, d, f
20 | b, c, d, f, g, h
30 | a, c, d, e, f
40 | c, e, f, g

TDB (min_sup = 2)

Item | Profit
a | 40
b | 0
c | −20
d | 10
e | −30
f | 30
g | 20
h | −10
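The pruning test for the example constraint can be sketched directly (illustrative function name; profits from the table above). Since range(ab.profit) = 40 − 0 = 40 > 15, ab and its entire subtree are pruned:

```python
profit = {'a': 40, 'b': 0, 'c': -20, 'd': 10, 'e': -30, 'f': 30, 'g': 20, 'h': -10}

def violates_range(itemset, v=15):
    """range(S.profit) <= v is anti-monotone: adding items can only widen
    max - min, so once an itemset violates it, every superset does too."""
    vals = [profit[i] for i in itemset]
    return max(vals) - min(vals) > v
```

By contrast, af has range 40 − 30 = 10 and survives the test.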
Anti-monotonic Constraints

Constraint | Anti-monotone?
v ∈ S | no
S ⊇ V | no
S ⊆ V | yes
min(S) ≤ v | no
min(S) ≥ v | yes
max(S) ≤ v | yes
max(S) ≥ v | no
count(S) ≤ v | yes
count(S) ≥ v | no
sum(S) ≤ v (a ∈ S, a ≥ 0) | yes
sum(S) ≥ v (a ∈ S, a ≥ 0) | no
range(S) ≤ v | yes
range(S) ≥ v | no
avg(S) θ v, θ ∈ {=, ≤, ≥} | convertible
support(S) ≥ ξ | yes
support(S) ≤ ξ | no
Monotonicity
• Monotonicity
  – If an itemset S satisfies the constraint, so does any of its supersets
  – sum(S.Price) ≥ v is monotone
  – min(S.Price) ≤ v is monotone
• Example
  – C: range(S.profit) ≥ 15
  – Itemset ab satisfies C
  – So does every superset of ab

[Same TDB (min_sup = 2) and item-profit table as on the Anti-Monotonicity slide]
Monotonic Constraints

Constraint | Monotone?
v ∈ S | yes
S ⊇ V | yes
S ⊆ V | no
min(S) ≤ v | yes
min(S) ≥ v | no
max(S) ≤ v | no
max(S) ≥ v | yes
count(S) ≤ v | no
count(S) ≥ v | yes
sum(S) ≤ v (a ∈ S, a ≥ 0) | no
sum(S) ≥ v (a ∈ S, a ≥ 0) | yes
range(S) ≤ v | no
range(S) ≥ v | yes
avg(S) θ v, θ ∈ {=, ≤, ≥} | convertible
support(S) ≥ ξ | no
support(S) ≤ ξ | yes
Converting “Tough” Constraints
• Convert tough constraints into anti-monotone or monotone ones by properly ordering items
• Examine C: avg(S.profit) ≥ 25
  – Order items in value-descending order
    • <a, f, g, d, b, h, c, e>
  – If an itemset afb violates C
    • So does afbh, afb*
    • It becomes anti-monotone!

[Same TDB (min_sup = 2) and item-profit table as on the Anti-Monotonicity slide]
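The converted check can be sketched as a prefix test (illustrative function name; profits from the table). With items enumerated in value-descending order, any item appended to a prefix is no larger than the items already in it, so an average that has fallen below the threshold can never recover; avg(afb.profit) = (40 + 30 + 0)/3 ≈ 23.3 < 25, so afb and all its extensions are pruned:

```python
profit = {'a': 40, 'f': 30, 'g': 20, 'd': 10, 'b': 0, 'h': -10, 'c': -20, 'e': -30}

def prefix_violates_avg(prefix, v=25):
    """Under the value-descending enumeration order, avg(S.profit) >= v
    behaves anti-monotonically on prefixes: later items only lower the average."""
    vals = [profit[i] for i in prefix]
    return sum(vals) / len(vals) < v
```

The prefix af (average 35) passes and is extended; afb fails and its subtree is cut.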
Convertible Constraints
• Let R be an order of items • Convertible anti-monotone
– If an itemset S violates a constraint C, so does every itemset having S as a prefix w.r.t. R
– Ex. avg(S) ≤ v w.r.t. item value descending order • Convertible monotone
– If an itemset S satisfies constraint C, so does every itemset having S as a prefix w.r.t. R
– Ex. avg(S) ≥ v w.r.t. item value descending order
Strongly Convertible Constraints
• avg(X) ≥ 25 is convertible anti-monotone w.r.t. item-value-descending order R: <a, f, g, d, b, h, c, e>
  – If itemset af violates the constraint C, so does every itemset with af as a prefix, such as afd
• avg(X) ≥ 25 is convertible monotone w.r.t. item-value-ascending order R⁻¹: <e, c, h, b, d, g, f, a>
  – If itemset d satisfies the constraint C, so do itemsets df and dfa, which have d as a prefix
• Thus, avg(X) ≥ 25 is strongly convertible

Item | Profit
a | 40
b | 0
c | −20
d | 10
e | −30
f | 30
g | 20
h | −10
Convertible Constraints

Constraint | Convertible anti-monotone | Convertible monotone | Strongly convertible
avg(S) ≤ v, ≥ v | yes | yes | yes
median(S) ≤ v, ≥ v | yes | yes | yes
sum(S) ≤ v (items of any value, v ≥ 0) | yes | no | no
sum(S) ≤ v (items of any value, v ≤ 0) | no | yes | no
sum(S) ≥ v (items of any value, v ≥ 0) | no | yes | no
sum(S) ≥ v (items of any value, v ≤ 0) | yes | no | no
Can Apriori Handle Convertible Constraints?
• A convertible constraint that is neither monotone, anti-monotone, nor succinct cannot be pushed deep into an Apriori mining algorithm
  – Within the level-wise framework, no direct pruning based on the constraint can be made
  – Itemset df violates constraint C: avg(X) ≥ 25
  – Since adf satisfies C, Apriori needs df to assemble adf; df cannot be pruned
• But it can be pushed into the frequent-pattern growth framework!

Item | Value
a | 40
b | 0
c | −20
d | 10
e | −30
f | 30
g | 20
h | −10
Mining With Convertible Constraints • C: avg(S.profit) ≥ 25 • List the items in every transaction in
value descending order R: – <a, f, g, d, b, h, c, e> – C is convertible anti-monotone w.r.t. R
• Scan the transaction DB once – remove infrequent items
• Item h in transaction 40 is dropped
– Itemsets a and f are good

TID  Transaction
10   a, f, d, b, c
20   f, g, d, b, c
30   a, f, d, c, e
40   f, g, h, c, e
TDB (min_sup=2)
Item Profit
a    40
f    30
g    20
d    10
b    0
h    -10
c    -20
e    -30
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 84
Not Every Pattern Is Interesting!
• Trivial patterns – Pregnant → Female [100% confidence]
• Misleading patterns – Play basketball → eat cereal [40%, 66.7%]
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 85
Basketball Not basketball Sum (row)
Cereal 2000 1750 3750
Not cereal 1000 250 1250
Sum(col.) 3000 2000 5000
Evaluation Criteria
• Objective interestingness measures – Examples: support, patterns formed by mutually
independent items – Domain independent
• Subjective measures – Examples: domain knowledge, templates/
constraints
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 86
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 87
Correlation and Lift
• P(B|A)/P(B) is called the lift of rule A → B
• Play basketball → eat cereal (lift: 0.89) • Play basketball → not eat cereal (lift: 1.33)
corr(A,B) = P(A ∪ B) / (P(A) P(B)) = P(AB) / (P(A) P(B))
Basketball Not basketball Sum (row)
Cereal 2000 1750 3750
Not cereal 1000 250 1250
Sum(col.) 3000 2000 5000
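The two lift values quoted above can be recomputed directly from the contingency table. A small sketch (variable names are illustrative; the counts are the table's):

```python
# Counts from the 2x2 basketball/cereal table: 5000 transactions in total.
n = 5000
sup_b, sup_c, sup_bc = 3000, 3750, 2000   # basketball, cereal, both

p_b, p_c, p_bc = sup_b / n, sup_c / n, sup_bc / n
lift_b_c = p_bc / (p_b * p_c)                   # basketball -> eat cereal
lift_b_nc = (p_b - p_bc) / (p_b * (1 - p_c))    # basketball -> not eat cereal
```

This reproduces lift ≈ 0.89 for the misleading rule and ≈ 1.33 for its negation.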
Contingency table

Table 6.7. A 2-way contingency table for variables A and B.

       B      ~B
A      f11    f10    f1+
~A     f01    f00    f0+
       f+1    f+0    N
counts tabulated in a contingency table. Table 6.7 shows an example of a contingency table for a pair of binary variables, A and B. We use the notation ~A (~B) to indicate that A (B) is absent from a transaction. Each entry fij in this 2 × 2 table denotes a frequency count. For example, f11 is the number of times A and B appear together in the same transaction, while f01 is the number of transactions that contain B but not A. The row sum f1+ represents the support count for A, while the column sum f+1 represents the support count for B. Finally, even though our discussion focuses mainly on asymmetric binary variables, note that contingency tables are also applicable to other attribute types such as symmetric binary, nominal, and ordinal variables.
Limitations of the Support-Confidence Framework  The existing association rule mining formulation relies on the support and confidence measures to eliminate uninteresting patterns. The drawback of support was previously described in Section 6.8, in which many potentially interesting patterns involving low support items might be eliminated by the support threshold. The drawback of confidence is more subtle and is best demonstrated with the following example.
Example 6.3. Suppose we are interested in analyzing the relationship between people who drink tea and coffee. We may gather information about the beverage preferences among a group of people and summarize their responses into a table such as the one shown in Table 6.8.
Table 6.8. Beverage preferences among a group of 1000 people.

        Coffee   ~Coffee
Tea     150      50       200
~Tea    650      150      800
        800      200      1000
Property of Lift
• If A and B are independent, lift = 1 • If A and B are positively correlated, lift > 1 • If A and B are negatively correlated, lift < 1 • Limitation: lift is sensitive to P(A) and P(B)
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 88
Table 6.9. Contingency tables for the word pairs {p, q} and {r, s}.

       p     ~p                r     ~r
q      880   50    930    s    20    50    70
~q     50    20    70     ~s   50    880   930
       930   70    1000        70    930   1000
This equation follows from the standard approach of using simple fractions as estimates for probabilities. The fraction f11/N is an estimate for the joint probability P(A, B), while f1+/N and f+1/N are the estimates for P(A) and P(B), respectively. If A and B are statistically independent, then P(A, B) = P(A) × P(B), thus leading to the formula shown in Equation 6.6. Using Equations 6.5 and 6.6, we can interpret the measure as follows:
I(A, B)  = 1, if A and B are independent;
         > 1, if A and B are positively correlated;
         < 1, if A and B are negatively correlated.   (6.7)
For the tea-coffee example shown in Table 6.8, I = 0.15 / (0.2 × 0.8) = 0.9375, thus suggesting a slight negative correlation between tea drinkers and coffee drinkers.
Limitations of Interest Factor  We illustrate the limitation of interest factor with an example from the text mining domain. In the text domain, it is reasonable to assume that the association between a pair of words depends on the number of documents that contain both words. For example, because of their stronger association, we expect the words data and mining to appear together more frequently than the words compiler and mining in a collection of computer science articles.
Table 6.9 shows the frequency of occurrences between two pairs of words, {p, q} and {r, s}. Using the formula given in Equation 6.5, the interest factor for {p, q} is 1.02 and for {r, s} is 4.08. These results are somewhat troubling for the following reasons. Although p and q appear together in 88% of the documents, their interest factor is close to 1, which is the value when p and q are statistically independent. On the other hand, the interest factor for {r, s} is higher than {p, q} even though r and s seldom appear together in the same document. Confidence is perhaps the better choice in this situation because it considers the association between p and q (94.6%) to be much stronger than that between r and s (28.6%).
lift(p, q) < lift(r, s)!
Leverage
• The difference between the observed and expected joint probability of XY assuming X and Y are independent
• An “absolute” measure of the surprisingness of a rule – Should be used together with lift
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 89
leverage(X → Y) = P(XY) − P(X) P(Y)
Conviction
• The expected error of a rule
• Considers not only the joint distribution of X and Y
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 90
conv(X → Y) = P(X) P(~Y) / P(X~Y) = 1 / lift(X → ~Y)
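Leverage and conviction can both be evaluated on the basketball/cereal table from the earlier slide. A hedged sketch (names are illustrative):

```python
# Probabilities from the 5000-transaction basketball/cereal table.
p_x, p_y = 3000 / 5000, 3750 / 5000   # P(basketball), P(cereal)
p_xy = 2000 / 5000                    # P(basketball and cereal)

# Leverage: observed minus expected joint probability; negative here,
# i.e. fewer co-occurrences than independence would predict.
leverage = p_xy - p_x * p_y

# Conviction: P(X)P(~Y) / P(X~Y); note P(X~Y) = P(X) - P(XY).
conviction = (p_x * (1 - p_y)) / (p_x - p_xy)
```

Both measures agree with lift here: the rule basketball → cereal is worse than independence (leverage < 0, conviction < 1).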
16
Odds Ratio
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 91
odds(Y | X) = (P(XY)/P(X)) / (P(X~Y)/P(X)) = P(XY) / P(X~Y)

odds(Y | ~X) = (P(~XY)/P(~X)) / (P(~X~Y)/P(~X)) = P(~XY) / P(~X~Y)

oddsratio(X → Y) = odds(Y | X) / odds(Y | ~X) = (P(XY) · P(~X~Y)) / (P(X~Y) · P(~XY))
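On a 2x2 table with fixed n the probabilities cancel, leaving the familiar cross-product ratio. A hedged sketch using the basketball/cereal counts (names are illustrative):

```python
# Cell counts: XY, X~Y, ~XY, ~X~Y for X = basketball, Y = cereal.
a, b, c, d = 2000, 1000, 1750, 250

odds_given_x = a / b            # odds(Y | X)
odds_given_notx = c / d         # odds(Y | ~X)
odds_ratio = odds_given_x / odds_given_notx   # equals (a*d) / (b*c)
```

An odds ratio below 1 again signals the negative association between playing basketball and eating cereal.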
χ2
• Suppose attribute A has c distinct values a1, …, ac and attribute B has r distinct values b1, …, br
• The χ2 value (Pearson's χ2 statistic) is
– oij and eij are the observed frequency and the expected frequency of the joint event (ai, bj), respectively
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 92
χ2 = Σ_{i=1}^{c} Σ_{j=1}^{r} (oij − eij)² / eij
Example
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 93
Basketball Not basketball Sum (row)
Cereal 2000 1750 3750
Not cereal 1000 250 1250
Sum(col.) 3000 2000 5000
• The χ2 value (277.8) is far larger than 0, the value under independence • count(basketball, cereal) = 2000 <
expected count (2250) → playing basketball and eating cereal are negatively correlated
χ2 = (2000 − 2250)²/2250 + (1750 − 1500)²/1500 + (1000 − 750)²/750 + (250 − 500)²/500 = 277.8
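The χ2 computation above can be sketched generically: expected counts come from the row and column marginals. A minimal sketch (names are illustrative):

```python
# Pearson chi-square for the 2x2 basketball/cereal table.
observed = [[2000, 1750], [1000, 250]]    # rows: cereal / not cereal
row = [sum(r) for r in observed]          # [3750, 1250]
col = [sum(c) for c in zip(*observed)]    # [3000, 2000]
n = sum(row)

# e_ij = row_i * col_j / n; sum (o - e)^2 / e over all cells.
chi2 = sum((observed[i][j] - row[i] * col[j] / n) ** 2 / (row[i] * col[j] / n)
           for i in range(2) for j in range(2))
```

The four terms are 27.8, 41.7, 83.3, and 125, matching the slide's total of 277.8.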
Φ-coefficient
• –1: if A and B are perfectly negatively correlated • 1: if A and B are perfectly positively correlated • 0: if A and B are statistically independent • Drawback: the Φ-coefficient puts the same
weight on co-occurrence and co-absence
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 94
Φ = (P(AB) P(~A~B) − P(A~B) P(~AB)) / sqrt(P(A) P(B) P(~A) P(~B))
IS Measure
• Biased toward frequent co-occurrence • Equivalent to cosine similarity for binary
variables (bit vectors) • The geometric mean of the confidences of the two rules between a pair of
binary random variables
• Drawback: the value depends on P(A) and P(B) – A similar drawback to lift
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 95
IS(A, B) = sqrt(lift(A, B) · P(A, B)) = P(A, B) / sqrt(P(A) P(B))

IS(A, B) = sqrt( (P(A, B) / P(A)) · (P(A, B) / P(B)) ) = sqrt( conf(A → B) · conf(B → A) )
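The two forms of IS are algebraically identical, which is easy to confirm numerically. A hedged sketch on the basketball/cereal probabilities (names are illustrative):

```python
import math

# IS computed two equivalent ways: cosine form and geometric
# mean of the two rule confidences.
p_a, p_b, p_ab = 0.6, 0.75, 0.4   # basketball, cereal, both

is_cosine = p_ab / math.sqrt(p_a * p_b)
is_confs = math.sqrt((p_ab / p_a) * (p_ab / p_b))
```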
More Measures
• All confidence: min{ P(A|B), P(B|A) } • Max confidence: max{ P(A|B), P(B|A) } • The Kulczynski measure: ½ (P(A|B) + P(B|A))
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 96
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 97
Comparing Measures

            Milk   No Milk   Sum (row)
Coffee      mc     ~mc       c
No Coffee   m~c    ~m~c      ~c
Sum (col.)  m      ~m        Σ

Contingency table

Transaction databases and their contingency tables
Table 6.9: Comparison of six pattern evaluation measures using contingency tables for a variety of data sets.

Data Set  mc      ~mc     m~c      ~m~c     χ2     lift   all conf.  max conf.  Kulc.  cosine
D1        10,000  1,000   1,000    100,000  90557  9.26   0.91       0.91       0.91   0.91
D2        10,000  1,000   1,000    100      0      1      0.91       0.91       0.91   0.91
D3        100     1,000   1,000    100,000  670    8.44   0.09       0.09       0.09   0.09
D4        1,000   1,000   1,000    100,000  24740  25.75  0.5        0.5        0.5    0.5
D5        1,000   100     10,000   100,000  8173   9.18   0.09       0.91       0.5    0.29
D6        1,000   10      100,000  100,000  965    1.97   0.01       0.99       0.5    0.10
positively associated in both data sets by producing a measure value of 0.91. However, lift and χ2 generate dramatically different measure values for D1 and D2 due to their sensitivity to ~m~c. In fact, in many real-world scenarios ~m~c is usually huge and unstable. For example, in a market basket database, the total number of transactions could fluctuate on a daily basis and overwhelmingly exceed the number of transactions containing any particular itemset. Therefore, a good interestingness measure should not be affected by transactions that do not contain the itemsets of interest; otherwise, it would generate unstable results, as illustrated in D1 and D2.

Similarly, in D3, the four new measures correctly show that m and c are strongly negatively associated because the mc to m ratio equals the mc to c ratio, that is, 100/1,100 = 9.1%. However, lift and χ2 both contradict this in an incorrect way: their values for D3 are between those for D1 and D2.

For data set D4, both lift and χ2 indicate a highly positive association between m and c, whereas the others indicate a "neutral" association because the mc : ~mc ratio equals the mc : m~c ratio, which is 1. This means that if a customer buys coffee (or milk), the probability that she will also purchase milk (or coffee) is exactly 50%.

"Why are lift and χ2 so poor at distinguishing pattern association relationships in the above transactional data sets?" To answer this, we have to consider the null-transactions. A null-transaction is a transaction that does not contain any of the itemsets being examined. In our example, ~m~c represents the number of null-transactions. Lift and χ2 have difficulty distinguishing interesting pattern association relationships because they are both strongly influenced by ~m~c. Typically, the number of null-transactions can outweigh the number of individual purchases, because many people may buy neither milk nor coffee. On the other hand, the other four measures are good indicators of interesting pattern associations because their definitions remove the influence of ~m~c (that is, they are not influenced by the number of null-transactions).

The above discussion shows that it is highly desirable to have a measure whose value is independent of the number of null-transactions. A measure is null-invariant if its value is free from the influence of null-transactions.
χ2 and lift do not perform well on those data sets, since they are sensitive to ~m~c
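The null-invariance argument can be demonstrated on D1 and D2 from the table, where only the ~m~c cell differs. A hedged sketch (function and variable names are illustrative):

```python
# Lift is sensitive to null-transactions (~m~c); Kulczynski is not.
def measures(mc, m_nc, nm_c, nm_nc):
    n = mc + m_nc + nm_c + nm_nc
    sup_m, sup_c = mc + m_nc, mc + nm_c
    lift = (mc / n) / ((sup_m / n) * (sup_c / n))
    kulc = 0.5 * (mc / sup_m + mc / sup_c)
    return lift, kulc

lift1, kulc1 = measures(10_000, 1_000, 1_000, 100_000)   # D1
lift2, kulc2 = measures(10_000, 1_000, 1_000, 100)       # D2: only ~m~c changes
```

Lift swings from 9.26 to 1 while Kulczynski stays at 0.91, reproducing the table's D1/D2 rows.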
Imbalance Ratio
• Assess the imbalance of two itemsets A and B in rule implications
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 98
IR(A, B) = |P(A) − P(B)| / (P(A) + P(B) − P(A ∪ B))
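The imbalance ratio can be computed directly from support counts, since the probabilities share a common denominator. A hedged sketch using the D1 and D5 rows of the comparison table (names are illustrative; sup_ab denotes the number of transactions containing both itemsets):

```python
# IR from support counts: |sup(A) - sup(B)| / (sup(A) + sup(B) - sup(AB)).
def imbalance_ratio(sup_a, sup_b, sup_ab):
    return abs(sup_a - sup_b) / (sup_a + sup_b - sup_ab)

ir_d1 = imbalance_ratio(11_000, 11_000, 10_000)  # D1: balanced supports -> 0
ir_d5 = imbalance_ratio(1_100, 11_000, 1_000)    # D5: strongly imbalanced
```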
Properties of Measures
• Symmetry: is M(A → B) = M(B → A)?
• Null-transaction dependence (null addition invariance): is ~A~B used in the measure?
• Inversion invariance: the value does not change if f11 and f10 are exchanged with f00 and f01
• Scaling invariance: does the measure remain unchanged if the contingency table [f11, f10, f01, f00] is changed to [k1k3f11, k2k3f10, k1k4f01, k2k4f00]?
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 99
Measuring 3 Random Variables
• 3 dimensional contingency table
• For a k-itemset {i1, i2, …, ik}, the condition for statistical independence is
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 100
Table 6.18. Example of a three-dimensional contingency table.

        c                           ~c
        b      ~b                   b      ~b
a       f111   f101   f1+1     a    f110   f100   f1+0
~a      f011   f001   f0+1     ~a   f010   f000   f0+0
        f+11   f+01   f++1          f+10   f+00   f++0
A marginal count such as f1+1 is the number of transactions that contain a and c, irrespective of whether b is present in the transaction.
Given a k-itemset {i1, i2, …, ik}, the condition for statistical independence can be stated as follows:

f_{i1 i2 … ik} = (f_{i1+…+} × f_{+i2…+} × … × f_{++…ik}) / N^(k−1).   (6.12)
With this definition, we can extend objective measures such as interest factor and PS, which are based on deviations from statistical independence, to more than two variables:

I = (N^(k−1) × f_{i1 i2 … ik}) / (f_{i1+…+} × f_{+i2…+} × … × f_{++…ik})

PS = f_{i1 i2 … ik} / N − (f_{i1+…+} × f_{+i2…+} × … × f_{++…ik}) / N^k
Another approach is to define the objective measure as the maximum, minimum, or average value for the associations between pairs of items in a pattern. For example, given a k-itemset X = {i1, i2, …, ik}, we may define the φ-coefficient for X as the average φ-coefficient between every pair of items (ip, iq) in X. However, because the measure considers only pairwise associations, it may not capture all the underlying relationships within a pattern.
Analysis of multidimensional contingency tables is more complicated because of the presence of partial associations in the data. For example, some associations may appear or disappear when conditioned upon the value of certain variables. This problem is known as Simpson's paradox and is described in the next section. More sophisticated statistical techniques are available to analyze such relationships, e.g., loglinear models, but these techniques are beyond the scope of this book.
f_{i1 i2 … ik} = (f_{i1+…+} · f_{+i2…+} · … · f_{++…ik}) / N^(k−1)
Measuring More Random Variables
• Some measures, such as lift and PS, which are based on statistical independence, can be extended
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 101
I = (N^(k−1) · f_{i1 i2 … ik}) / (f_{i1+…+} · f_{+i2…+} · … · f_{++…ik})

PS = f_{i1 i2 … ik} / N − (f_{i1+…+} · f_{+i2…+} · … · f_{++…ik}) / N^k
Simpson’s Paradox
• A trend that appears in different groups of data disappears when these groups are combined, and the reverse trend appears for the aggregate data – Also known as Yule-Simpson effect – Often encountered in social-science and
medical-science statistics – Particularly confounding when frequency data
are unduly given causal interpretations
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 102
Kidney Stone Treatment Example
• Which treatment, A or B, is better?
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 103
              Treatment A        Treatment B
Small stones  G1: 81/87 = 93%    G2: 234/270 = 87%
Large stones  G3: 192/263 = 73%  G4: 55/80 = 69%
Overall       273/350 = 78%      289/350 = 83%
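The reversal in this table is easy to verify programmatically. A hedged sketch using the counts above (names are illustrative):

```python
# Kidney-stone example: (successes, patients) per treatment and group.
a_small, a_large = (81, 87), (192, 263)    # Treatment A
b_small, b_large = (234, 270), (55, 80)    # Treatment B

def rate(successes, total):
    return successes / total

# A wins within each stone-size group ...
a_beats_b_small = rate(*a_small) > rate(*b_small)   # 93% vs 87%
a_beats_b_large = rate(*a_large) > rate(*b_large)   # 73% vs 69%
# ... yet B wins on the aggregate, because the group sizes differ:
a_overall = rate(81 + 192, 87 + 263)                # 273/350 = 78%
b_overall = rate(234 + 55, 270 + 80)                # 289/350 = 83%
```

Treatment A was given mostly to the harder (large-stone) cases, which drags its aggregate rate down: that is Simpson's paradox.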
Berkeley Gender Bias Case
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 104
         Applicants  Admitted
Men      8442        44%
Women    4321        35%

Department  Men: Applicants  Admitted  Women: Applicants  Admitted
A           825              62%       108                82%
B           560              63%       25                 68%
C           325              37%       593                34%
D           417              33%       375                35%
E           191              28%       393                24%
F           272              6%        341                7%
Fisher Exact Test
• Directly test whether a rule X → Y is productive by comparing its confidence with those of its generalizations W → Y, where W is a subset of X – Let X = W ∪ Z
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 105
W        Y       Not Y
Z        a       b       a + b
Not Z    c       d       c + d
         a + c   b + d   sup(W)

a = sup(WZY) = sup(XY), b = sup(WZ~Y) = sup(X~Y)
c = sup(W~ZY), d = sup(W~Z~Y)
Marginals
• Row marginals
• Column marginals
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 106
W        Y       Not Y
Z        a       b       a + b
Not Z    c       d       c + d
         a + c   b + d   sup(W)

a + b = sup(WZ) = sup(X), c + d = sup(W~Z)
a + c = sup(WY), b + d = sup(W~Y)
oddsratio = ( (a/(a+b)) / (b/(a+b)) ) / ( (c/(c+d)) / (d/(c+d)) ) = ad / bc
Hypothesis
• H0: Z and Y are independent given W – X → Y is not productive given W → Y
• If Z and Y are independent, then
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 107
a = (a + b)(a + c) / n,  b = (a + b)(b + d) / n
c = (c + d)(a + c) / n,  d = (c + d)(b + d) / n

W        Y       Not Y
Z        a       b       a + b
Not Z    c       d       c + d
         a + c   b + d   sup(W)

oddsratio = ad / bc = 1
Relation between a and b, c, d
• Assumption: the row and column marginals are fixed
• The value of a uniquely determines b, c, and d
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 108
W        Y       Not Y
Z        a       b       a + b
Not Z    c       d       c + d
         a + c   b + d   sup(W)
Probability Mass Function of a
• The probability mass function of observing the value of a in the contingency table is given by the hypergeometric distribution – the probability of choosing s successes in t
trials using sampling without replacement from a finite population of size T that has S successes in total
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 109
P(s | t, S, T) = ( C(S, s) · C(T − S, t − s) ) / C(T, t)
Probability Mass Function of a
• An occurrence of Z is a success • T = sup(W) = n • W always occurs, so the total number of
successes is sup(WZ) → S = a + b, t = a + c
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 110
P(a | a + c, a + b, n) = ( C(a+b, a) · C(n − (a+b), (a+c) − a) ) / C(n, a+c)
                       = ( C(a+b, a) · C(c+d, c) ) / C(n, a+c)
                       = ( (a+b)! (c+d)! (a+c)! (b+d)! ) / ( n! a! b! c! d! )
Calculating the p-value
• Assuming that the null hypothesis is true, the p-value is the probability of obtaining a test statistic at least as extreme as the one actually observed
• If p-value is very small (e.g., 0.01), the null hypothesis can be rejected
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 111
p-value(a) = Σ_{i=0}^{min(b,c)} P(a + i | a + c, a + b, n)
           = Σ_{i=0}^{min(b,c)} ( (a+b)! (c+d)! (a+c)! (b+d)! ) / ( n! (a+i)! (b−i)! (c−i)! (d+i)! )
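The hypergeometric probability and the one-sided p-value can be implemented directly from these formulas. A hedged sketch (the 2x2 counts are an illustrative example, not from the slides):

```python
from math import comb

# P(a | a+c, a+b, n): hypergeometric probability of the observed table.
def hypergeom_p(a, b, c, d):
    n = a + b + c + d
    return comb(a + b, a) * comb(c + d, c) / comb(n, a + c)

# One-sided Fisher p-value: sum over tables at least as extreme
# (a increased by i, with all marginals held fixed).
def fisher_p_value(a, b, c, d):
    return sum(hypergeom_p(a + i, b - i, c - i, d + i) for i in range(min(b, c) + 1))

p = fisher_p_value(8, 2, 1, 5)   # illustrative table [[8, 2], [1, 5]]
```

A quick sanity check: over all tables sharing the same marginals, the probabilities must sum to 1.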
Permutation (Randomization) Test
• Determine the distribution of a given test statistic by randomly modifying the observed data several times to obtain a random sample of the data sets – The modified data sets are used for significance
testing • Compute the empirical probability mass
function (EPMF) • Generate the empirical cumulative
distribution function Jian Pei: Big Data Analytics -- Frequent Pattern Mining 112
Compute p-value on Statistics
• The empirical cumulative distribution function
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 113
F(x) = P(Θ ≤ x) = (1/k) Σ_{i=1}^{k} I(θi ≤ x)

p-value(θ) = 1 − F(θ)
Swap Randomization
• In a permutation test, what characteristics should be preserved in permutation?
• Swap randomization keeps the column and row marginals invariant – The support of each item does not change – The length of each transaction does not change
• Swap two items in two transactions
• Conduct a certain number of swaps to make up a new data set
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 114
Bootstrap Sampling
• A transaction database is just a sample from a larger population – What is the frequency (or range of possible
frequencies) of X in the underlying population? • Given a test assessment statistic θ, how can
we infer the confidence interval for the possible values of θ at a desired confidence level α?
• Bootstrap sampling: sampling with replacement
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 115
Calculating Statistic Range
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 116
F(x) = P(Θ ≤ x) = (1/k) Σ_{i=1}^{k} I(θi ≤ x)

Let v_{(1−α)/2} = F^{−1}((1−α)/2) and v_{(1+α)/2} = F^{−1}((1+α)/2). Then

P(Θ ∈ [v_{(1−α)/2}, v_{(1+α)/2}]) = F((1+α)/2) − F((1−α)/2) = (1+α)/2 − (1−α)/2 = α

Thus, the α confidence interval for the test statistic Θ is [v_{(1−α)/2}, v_{(1+α)/2}]
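The bootstrap interval above can be sketched end to end: resample the database with replacement, recompute the statistic each time, and take the empirical quantiles. A hedged sketch on a toy database (all names and data are illustrative):

```python
import random

# Bootstrap confidence interval for the support of item 'a' (alpha = 0.90).
random.seed(42)
db = [{'a', 'b'}, {'a'}, {'b', 'c'}, {'a', 'c'}, {'a', 'b', 'c'}] * 20  # 100 transactions

def support(sample, item):
    return sum(item in t for t in sample) / len(sample)

k, alpha = 1000, 0.90
# k bootstrap resamples, each drawn with replacement, same size as db.
thetas = sorted(support(random.choices(db, k=len(db)), 'a') for _ in range(k))
lo = thetas[int(k * (1 - alpha) / 2)]        # empirical v_{(1-alpha)/2}
hi = thetas[int(k * (1 + alpha) / 2) - 1]    # empirical v_{(1+alpha)/2}
```

The point estimate sup('a') = 0.8 sits inside the resulting interval, which narrows as the database grows.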
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 117
From Itemsets to Sequences • Itemsets: combinations of items, no temporal order • Temporal order is important in many situations
– Time-series databases and sequence databases – Frequent patterns → (frequent) sequential patterns
• Applications of sequential pattern mining – Customer shopping sequences:
• First buy computer, then iPod, and then digital camera, within 3 months.
– Medical treatment, natural disasters, science and engineering processes, stocks and markets, telephone calling patterns, Web log clickthrough streams, DNA sequences and gene structures
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 118
What Is Sequential Pattern Mining?
• Given a set of sequences, find the complete set of frequent subsequences
A sequence database A sequence : < (ef) (ab) (df) c b >
An element may contain a set of items. Items within an element are unordered and we list them alphabetically. <a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>
Given support threshold min_sup =2, <(ab)c> is a sequential pattern
SID  sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 119
Challenges in Seq Pat Mining
• A huge number of possible sequential patterns are hidden in databases
• A mining algorithm should – Find the complete set of patterns satisfying the
minimum support (frequency) threshold – Be highly efficient, scalable, involving only a
small number of database scans – Be able to incorporate various kinds of user-
specific constraints
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 120
Apriori Property of Seq Patterns
• Apriori property in sequential patterns – If a sequence S is infrequent, then none of the
super-sequences of S is frequent – E.g., <hb> is infrequent → so are <hab> and
<(ah)b>
Given support threshold min_sup =2
Seq-id  Sequence
10      <(bd)cb(ac)>
20      <(bf)(ce)b(fg)>
30      <(ah)(bf)abf>
40      <(be)(ce)d>
50      <a(bd)bcb(ade)>
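Sequence containment (with elements as sets, in order) is the operation underlying this Apriori check. A hedged sketch in Python, not from the slides, using the sample database above (names are illustrative):

```python
# Greedy left-to-right matching: each pattern element must be a subset of
# some later element of the sequence, in order.
def is_subsequence(pattern, sequence):
    it = iter(sequence)
    return all(any(elem <= s for s in it) for elem in pattern)

db = [
    [{'b', 'd'}, {'c'}, {'b'}, {'a', 'c'}],
    [{'b', 'f'}, {'c', 'e'}, {'b'}, {'f', 'g'}],
    [{'a', 'h'}, {'b', 'f'}, {'a'}, {'b'}, {'f'}],
    [{'b', 'e'}, {'c', 'e'}, {'d'}],
    [{'a'}, {'b', 'd'}, {'b'}, {'c'}, {'b'}, {'a', 'd', 'e'}],
]

def sup(pattern):
    return sum(is_subsequence(pattern, s) for s in db)
```

On this database sup(<hb>) = 1 < min_sup = 2, so the Apriori property lets us skip every super-sequence such as <hab> and <(ah)b>.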
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 121
GSP
• GSP (Generalized Sequential Pattern) mining • Outline of the method
– Initially, every item in DB is a candidate of length-1 – For each level (i.e., sequences of length-k) do
• Scan database to collect support count for each candidate sequence
• Generate candidate length-(k+1) sequences from length-k frequent sequences using Apriori
– Repeat until no frequent sequence or no candidate can be found
• Major strength: Candidate pruning by Apriori
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 122
Finding Len-1 Seq Patterns
• Initial candidates – <a>, <b>, <c>, <d>, <e>, <f>, <g>,
<h> • Scan database once
– count support for candidates
min_sup =2
Cand  Sup
<a>   3
<b>   5
<c>   4
<d>   3
<e>   3
<f>   2
<g>   1
<h>   1

Seq-id  Sequence
10      <(bd)cb(ac)>
20      <(bf)(ce)b(fg)>
30      <(ah)(bf)abf>
40      <(be)(ce)d>
50      <a(bd)bcb(ade)>
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 123
Generating Length-2 Candidates

      <a>   <b>   <c>   <d>   <e>   <f>
<a>   <aa>  <ab>  <ac>  <ad>  <ae>  <af>
<b>   <ba>  <bb>  <bc>  <bd>  <be>  <bf>
<c>   <ca>  <cb>  <cc>  <cd>  <ce>  <cf>
<d>   <da>  <db>  <dc>  <dd>  <de>  <df>
<e>   <ea>  <eb>  <ec>  <ed>  <ee>  <ef>
<f>   <fa>  <fb>  <fc>  <fd>  <fe>  <ff>

      <a>   <b>     <c>     <d>     <e>     <f>
<a>         <(ab)>  <(ac)>  <(ad)>  <(ae)>  <(af)>
<b>                 <(bc)>  <(bd)>  <(be)>  <(bf)>
<c>                         <(cd)>  <(ce)>  <(cf)>
<d>                                 <(de)>  <(df)>
<e>                                         <(ef)>
<f>

51 length-2 candidates

Without the Apriori property, 8×8 + 8×7/2 = 92 candidates; Apriori prunes 44.57% of the candidates

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 124
Finding Len-2 Seq Patterns
• Scan database one more time, collect support count for each length-2 candidate
• There are 19 length-2 candidates which pass the minimum support threshold – They are length-2 sequential patterns
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 125
Generating Length-3 Candidates and Finding Length-3 Patterns • Generate Length-3 Candidates
– Self-join length-2 sequential patterns • <ab>, <aa> and <ba> are all length-2 sequential
patterns → <aba> is a length-3 candidate • <(bd)>, <bb> and <db> are all length-2 sequential
patterns → <(bd)b> is a length-3 candidate – 46 candidates are generated
• Find Length-3 Sequential Patterns – Scan database once more, collect support
counts for candidates – 19 out of 46 candidates pass support threshold
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 126
The GSP Mining Process
Candidate lattice by level:
<a> <b> <c> <d> <e> <f> <g> <h>
<aa> <ab> … <af> <ba> <bb> … <ff> <(ab)> … <(ef)>
<abb> <aab> <aba> <baa> <bab> …
<abba> <(bd)bc> …
<(bd)cba>

1st scan: 8 cand., 6 length-1 seq. pat.
2nd scan: 51 cand., 19 length-2 seq. pat., 10 cand. not in DB at all
3rd scan: 46 cand., 19 length-3 seq. pat., 20 cand. not in DB at all
4th scan: 8 cand., 6 length-4 seq. pat.
5th scan: 1 cand., 1 length-5 seq. pat.

Some candidates cannot pass the support threshold; others do not appear in the DB at all.

min_sup = 2

Seq-id  Sequence
10      <(bd)cb(ac)>
20      <(bf)(ce)b(fg)>
30      <(ah)(bf)abf>
40      <(be)(ce)d>
50      <a(bd)bcb(ade)>
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 127
The GSP Algorithm
• Take sequences of the form <x> as length-1 candidates
• Scan the database once, find F1, the set of length-1 sequential patterns
• Let k = 1; while Fk is not empty do – Form Ck+1, the set of length-(k+1) candidates, from Fk – If Ck+1 is not empty, scan the database once, find Fk+1, the
set of length-(k+1) sequential patterns – Let k = k + 1
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 128
Bottlenecks of GSP
• A huge set of candidates – 1,000 frequent length-1 sequences generate
1000 × 1000 + (1000 × 999)/2 = 1,499,500 length-2 candidates! • Multiple scans of the database in mining • Real challenge: mining long sequential
patterns – An exponential number of short candidates – A length-100 sequential pattern needs

Σ_{i=1}^{100} C(100, i) = 2^100 − 1 ≈ 10^30

candidate sequences!
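The two bottleneck figures follow from simple counting. A minimal sketch (names are illustrative):

```python
from math import comb

# 1,000 frequent single items: ordered pairs <xy> plus unordered
# same-element pairs <(xy)> give the length-2 candidate count.
n_items = 1000
len2_candidates = n_items * n_items + n_items * (n_items - 1) // 2

# A length-100 pattern has one candidate per non-empty subsequence
# of its items: sum of C(100, i) for i = 1..100, i.e. 2**100 - 1.
subseqs_of_len100 = sum(comb(100, i) for i in range(1, 101))
```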
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 129
FreeSpan: Freq Pat-projected Sequential Pattern Mining • The itemset of a seq pat must be frequent
– Recursively project a sequence database into a set of smaller databases based on the current set of frequent patterns
– Mine each projected database to find its patterns f_list: b:5, c:4, a:3, d:3, e:3, f:2
All seq. pat. can be divided into 6 subsets: • Seq. pat. containing item f • Those containing e but no f • Those containing d but no e nor f • Those containing a but no d, e or f • Those containing c but no a, d, e or f • Those containing only item b
Sequence Database SDB < (bd) c b (ac) > < (bf) (ce) b (fg) > < (ah) (bf) a b f > < (be) (ce) d > < a (bd) b c b (ade) >
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 130
From FreeSpan to PrefixSpan
• Freespan: – Projection-based: no candidate sequence needs
to be generated – But, projection can be performed at any point in
the sequence, and the projected sequences may not shrink much
• PrefixSpan – Projection-based – But only prefix-based projection: fewer
projections and quickly shrinking sequences
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 131
Prefix and Suffix (Projection)
• <a>, <aa>, <a(ab)> and <a(abc)> are prefixes of sequence <a(abc)(ac)d(cf)>
• Given sequence <a(abc)(ac)d(cf)>
Prefix  Suffix (Prefix-Based Projection)
<a>     <(abc)(ac)d(cf)>
<aa>    <(_bc)(ac)d(cf)>
<ab>    <(_c)(ac)d(cf)>
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 132
Mining Sequential Patterns by Prefix Projections • Step 1: find length-1 sequential patterns
– <a>, <b>, <c>, <d>, <e>, <f> • Step 2: divide search space. The complete
set of seq. pat. can be partitioned into 6 subsets: – The ones having prefix <a>; – The ones having prefix <b>; – … – The ones having prefix <f>
SID  sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 133
Finding Seq. Pat. with Prefix <a>
• Only need to consider projections w.r.t. <a> – <a>-projected database: <(abc)(ac)d(cf)>,
<(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc> • Find all the length-2 seq. pat. having prefix
<a>: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af> – Further partition into 6 subsets
• Having prefix <aa>; • … • Having prefix <af>
SID  sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>
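Prefix projection can be sketched for the simplified case of sequences of single items; the slides' "(_x)" notation for partially matched elements is deliberately omitted here, and the flattened strings are an illustrative stand-in for the real database:

```python
# Project a database onto a single-item prefix: keep, for each sequence
# containing the item, only the suffix after its first occurrence.
def project(db, prefix_item):
    projected = []
    for seq in db:
        if prefix_item in seq:
            projected.append(seq[seq.index(prefix_item) + 1:])
    return projected

db = ['abcacdcf', 'adcbcae', 'efabdfcb', 'egafcbc']  # elements flattened to letters
a_proj = project(db, 'a')
```

Each projected sequence is strictly shorter than its source, which is why PrefixSpan's projected databases shrink quickly.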
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 134
Completeness of PrefixSpan
SDB
SID  sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>

Length-1 sequential patterns: <a>, <b>, <c>, <d>, <e>, <f>

Having prefix <a>:
  <a>-projected database: <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
  Length-2 sequential patterns: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
    Having prefix <aa>: <aa>-proj. db … Having prefix <af>: <af>-proj. db
Having prefix <b>: <b>-projected database …
Having prefix <c>, …, <f>: …
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 135
Efficiency of PrefixSpan
• No candidate sequence needs to be generated
• Projected databases keep shrinking • Major cost of PrefixSpan: constructing
projected databases – Can be improved by bi-level projections
Effectiveness
• Redundancy due to anti-monotonicity – <abcd> alone leads to 15 sequential patterns of
the same support – Closed sequential patterns and sequential
generators • Constraints on sequential patterns
– Gap – Length – More sophisticated, application-oriented
constraints
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 136
Sequences and Partial Orders
Sequential patterns:
CHK → MMK → MORT → RESP
CHK → MMK → MORT → BROK
CHK → RRSP → MORT → RESP
CHK → RRSP → MORT → BROK
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 137
Why Frequent Orders?
• Frequent orders capture more thorough information than sequential patterns
• Many important applications – Bioinformatics: order-preserving clustering of microarray
data – Web mining and market basket analysis: modeling
customer purchase behaviors – Network management and intrusion detection: frequent
routing paths, signatures for intrusions – Preference-based services: partial orders from ranking
data
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 138
Why Mining Orders Difficult?
• Use sequential patterns to assemble frequent partial orders? – One frequent closed partial order may
summarize a few sequential patterns – Assembling can be costly
Sequential patterns:
CHK → MMK → MORT → RESP
CHK → MMK → MORT → BROK
CHK → RRSP → MORT → RESP
CHK → RRSP → MORT → BROK
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 139
Model • A sequence s induces a full order R1; if R1 ⊇ R2,
where R2 is a partial order, then R1 is said to support R2
• The support of a partial order R in a sequence database is the number of sequences supporting R in the database
• An order R is closed if there exists no order R' ⊃ R with sup(R) = sup(R')
• Given a minimum support threshold, order R is a frequent closed partial order if it is closed and passes the support threshold
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 140
Ideas
• Depth-first search to generate frequent closed partial orders in transitive reduction – Transitive reduction is a succinct representation
of partial orders • Pruning infrequent items, edges and partial
orders • Pruning forbidden edges • Extracting transitive reductions of frequent
partial orders directly Jian Pei: Big Data Analytics -- Frequent Pattern Mining 141
Interesting Orders
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 142