Frequent Pattern Mining
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 2
Transaction Data Analysis
• Transactions: customers’ purchases of commodities – {bread, milk, cheese} if they are bought together
• Frequent patterns: product combinations that are frequently purchased together by customers
• Frequent patterns: patterns (set of items, sequence, etc.) that occur frequently in a database [AIS93]
Why Frequent Patterns?
• What products were often purchased together?
• What are the frequent subsequent purchases after buying an iPod?
• What kinds of genes are sensitive to this new drug?
• What key-word combinations are frequently associated with web pages about game-evaluation?
Why Frequent Pattern Mining?
• Foundation for many data mining tasks – Association rules, correlation, causality,
sequential patterns, spatial and multimedia patterns, associative classification, cluster analysis, iceberg cube, …
• Broad applications – Basket data analysis, cross-marketing, catalog design, sales campaign analysis, web log (click stream) analysis, …
Frequent Itemsets
• Itemset: a set of items – E.g., acm = {a, c, m}
• Support of itemsets – Sup(acm) = 3
• Given min_sup = 3, acm is a frequent pattern
• Frequent pattern mining: finding all frequent patterns in a database
TID | Items bought
100 | f, a, c, d, g, i, m, p
200 | a, b, c, f, l, m, o
300 | b, f, h, j, o
400 | b, c, k, s, p
500 | a, f, c, e, l, p, m, n

Transaction database TDB
A Naïve Attempt
• Generate all possible itemsets, test their supports against the database
• How to hold a large number of itemsets in main memory? – 100 items → 2^100 − 1 possible itemsets
• How to test the supports of a huge number of itemsets against a large database, say containing 100 million transactions? – A transaction of length 20 needs to update the support of 2^20 − 1 = 1,048,575 itemsets
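The arithmetic above can be checked directly; as a quick illustration (not from the slides), Python's arbitrary-precision integers handle 2^100 exactly:

```python
# Number of nonempty itemsets over 100 items: far too many to enumerate
n_itemsets = 2**100 - 1
print(n_itemsets > 1.2e30)   # an astronomically large search space

# Itemsets supported by a single transaction of length 20
per_transaction = 2**20 - 1
print(per_transaction)       # 1048575
```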
Transactions in Real Applications
• A large department store often carries more than 100 thousand different kinds of items – Amazon.com carries more than 17,000 books relevant to data mining
• Walmart has more than 20 million transactions per day; AT&T produces more than 275 million calls per day
• Mining large transaction databases of many items is a real demand
How to Get an Efficient Method?
• Reducing the number of itemsets that need to be checked
• Checking the supports of selected itemsets efficiently
Candidate Generation & Test
• Any subset of a frequent itemset must also be frequent – an anti-monotonic property
  – A transaction containing {beer, diaper, nuts} also contains {beer, diaper}
  – {beer, diaper, nuts} is frequent → {beer, diaper} must also be frequent
• In other words, any superset of an infrequent itemset must also be infrequent
  – No superset of any infrequent itemset should be generated or tested
  – Many item combinations can be pruned!
Apriori-Based Mining
• Generate length (k+1) candidate itemsets from length k frequent itemsets, and
• Test the candidates against DB
The Apriori Algorithm [AgSr94]
Database D (Min_sup = 2):
TID | Items
10 | a, c, d
20 | b, c, e
30 | a, b, c, e
40 | b, e

Scan D → 1-candidates: a:2, b:3, c:3, d:1, e:3
Frequent 1-itemsets: a:2, b:3, c:3, e:3
2-candidates: ab, ac, ae, bc, be, ce
Scan D → counting: ab:1, ac:2, ae:1, bc:2, be:3, ce:2
Frequent 2-itemsets: ac:2, bc:2, be:3, ce:2
3-candidates: bce
Scan D → counting: bce:2
Frequent 3-itemsets: bce:2
The Apriori Algorithm
Level-wise, candidate generation and test
• Ck: candidate itemsets of size k
• Lk: frequent itemsets of size k
• L1 = {frequent items};
• for (k = 1; Lk ≠ ∅; k++) do
  – Ck+1 = candidates generated from Lk   // candidate generation
  – for each transaction t in database do: increment the count of all candidates in Ck+1 that are contained in t   // test
  – Lk+1 = candidates in Ck+1 with min_support
• return ∪k Lk;
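As a rough sketch of the loop above (not the authors' implementation; itemsets are frozensets, and the join is the simple "union of two frequent k-itemsets" variant rather than the ordered self-join shown later):

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Level-wise Apriori: generate C_{k+1} from L_k, then count against the DB."""
    transactions = [frozenset(t) for t in transactions]
    # L1: frequent 1-itemsets
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Lk = {s for s, c in counts.items() if c >= min_sup}
    result = {s: c for s, c in counts.items() if c >= min_sup}
    k = 1
    while Lk:
        # candidate generation: union pairs of frequent k-itemsets into (k+1)-sets
        candidates = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        # pruning: drop any candidate with an infrequent k-subset
        candidates = {c for c in candidates
                      if all(frozenset(s) in Lk for s in combinations(c, k))}
        # test: one scan of the database counts all surviving candidates
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        Lk = {c for c, n in counts.items() if n >= min_sup}
        result.update((c, n) for c, n in counts.items() if n >= min_sup)
        k += 1
    return result
```

Running it on the slide's database D with min_sup = 2 yields exactly the frequent itemsets of the walkthrough, ending with bce:2.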
Important Steps in Apriori
• How to find frequent 1- and 2-itemsets?
• How to generate candidates?
  – Step 1: self-joining Lk
  – Step 2: pruning
• How to count supports of candidates?
Finding Frequent 1- & 2-itemsets
• Finding frequent 1-itemsets (i.e., frequent items) using a one-dimensional array
  – Initialize c[item] = 0 for each item
  – For each transaction T, for each item in T, c[item]++
  – If c[item] >= min_sup, item is frequent
• Finding frequent 2-itemsets using a 2-dimensional triangle matrix
  – For items i, j (i < j), c[i, j] is the count of itemset ij
Counting Array
• A 2-dimensional triangle matrix can be implemented using a 1-dimensional array
The triangle matrix for n = 5 items (entry (i, j), i < j, holds the 1-D array cell number):

    j=2 j=3 j=4 j=5
i=1   1   2   3   4
i=2       5   6   7
i=3           8   9
i=4              10

There are n items. For items i, j (i < j): c[i, j] = c[(i−1)(2n−i)/2 + (j−i)]
Example: c[3, 5] = c[(3−1)(2·5−3)/2 + (5−3)] = c[9]
The array cells are numbered 1, 2, …, 10.
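The index formula can be sketched in Python (the function names `tri_index` and `count_pairs` are illustrative, not from the slides); the assertion-checked example reproduces c[3, 5] = c[9]:

```python
def tri_index(i, j, n):
    """1-based index into a length-n(n-1)/2 array for the pair (i, j), i < j,
    using the slide's formula c[i,j] = c[(i-1)(2n-i)/2 + (j-i)]."""
    assert 1 <= i < j <= n
    return (i - 1) * (2 * n - i) // 2 + (j - i)

def count_pairs(transactions, items):
    """Count all 2-itemsets in one database pass using the 1-D triangle array."""
    n = len(items)
    pos = {item: idx + 1 for idx, item in enumerate(items)}  # 1-based item ids
    c = [0] * (n * (n - 1) // 2 + 1)                         # slot 0 unused
    for t in transactions:
        ids = sorted(pos[x] for x in t if x in pos)
        for a in range(len(ids)):
            for b in range(a + 1, len(ids)):
                c[tri_index(ids[a], ids[b], n)] += 1
    return c
```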
Example of Candidate-generation
• L3 = {abc, abd, acd, ace, bcd}
• Self-joining: L3 * L3
  – abcd ← abc * abd
  – acde ← acd * ace
• Pruning:
  – acde is removed because ade is not in L3
• C4 = {abcd}
How to Generate Candidates?
• Suppose the items in Lk-1 are listed in an order
• Step 1: self-join Lk-1
  INSERT INTO Ck
  SELECT p.item1, p.item2, …, p.itemk-1, q.itemk-1
  FROM Lk-1 p, Lk-1 q
  WHERE p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
• Step 2: pruning
  – For each itemset c in Ck do
    • For each (k−1)-subset s of c do: if (s is not in Lk-1) then delete c from Ck
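The join-then-prune step can be sketched in Python (a sketch with sorted tuples as itemsets; `gen_candidates` is an illustrative name):

```python
from itertools import combinations

def gen_candidates(Lk_minus_1, k):
    """Apriori-gen: self-join L_{k-1} on the first k-2 items, then prune any
    candidate with a (k-1)-subset outside L_{k-1}."""
    Lset = set(Lk_minus_1)
    Ck = set()
    for p in Lk_minus_1:
        for q in Lk_minus_1:
            # join condition: share the first k-2 items, p's last item precedes q's
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                Ck.add(p + (q[-1],))
    # pruning: every (k-1)-subset must itself be frequent
    return {c for c in Ck
            if all(s in Lset for s in combinations(c, k - 1))}
```

On the slide's example L3 = {abc, abd, acd, ace, bcd}, the join produces abcd and acde; pruning removes acde (ade ∉ L3), leaving C4 = {abcd}.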
How to Count Supports?
• Why is counting supports of candidates a problem?
  – The total number of candidates can be very huge
  – One transaction may contain many candidates
• Method
  – Candidate itemsets are stored in a hash-tree
  – A leaf node of the hash-tree contains a list of itemsets and counts
  – An interior node contains a hash table
  – Subset function: finds all the candidates contained in a transaction
Example: Counting Supports
[Figure: a hash-tree storing candidate 3-itemsets. The hash function maps items 1,4,7 / 2,5,8 / 3,6,9 to the three branches; leaves hold candidates such as {1 4 5}, {1 2 4}, {3 5 6}. For transaction 1 2 3 5 6, the subset function recursively splits it as 1 + (2 3 5 6), 1 2 + (3 5 6), 1 3 + (5 6), …, visiting only the subtrees that may contain candidates occurring in the transaction]
Apriori in SQL
• Impossible to get good performance out of pure SQL (SQL-92) based approaches alone – Support counting is costly
• Make use of object-relational extensions like UDFs, BLOBs, Table functions etc. – Get orders of magnitude improvement
• S. Sarawagi, S. Thomas, and R. Agrawal, 1998
Challenges of Freq Pat Mining
• Multiple scans of the transaction database
• Huge number of candidates
• Tedious workload of support counting for candidates
Improving Apriori: Ideas
• Reducing the number of transaction database scans
• Shrinking the number of candidates • Facilitating support counting of candidates
DIC: Reducing Number of Scans

[Itemset lattice: {} — A, B, C, D — AB, AC, AD, BC, BD, CD — ABC, ABD, ACD, BCD — ABCD]

• Once both A and D are determined frequent, the counting of AD can begin
• Once all length-2 subsets of BCD are determined frequent, the counting of BCD can begin

[Figure: Apriori finishes counting all 1-itemsets before starting 2-itemsets; DIC starts counting 2- and 3-itemsets as soon as their subsets are known to be frequent]

• S. Brin, R. Motwani, J. Ullman, and S. Tsur, SIGMOD'97. DIC: Dynamic Itemset Counting
DHP: Reducing # of Candidates
• A hashing bucket count < min_sup → every candidate in the bucket is infrequent
  – Candidates: a, b, c, d, e
  – Hash entries: {ab, ad, ae}, {bd, be, de}, …
  – Large 1-itemsets: a, b, d, e
  – The sum of counts of {ab, ad, ae} < min_sup → ab should not be a candidate 2-itemset
• J. Park, M. Chen, and P. Yu, SIGMOD'95
  – DHP: Direct Hashing and Pruning
A 2-Scan Method by Partitioning
• Partition the database into n partitions, such that each partition can be held in main memory
• Itemset X is frequent → X must be frequent in at least one partition
  – Scan 1: partition the database and find local frequent patterns
  – Scan 2: consolidate global frequent patterns
• Can all local frequent itemsets be held in main memory? A sometimes too strong assumption
• A. Savasere, E. Omiecinski, and S. Navathe, VLDB'95
Sampling for Frequent Patterns
• Select a sample of the original database, mine frequent patterns in the sample using Apriori
• Scan database once more to verify frequent itemsets found in the sample, only borders of closure of frequent patterns are checked – Example: check abcd instead of ab, ac, …, etc.
• Scan database again to find missed frequent patterns
• H. Toivonen, VLDB’96
Eclat/MaxEclat and VIPER
• Tid-list: the list of transaction-ids containing an itemset
  – Vertical data format
• Major operation: intersections of tid-lists
• Compression of tid-lists
  – Itemset A: t1, t2, t3; sup(A) = 3
  – Itemset B: t2, t3, t4; sup(B) = 3
  – Itemset AB: t2, t3; sup(AB) = 2
• M. Zaki et al., 1997
• P. Shenoy et al., 2000
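The vertical-format operation above amounts to one set intersection per candidate; a minimal sketch (illustrative function name):

```python
def support_by_tidlist(tid_x, tid_y):
    """sup(XY) = |t(X) ∩ t(Y)|: one intersection, no database rescan."""
    return len(set(tid_x) & set(tid_y))

# The slide's example: t(A) = {t1, t2, t3}, t(B) = {t2, t3, t4}
t_A = ['t1', 't2', 't3']   # sup(A) = 3
t_B = ['t2', 't3', 't4']   # sup(B) = 3
```

Here `support_by_tidlist(t_A, t_B)` gives sup(AB) = 2, matching t(AB) = {t2, t3}.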
Bottleneck of Freq Pattern Mining
• Multiple database scans are costly
• Mining long patterns needs many scans and generates many candidates
  – To find frequent itemset i1i2…i100
    • # of scans: 100
    • # of candidates: C(100,1) + C(100,2) + … + C(100,100) = 2^100 − 1 ≈ 1.27 × 10^30
• Bottleneck: candidate-generation-and-test
• Can we avoid candidate generation?
Search Space of Freq. Pat. Mining
• Itemsets form a lattice

[Itemset lattice: {} — A, B, C, D — AB, AC, AD, BC, BD, CD — ABC, ABD, ACD, BCD — ABCD]
Set Enumeration Tree
• Use an order on items, enumerate itemsets in lexicographic order – a, ab, abc, abcd, abd, ac, acd, ad, b, bc, bcd, bd, c, cd, d
• Reduce a lattice to a tree

[Set enumeration tree: ∅ — a, b, c, d — ab, ac, ad, bc, bd, cd — abc, abd, acd, bcd — abcd]
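A depth-first walk of the set-enumeration tree visits each itemset exactly once, extending a prefix only with items that come after its last element; as a sketch (illustrative function name):

```python
def enumerate_itemsets(items):
    """DFS over the set-enumeration tree of `items` (assumed ordered);
    returns all nonempty itemsets in lexicographic order, each once."""
    out = []
    def dfs(prefix, start):
        for i in range(start, len(items)):
            s = prefix + (items[i],)
            out.append(s)
            dfs(s, i + 1)   # only extend with later items: no duplicates
    dfs((), 0)
    return out
```

For items a, b, c, d this yields the 15 itemsets a, ab, abc, abcd, abd, ac, acd, ad, b, bc, bcd, bd, c, cd, d.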
Borders of Frequent Itemsets
• Frequent itemsets are connected
  – ∅ is trivially frequent
  – X on the border → every subset of X is frequent

[Set enumeration tree: ∅ — a, b, c, d — ab, ac, ad, bc, bd, cd — abc, abd, acd, bcd — abcd]
Projected Databases
• To test whether Xy is frequent, we can use the X-projected database
  – The sub-database of transactions containing X
  – Check whether item y is frequent in the X-projected database

[Set enumeration tree: ∅ — a, b, c, d — ab, ac, ad, bc, bd, cd — abc, abd, acd, bcd — abcd]
Compress Database by FP-tree
• The 1st scan: find frequent items
  – Only record frequent items in the FP-tree
  – F-list: f-c-a-b-m-p
• The 2nd scan: construct the tree
  – Order the frequent items in each transaction w.r.t. the f-list
  – Explore sharing among transactions

[FP-tree figure: root → f:4 → c:3 → a:3 → m:2 → p:2, with side branches a:3 → b:1 → m:1 and f:4 → b:1, plus root → c:1 → b:1 → p:1; a header table links all nodes of each item f, c, a, b, m, p]

TID | Items bought | (ordered) frequent items
100 | f, a, c, d, g, i, m, p | f, c, a, m, p
200 | a, b, c, f, l, m, o | f, c, a, b, m
300 | b, f, h, j, o | f, b
400 | b, c, k, s, p | c, b, p
500 | a, f, c, e, l, p, m, n | f, c, a, m, p
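The two-scan construction can be sketched in Python. This is an illustration, not the paper's implementation; in particular, ties in support are broken alphabetically here, so the computed f-list may order equally frequent items differently than the slide's f-c-a-b-m-p:

```python
from collections import Counter, defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent, self.count = item, parent, 1
        self.children = {}

def build_fptree(transactions, min_sup):
    """Scan 1: find frequent items and order them by descending support (f-list).
    Scan 2: insert each transaction's frequent items in f-list order, sharing
    common prefixes; header lists play the role of the node side-links."""
    freq = Counter(item for t in transactions for item in t)
    flist = [i for i, c in sorted(freq.items(), key=lambda x: (-x[1], x[0]))
             if c >= min_sup]
    rank = {item: r for r, item in enumerate(flist)}
    root = FPNode(None, None)
    header = defaultdict(list)            # item -> all of its nodes in the tree
    for t in transactions:
        node = root
        for item in sorted((x for x in set(t) if x in rank), key=rank.get):
            if item in node.children:
                node.children[item].count += 1
            else:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
            node = node.children[item]
    return root, header, flist
```

On the slide's five transactions with min_sup = 3, the counts along each item's header list sum to that item's support (f: 4, p: 3, etc.), as in the figure.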
Benefits of FP-tree
• Completeness
  – Never breaks a long pattern in any transaction
  – Preserves complete information for frequent pattern mining
  – No need to scan the database anymore
• Compactness
  – Reduces irrelevant info: infrequent items are removed
  – Items in frequency descending order (f-list): the more frequently occurring, the more likely to be shared
  – Never larger than the original database (not counting node-links and the count fields)
Partitioning Frequent Patterns
• Frequent patterns can be partitioned into subsets according to the f-list: f-c-a-b-m-p
  – Patterns containing p
  – Patterns having m but no p
  – …
  – Patterns having c but none of a, b, m, or p
  – Pattern f
• Depth-first search of a set enumeration tree
  – The partitioning is complete and does not have any overlap
Find Patterns Having Item "p"
• Only transactions containing p are needed
• Form the p-projected database
  – Starting at entry p of the header table
  – Follow the side-link of frequent item p
  – Accumulate all transformed prefix paths of p

[Same FP-tree and header table as above]

p-projected database TDB|p: fcam:2, cb:1
Local frequent item: c:3
Frequent patterns containing p: p:3, pc:3
Find Pat Having Item m But No p
• Form the m-projected database TDB|m
  – Item p is excluded (why?)
  – Contains fca:2, fcab:1
  – Local frequent items: f, c, a
• Build an FP-tree for TDB|m

[m-projected FP-tree: root → f:3 → c:3 → a:3, with header table f, c, a]
Recursive Mining
• Patterns having m but no p can be mined recursively
• Optimization: enumerate patterns from a single-branch FP-tree
  – Enumerate all combinations
  – Support = that of the last item
    • m, fm, cm, am
    • fcm, fam, cam
    • fcam

[m-projected FP-tree: root → f:3 → c:3 → a:3]
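The single-branch enumeration above can be sketched with `itertools.combinations` (an illustration; the function name is hypothetical). Each combination of branch items, appended to the current suffix, is a frequent pattern whose support is the count of its deepest item:

```python
from itertools import combinations

def single_path_patterns(path, suffix):
    """All patterns from a single-path FP-tree. `path` is a list of
    (item, count) pairs from the root downward; `suffix` is the item(s)
    already fixed by the projection (here 'm')."""
    items = [i for i, _ in path]
    counts = dict(path)
    patterns = {}
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            # support = count of the deepest chosen item; for the empty
            # combination, the count of the top item (size of projected DB)
            last = combo[-1] if combo else items[0]
            patterns[combo + (suffix,)] = counts[last]
    return patterns
```

For the m-projected tree f:3 → c:3 → a:3 this yields the eight patterns m, fm, cm, am, fcm, fam, cam, fcam, each with support 3.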
Enumerate Patterns From a Single Prefix of FP-tree
• A (projected) FP-tree may have a single prefix
  – Reduce the single prefix into one node
  – Join the mining results of the two parts

[Figure: a tree whose single prefix path a1:n1 → a2:n2 → a3:n3 is followed by branches b1:m1 and c1:k1, c2:k2, c3:k3 is split into two parts — the prefix part (root → a1:n1 → a2:n2 → a3:n3, reduced to a single node r1) and the branching part (root r with the subtrees b1:m1, c1:k1, c2:k2, c3:k3) — whose mining results are then joined]
FP-growth
• Pattern-growth: recursively grow frequent patterns by pattern and database partitioning
• Algorithm
  – For each frequent item, construct its projected database, and then its projected FP-tree
  – Repeat the process on each newly created projected FP-tree
  – Until the resulting FP-tree is empty, or contains only one path – a single path generates all the combinations of its items, each of which is a frequent pattern
Scaling up by DB Projection
• What if an FP-tree cannot fit into memory? • Database projection
– Partition a database into a set of projected databases
– Construct and mine FP-tree once the projected database can fit into main memory
• Heuristic: Projected database shrinks quickly in many applications
Parallel vs. Partition Projection
• Parallel projection: form all projected databases at a time
• Partition projection: propagate projections

Tran. DB: fcamp, fcabm, fb, cbp, fcamp
  p-proj DB: fcam, cb, fcam
  m-proj DB: fcab, fca, fca
  b-proj DB: f, cb, …
  a-proj DB: fc, …
  c-proj DB: f, …
  f-proj DB: …
    am-proj DB: fc, fc, fc
    cm-proj DB: f, f, f
    …
Why Is FP-growth Efficient?
• Divide-and-conquer strategy
  – Decompose both the mining task and the DB
  – Lead to focused search of smaller databases
• Other factors
  – No candidate generation nor candidate test
  – Database compression using FP-tree
  – No repeated scan of the entire database
  – Basic operations: counting local frequent items and building FP-trees; no pattern search nor pattern matching
Major Costs in FP-growth
• Poor locality of FP-trees – Low hit rate of cache
• Building FP-trees – A stack of FP-trees
• Redundant information – Transaction abcd appears in a-, ab-, abc-, ac-, …, c- projected databases and FP-trees
Improving Locality
• Store FP-trees in a pre-order depth-first traversal list
• Ghoting et al., VLDB'05
H-Mine
• Goal: efficient in various settings
  – Dense vs. sparse, huge vs. memory-based data sets
• Moderate in space requirement
• Highlights
  – Effective and efficient memory-based structure and mining algorithm
  – Scalable algorithm for mining large databases by proper partitioning
  – Integration of H-mine and FP-growth
H-Structure
• Store frequent-item projections in main memory
  – Items in a transaction are sorted according to the f-list
  – Each frequent item in a transaction is stored with two fields: item-id and hyper-link
  – Header table H links transactions with the same first item
• Scan the database once

Tid | Items | Freq-item projection
100 | c, d, e, f, g, i | c, d, e, g
200 | a, c, d, e, m | a, c, d, e
300 | a, b, d, e, g, k | a, d, e, g
400 | a, c, d, h | a, c, d

F-list = a-c-d-e-g

[Figure: header table H (a:3, c:3, d:4, e:3, g:2) with hyper-links chaining the frequent-item projections of transactions 100–400]
Find Patterns Containing Item “a”
• Only search the a-projected database: transactions containing "a"
• The a-queue links all transactions in the a-projected database
  – Can be traversed efficiently

[Same H-struct figure as above; the a-queue chains the projections of transactions 200, 300, and 400]
Mining the a-Projected Database
• Build the a-header table Ha
• Traverse the a-queue once, find all local frequent items within the a-projected database
  – Local frequent items: c, d, and e
  – Patterns: ac, ad, and ae
• Link transactions having the same next frequent item

[Figure: header table Ha (c:2, d:3, e:2) built over the a-projected database, alongside the original header table H]
Why Is H-Mine(Mem) Efficient?
• No candidate generation
  – It is a pattern-growth method
• Search confined in a dedicated space
  – Does not physically construct memory structures for projected databases
  – The H-struct is used for all the mining
  – Information about projected databases is collected in header tables
• No frequent patterns stored in main memory
Mining in Large Databases
• What if the H-struct is too big for memory?
• Find global frequent items
• Partition the database into n parts
  – The H-struct for each part can be held in memory
  – Mine local patterns in each part using H-mine(Mem)
    • Use a relative minimum support threshold
• Consolidate global patterns in the third scan
How to Partition in H-mine?
• Partitioning in H-mine is straightforward
  – Overhead of header tables in H-mine(Mem) is small and predictable
  – Partitioning with Apriori is not easy
    • Hard to predict the space requirement of Apriori
• Global frequent items prune many local patterns in skewed partitions
Mining Dense Projected DB’s
• Challenges in dense datasets
  – Long patterns
  – Some patterns appearing in many transactions
• After projection, projected databases become denser
• Advantages of FP-tree
  – Compresses dense databases many times
  – No candidate generation
  – Sub-patterns can be enumerated from long patterns
• Build FP-trees for dense projected databases
  – Empirical switching point: 1%
Advantages of H-Mine
• Very small space overhead
• Absorbs nice features of FP-growth
• Creates no physical projected database
• Watches the density of projected databases, turns to FP-growth when necessary
• Proposes space-preserving mining
  – Scalable in very large databases
  – Feasible even with very small memory
  – Goes beyond frequent pattern mining
Further Developments
• OP – opportunistic projection (LPWH02) – Opportunistically choose between array-based
and tree-based representations of projected databases
• Diffsets for vertical mining (ZaGo03) – Only record the differences in the tids of a
candidate pattern from its generating frequent patterns
Effectiveness of Freq Pat Mining
• Too many patterns!
  – A pattern a1a2…an contains 2^n − 1 subpatterns
  – Understanding many patterns is difficult or even impossible for human users
• Non-focused mining
  – A manager may only be interested in patterns involving some items (s)he manages
  – A user is often interested in patterns satisfying some constraints
Itemset Lattice

[Itemset lattice: {} — A, B, C, D — AB, AC, AD, BC, BD, CD — ABC, ABD, ACD, BCD — ABCD]

Tid | transaction
10 | ABD
20 | ABC
30 | AD
40 | ABCD
50 | CD

Min_sup = 2

Length | Frequent itemsets
1 | A, B, C, D
2 | AB, AC, AD, BC, BD, CD
3 | ABC, ABD
Max-Patterns

[Itemset lattice: {} — A, B, C, D — AB, AC, AD, BC, BD, CD — ABC, ABD, ACD, BCD — ABCD]

Tid | transaction
10 | ABD
20 | ABC
30 | AD
40 | ABCD
50 | CD

Min_sup = 2

Length | Frequent itemsets
1 | A, B, C, D
2 | AB, AC, AD, BC, BD, CD
3 | ABC, ABD

Max-patterns (maximal frequent itemsets) here: ABC, ABD, and CD
Borders and Max-patterns
• Max-patterns: borders of frequent patterns
  – Any subset of a max-pattern is frequent
  – Any superset of a max-pattern is infrequent
  – Cannot generate rules

[Itemset lattice: {} — A, B, C, D — AB, AC, AD, BC, BD, CD — ABC, ABD, ACD, BCD — ABCD]
MaxMiner: Mining Max-patterns
• 1st scan: find frequent items
  – A, B, C, D, E
• 2nd scan: find supports for potential max-patterns
  – AB, AC, AD, AE, ABCDE
  – BC, BD, BE, BCDE
  – CD, CE, CDE
  – DE
• Since BCDE is a max-pattern, there is no need to check BCD, BDE, CDE in a later scan
• Bayardo, SIGMOD'98

Tid | Items
10 | A, B, C, D, E
20 | B, C, D, E
30 | A, C, D, F

Min_sup = 2
Patterns and Support Counts

[Itemset lattice with support counts: {} — A:4, B:4, C:3, D:4 — AB:3, AC:2, AD:3, BC:2, BD:2, CD:2 — ABC:2, ABD:2, ACD, BCD — ABCD]

Tid | transaction
10 | ABD
20 | ABC
30 | AD
40 | ABCD
50 | CD

Min_sup = 2

Len | Frequent itemsets
1 | A:4, B:4, C:3, D:4
2 | AB:3, AC:2, AD:3, BC:2, BD:2, CD:2
3 | ABC:2, ABD:2
Frequent Closed Patterns
• For a frequent itemset X, if there exists no item y ∉ X s.t. every transaction containing X also contains y, then X is a frequent closed pattern
  – "acdf" is a frequent closed pattern
• Concise representation of frequent patterns
  – Can generate non-redundant rules
• Reduces # of patterns and rules
• N. Pasquier et al., ICDT'99

TID | Items
10 | a, c, d, e, f
20 | a, b, e
30 | c, e, f
40 | a, c, d, f
50 | c, e, f

Min_sup = 2
CLOSET for Frequent Closed Patterns
• F-list: list of all frequent items in support-ascending order
  – F-list: d-a-f-e-c
• Divide the search space
  – Patterns having d
  – Patterns having a but no d, etc.
• Find frequent closed patterns recursively
  – Every transaction having d also has c, f, a → cfad is a frequent closed pattern
• PHM'00

TID | Items
10 | a, c, d, e, f
20 | a, b, e
30 | c, e, f
40 | a, c, d, f
50 | c, e, f

Min_sup = 2
The CHARM Method
• Use vertical data format: t(AB) = {T1, T12, …}
• Derive closed patterns based on vertical intersections
  – t(X) = t(Y): X and Y always happen together
  – t(X) ⊂ t(Y): a transaction having X always has Y
• Use diffsets to accelerate mining
  – Only keep track of differences of tids
  – t(X) = {T1, T2, T3}, t(Xy) = {T1, T3}
  – Diffset(Xy, X) = {T2}
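The diffset idea above is a single set difference; a minimal sketch (illustrative function name):

```python
def diffset(tid_x, tid_xy):
    """Diffset(Xy, X) = t(X) − t(Xy); then sup(Xy) = sup(X) − |Diffset(Xy, X)|.
    Diffsets are often much smaller than the tid-lists themselves."""
    return set(tid_x) - set(tid_xy)
```

On the slide's example, `diffset({'T1','T2','T3'}, {'T1','T3'})` is {T2}, so sup(Xy) = 3 − 1 = 2.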
Closed and Max-patterns
• Closed pattern mining algorithms can be adapted to mine max-patterns – A max-pattern must be closed
• Depth-first search methods have advantages over breadth-first search ones – Why?
Condensed Freq Pattern Base
• Practical observation: in many applications, a good approximation of the support count could be good enough
  – Support = 10000 → support in range 10000 ± 1%
• Making frequent pattern mining more realistic
  – A small deviation has a minor effect on analysis
  – A condensed FP-base leads to more effective mining
  – Computing a condensed FP-base may lead to more efficient mining
Condensed FP-base Mining
• Compute a condensed FP-base with a guaranteed maximal error bound.
• Given: a transaction database, a user-specified support threshold, and a user-specified error bound
• Find a subset of frequent patterns & a function – Determine whether a pattern is frequent – Determine the support range
• Pei et al. ICDM’02
An Example

[Figure: a sample transaction database and a condensed FP-base computed with support threshold min_sup = 1 and error bound k = 2]

Another Base

[Figure: an alternative condensed FP-base for the same database, with the same min_sup = 1 and error bound k = 2]
Approximation Functions
• NOT unique
  – Different condensed FP-bases have different approximation functions
• Optimization on space requirement
  – The less space required, the better the compression effect
  – Compression ratio
Constraint-based Data Mining
• Find all the patterns in a database autonomously?
  – The patterns could be too many but not focused!
• Data mining should be interactive
  – User directs what to be mined
• Constraint-based mining
  – User flexibility: provides constraints on what to be mined
  – System optimization: push constraints for efficient mining
Constraints in Data Mining
• Knowledge type constraint – classification, association, etc.
• Data constraint — using SQL-like queries – find product pairs sold together in stores in New York
• Dimension/level constraint – in relevance to region, price, brand, customer category
• Rule (or pattern) constraint – small sales (price < $10) triggers big sales (sum >$200)
• Interestingness constraint – strong rules: support and confidence
Constrained Mining vs. Search
• Constrained mining vs. constraint-based search
  – Both aim at reducing the search space
  – Finding all patterns vs. some (or one) answers satisfying the constraints
  – Constraint-pushing vs. heuristic search
  – An interesting research problem on integrating both
• Constrained mining vs. DBMS query processing
  – Database query processing requires finding all answers
  – Constrained pattern mining shares a similar philosophy as pushing selections deeply into query processing
Optimization
• Mining frequent patterns with constraint C
  – Sound: only find patterns satisfying constraint C
  – Complete: find all patterns satisfying constraint C
• A naïve solution
  – Constraint test as a post-processing step
• More efficient approaches
  – Analyze the properties of constraints
  – Push constraints as deeply as possible into frequent pattern mining
Anti-Monotonicity
• Anti-monotonicity
  – If an itemset S violates the constraint, so does any of its supersets
  – sum(S.Price) ≤ v is anti-monotone
  – sum(S.Price) ≥ v is not anti-monotone
• Example
  – C: range(S.profit) ≤ 15
  – Itemset ab violates C
  – So does every superset of ab

TID | Transaction
10 | a, b, c, d, f
20 | b, c, d, f, g, h
30 | a, c, d, e, f
40 | c, e, f, g

TDB (min_sup = 2)

Item | Profit
a | 40
b | 0
c | −20
d | 10
e | −30
f | 30
g | 20
h | −10
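The pruning test for the example constraint can be sketched directly (illustrative function name; profits from the table above). Since range(ab.profit) = 40 − 0 = 40 > 15, ab and its entire subtree are pruned:

```python
profit = {'a': 40, 'b': 0, 'c': -20, 'd': 10, 'e': -30, 'f': 30, 'g': 20, 'h': -10}

def violates_range(itemset, v=15):
    """range(S.profit) <= v is anti-monotone: adding items can only widen
    max - min, so once an itemset violates it, every superset does too."""
    vals = [profit[i] for i in itemset]
    return max(vals) - min(vals) > v
```

By contrast, af has range 40 − 30 = 10 and survives the test.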
Anti-monotonic Constraints

Constraint | Anti-monotone?
v ∈ S | no
S ⊇ V | no
S ⊆ V | yes
min(S) ≤ v | no
min(S) ≥ v | yes
max(S) ≤ v | yes
max(S) ≥ v | no
count(S) ≤ v | yes
count(S) ≥ v | no
sum(S) ≤ v (a ∈ S, a ≥ 0) | yes
sum(S) ≥ v (a ∈ S, a ≥ 0) | no
range(S) ≤ v | yes
range(S) ≥ v | no
avg(S) θ v, θ ∈ {=, ≤, ≥} | convertible
support(S) ≥ ξ | yes
support(S) ≤ ξ | no
Monotonicity
• Monotonicity
  – If an itemset S satisfies the constraint, so does any of its supersets
  – sum(S.Price) ≥ v is monotone
  – min(S.Price) ≤ v is monotone
• Example
  – C: range(S.profit) ≥ 15
  – Itemset ab satisfies C
  – So does every superset of ab

[Same TDB (min_sup = 2) and item-profit table as on the Anti-Monotonicity slide]
Monotonic Constraints

Constraint | Monotone?
v ∈ S | yes
S ⊇ V | yes
S ⊆ V | no
min(S) ≤ v | yes
min(S) ≥ v | no
max(S) ≤ v | no
max(S) ≥ v | yes
count(S) ≤ v | no
count(S) ≥ v | yes
sum(S) ≤ v (a ∈ S, a ≥ 0) | no
sum(S) ≥ v (a ∈ S, a ≥ 0) | yes
range(S) ≤ v | no
range(S) ≥ v | yes
avg(S) θ v, θ ∈ {=, ≤, ≥} | convertible
support(S) ≥ ξ | no
support(S) ≤ ξ | yes
Converting “Tough” Constraints
• Convert tough constraints into anti-monotone or monotone ones by properly ordering items
• Examine C: avg(S.profit) ≥ 25
  – Order items in value-descending order
    • <a, f, g, d, b, h, c, e>
  – If an itemset afb violates C
    • So does afbh, afb*
    • It becomes anti-monotone!

[Same TDB (min_sup = 2) and item-profit table as on the Anti-Monotonicity slide]
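The converted check can be sketched as a prefix test (illustrative function name; profits from the table). With items enumerated in value-descending order, any item appended to a prefix is no larger than the items already in it, so an average that has fallen below the threshold can never recover; avg(afb.profit) = (40 + 30 + 0)/3 ≈ 23.3 < 25, so afb and all its extensions are pruned:

```python
profit = {'a': 40, 'f': 30, 'g': 20, 'd': 10, 'b': 0, 'h': -10, 'c': -20, 'e': -30}

def prefix_violates_avg(prefix, v=25):
    """Under the value-descending enumeration order, avg(S.profit) >= v
    behaves anti-monotonically on prefixes: later items only lower the average."""
    vals = [profit[i] for i in prefix]
    return sum(vals) / len(vals) < v
```

The prefix af (average 35) passes and is extended; afb fails and its subtree is cut.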
Convertible Constraints
• Let R be an order of items • Convertible anti-monotone
– If an itemset S violates a constraint C, so does every itemset having S as a prefix w.r.t. R
– Ex. avg(S) ≤ v w.r.t. item value descending order • Convertible monotone
– If an itemset S satisfies constraint C, so does every itemset having S as a prefix w.r.t. R
– Ex. avg(S) ≥ v w.r.t. item value descending order
Strongly Convertible Constraints
• avg(X) ≥ 25 is convertible anti-monotone w.r.t. item-value-descending order R: <a, f, g, d, b, h, c, e>
  – If itemset af violates the constraint C, so does every itemset with af as a prefix, such as afd
• avg(X) ≥ 25 is convertible monotone w.r.t. item-value-ascending order R⁻¹: <e, c, h, b, d, g, f, a>
  – If itemset d satisfies the constraint C, so do itemsets df and dfa, which have d as a prefix
• Thus, avg(X) ≥ 25 is strongly convertible

Item | Profit
a | 40
b | 0
c | −20
d | 10
e | −30
f | 30
g | 20
h | −10
Convertible Constraints

Constraint | Convertible anti-monotone | Convertible monotone | Strongly convertible
avg(S) ≤ v, ≥ v | yes | yes | yes
median(S) ≤ v, ≥ v | yes | yes | yes
sum(S) ≤ v (items of any value, v ≥ 0) | yes | no | no
sum(S) ≤ v (items of any value, v ≤ 0) | no | yes | no
sum(S) ≥ v (items of any value, v ≥ 0) | no | yes | no
sum(S) ≥ v (items of any value, v ≤ 0) | yes | no | no
Can Apriori Handle Convertible Constraints?
• A convertible constraint that is neither monotone, anti-monotone, nor succinct cannot be pushed deep into an Apriori mining algorithm
  – Within the level-wise framework, no direct pruning based on the constraint can be made
  – Itemset df violates constraint C: avg(X) ≥ 25
  – Since adf satisfies C, Apriori needs df to assemble adf; df cannot be pruned
• But it can be pushed into the frequent-pattern growth framework!

Item | Value
a | 40
b | 0
c | −20
d | 10
e | −30
f | 30
g | 20
h | −10
Mining With Convertible Constraints • C: avg(S.profit) ≥ 25 • List the items in every transaction in
value descending order R: – <a, f, g, d, b, h, c, e> – C is convertible anti-monotone w.r.t. R
• Scan the transaction DB once – remove infrequent items
• Item h in transaction 40 is dropped
– Itemsets a and f are good

TID  Transaction
10   a, f, d, b, c
20   f, g, d, b, c
30   a, f, d, c, e
40   f, g, h, c, e
TDB (min_sup=2)
Item Profit
a    40
f    30
g    20
d    10
b    0
h    -10
c    -20
e    -30
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 84
Not Every Pattern Is Interesting!
• Trivial patterns – Pregnant → Female [100% confidence]
• Misleading patterns – Play basketball → eat cereal [40%, 66.7%]
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 85
Basketball Not basketball Sum (row)
Cereal 2000 1750 3750
Not cereal 1000 250 1250
Sum(col.) 3000 2000 5000
Evaluation Criteria
• Objective interestingness measures – Examples: support, patterns formed by mutually
independent items – Domain independent
• Subjective measures – Examples: domain knowledge, templates/
constraints
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 86
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 87
Correlation and Lift
• P(B|A)/P(B) is called the lift of rule A → B
• Play basketball → eat cereal (lift: 0.89) • Play basketball → not eat cereal (lift: 1.33)
corr(A,B) = P(A ∪ B) / (P(A) P(B)) = P(AB) / (P(A) P(B))
Basketball Not basketball Sum (row)
Cereal 2000 1750 3750
Not cereal 1000 250 1250
Sum(col.) 3000 2000 5000
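The two lift values quoted above can be recomputed directly from the contingency table. A small sketch (variable names are illustrative; the counts are the table's):

```python
# Counts from the 2x2 basketball/cereal table: 5000 transactions in total.
n = 5000
sup_b, sup_c, sup_bc = 3000, 3750, 2000   # basketball, cereal, both

p_b, p_c, p_bc = sup_b / n, sup_c / n, sup_bc / n
lift_b_c = p_bc / (p_b * p_c)                   # basketball -> eat cereal
lift_b_nc = (p_b - p_bc) / (p_b * (1 - p_c))    # basketball -> not eat cereal
```

This reproduces lift ≈ 0.89 for the misleading rule and ≈ 1.33 for its negation.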
Contingency table

Table 6.7. A 2-way contingency table for variables A and B.

       B      ~B
A      f11    f10    f1+
~A     f01    f00    f0+
       f+1    f+0    N
counts tabulated in a contingency table. Table 6.7 shows an example of a contingency table for a pair of binary variables, A and B. We use the notation ~A (~B) to indicate that A (B) is absent from a transaction. Each entry fij in this 2 × 2 table denotes a frequency count. For example, f11 is the number of times A and B appear together in the same transaction, while f01 is the number of transactions that contain B but not A. The row sum f1+ represents the support count for A, while the column sum f+1 represents the support count for B. Finally, even though our discussion focuses mainly on asymmetric binary variables, note that contingency tables are also applicable to other attribute types such as symmetric binary, nominal, and ordinal variables.
Limitations of the Support-Confidence Framework  The existing association rule mining formulation relies on the support and confidence measures to eliminate uninteresting patterns. The drawback of support was previously described in Section 6.8, in which many potentially interesting patterns involving low support items might be eliminated by the support threshold. The drawback of confidence is more subtle and is best demonstrated with the following example.
Example 6.3. Suppose we are interested in analyzing the relationship between people who drink tea and coffee. We may gather information about the beverage preferences among a group of people and summarize their responses into a table such as the one shown in Table 6.8.
Table 6.8. Beverage preferences among a group of 1000 people.

        Coffee   ~Coffee
Tea     150      50       200
~Tea    650      150      800
        800      200      1000
Property of Lift
• If A and B are independent, lift = 1 • If A and B are positively correlated, lift > 1 • If A and B are negatively correlated, lift < 1 • Limitation: lift is sensitive to P(A) and P(B)
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 88
Table 6.9. Contingency tables for the word pairs {p, q} and {r, s}.

       p     ~p                r     ~r
q      880   50    930    s    20    50    70
~q     50    20    70     ~s   50    880   930
       930   70    1000        70    930   1000
This equation follows from the standard approach of using simple fractions as estimates for probabilities. The fraction f11/N is an estimate for the joint probability P(A, B), while f1+/N and f+1/N are the estimates for P(A) and P(B), respectively. If A and B are statistically independent, then P(A, B) = P(A) × P(B), thus leading to the formula shown in Equation 6.6. Using Equations 6.5 and 6.6, we can interpret the measure as follows:
I(A, B)  = 1, if A and B are independent;
         > 1, if A and B are positively correlated;
         < 1, if A and B are negatively correlated.   (6.7)
For the tea-coffee example shown in Table 6.8, I = 0.15 / (0.2 × 0.8) = 0.9375, thus suggesting a slight negative correlation between tea drinkers and coffee drinkers.
Limitations of Interest Factor  We illustrate the limitation of interest factor with an example from the text mining domain. In the text domain, it is reasonable to assume that the association between a pair of words depends on the number of documents that contain both words. For example, because of their stronger association, we expect the words data and mining to appear together more frequently than the words compiler and mining in a collection of computer science articles.
Table 6.9 shows the frequency of occurrences between two pairs of words, {p, q} and {r, s}. Using the formula given in Equation 6.5, the interest factor for {p, q} is 1.02 and for {r, s} is 4.08. These results are somewhat troubling for the following reasons. Although p and q appear together in 88% of the documents, their interest factor is close to 1, which is the value when p and q are statistically independent. On the other hand, the interest factor for {r, s} is higher than {p, q} even though r and s seldom appear together in the same document. Confidence is perhaps the better choice in this situation because it considers the association between p and q (94.6%) to be much stronger than that between r and s (28.6%).
lift(p, q) < lift(r, s)!
Leverage
• The difference between the observed and expected joint probability of XY assuming X and Y are independent
• An “absolute” measure of the surprisingness of a rule – Should be used together with lift
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 89
leverage(X → Y) = P(XY) − P(X) P(Y)
Conviction
• The expected error of a rule
• Considers not only the joint distribution of X and Y
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 90
conv(X → Y) = P(X) P(~Y) / P(X~Y) = 1 / lift(X → ~Y)
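Leverage and conviction can both be evaluated on the basketball/cereal table from the earlier slide. A hedged sketch (names are illustrative):

```python
# Probabilities from the 5000-transaction basketball/cereal table.
p_x, p_y = 3000 / 5000, 3750 / 5000   # P(basketball), P(cereal)
p_xy = 2000 / 5000                    # P(basketball and cereal)

# Leverage: observed minus expected joint probability; negative here,
# i.e. fewer co-occurrences than independence would predict.
leverage = p_xy - p_x * p_y

# Conviction: P(X)P(~Y) / P(X~Y); note P(X~Y) = P(X) - P(XY).
conviction = (p_x * (1 - p_y)) / (p_x - p_xy)
```

Both measures agree with lift here: the rule basketball → cereal is worse than independence (leverage < 0, conviction < 1).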
16
Odds Ratio
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 91
odds(Y | X) = (P(XY)/P(X)) / (P(X~Y)/P(X)) = P(XY) / P(X~Y)

odds(Y | ~X) = (P(~XY)/P(~X)) / (P(~X~Y)/P(~X)) = P(~XY) / P(~X~Y)

oddsratio(X → Y) = odds(Y | X) / odds(Y | ~X) = (P(XY) · P(~X~Y)) / (P(X~Y) · P(~XY))
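On a 2x2 table with fixed n the probabilities cancel, leaving the familiar cross-product ratio. A hedged sketch using the basketball/cereal counts (names are illustrative):

```python
# Cell counts: XY, X~Y, ~XY, ~X~Y for X = basketball, Y = cereal.
a, b, c, d = 2000, 1000, 1750, 250

odds_given_x = a / b            # odds(Y | X)
odds_given_notx = c / d         # odds(Y | ~X)
odds_ratio = odds_given_x / odds_given_notx   # equals (a*d) / (b*c)
```

An odds ratio below 1 again signals the negative association between playing basketball and eating cereal.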
χ2
• Suppose attribute A has c distinct values a1, …, ac and attribute B has r distinct values b1, …, br
• The χ2 value (Pearson's χ2 statistic) is
– oij and eij are the observed frequency and the expected frequency of the joint event (ai, bj), respectively
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 92
χ2 = Σ_{i=1}^{c} Σ_{j=1}^{r} (oij − eij)² / eij
Example
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 93
Basketball Not basketball Sum (row)
Cereal 2000 1750 3750
Not cereal 1000 250 1250
Sum(col.) 3000 2000 5000
• The χ2 value (277.8) is far larger than 0, the value under independence • count(basketball, cereal) = 2000 <
expected count (2250) → playing basketball and eating cereal are negatively correlated
χ2 = (2000 − 2250)²/2250 + (1750 − 1500)²/1500 + (1000 − 750)²/750 + (250 − 500)²/500 = 277.8
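The χ2 computation above can be sketched generically: expected counts come from the row and column marginals. A minimal sketch (names are illustrative):

```python
# Pearson chi-square for the 2x2 basketball/cereal table.
observed = [[2000, 1750], [1000, 250]]    # rows: cereal / not cereal
row = [sum(r) for r in observed]          # [3750, 1250]
col = [sum(c) for c in zip(*observed)]    # [3000, 2000]
n = sum(row)

# e_ij = row_i * col_j / n; sum (o - e)^2 / e over all cells.
chi2 = sum((observed[i][j] - row[i] * col[j] / n) ** 2 / (row[i] * col[j] / n)
           for i in range(2) for j in range(2))
```

The four terms are 27.8, 41.7, 83.3, and 125, matching the slide's total of 277.8.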
Φ-coefficient
• –1: if A and B are perfectly negatively correlated • 1: if A and B are perfectly positively correlated • 0: if A and B are statistically independent • Drawback: the Φ-coefficient puts the same
weight on co-occurrence and co-absence
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 94
Φ = (P(AB) P(~A~B) − P(A~B) P(~AB)) / sqrt(P(A) P(B) P(~A) P(~B))
IS Measure
• Biased toward frequent co-occurrence • Equivalent to cosine similarity for binary
variables (bit vectors) • The geometric mean of the confidences of the two rules between a pair of
binary random variables
• Drawback: the value depends on P(A) and P(B) – A similar drawback to lift
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 95
IS(A, B) = sqrt(lift(A, B) · P(A, B)) = P(A, B) / sqrt(P(A) P(B))

IS(A, B) = sqrt( (P(A, B) / P(A)) · (P(A, B) / P(B)) ) = sqrt( conf(A → B) · conf(B → A) )
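The two forms of IS are algebraically identical, which is easy to confirm numerically. A hedged sketch on the basketball/cereal probabilities (names are illustrative):

```python
import math

# IS computed two equivalent ways: cosine form and geometric
# mean of the two rule confidences.
p_a, p_b, p_ab = 0.6, 0.75, 0.4   # basketball, cereal, both

is_cosine = p_ab / math.sqrt(p_a * p_b)
is_confs = math.sqrt((p_ab / p_a) * (p_ab / p_b))
```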
More Measures
• All confidence: min{ P(A|B), P(B|A) } • Max confidence: max{ P(A|B), P(B|A) } • The Kulczynski measure: ½ (P(A|B) + P(B|A))
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 96
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 97
Comparing Measures

            Milk   No Milk   Sum (row)
Coffee      mc     ~mc       c
No Coffee   m~c    ~m~c      ~c
Sum (col.)  m      ~m        Σ

Contingency table

Transaction databases and their contingency tables
Table 6.9: Comparison of six pattern evaluation measures using contingency tables for a variety of data sets.

Data Set  mc      ~mc     m~c      ~m~c     χ2     lift   all conf.  max conf.  Kulc.  cosine
D1        10,000  1,000   1,000    100,000  90557  9.26   0.91       0.91       0.91   0.91
D2        10,000  1,000   1,000    100      0      1      0.91       0.91       0.91   0.91
D3        100     1,000   1,000    100,000  670    8.44   0.09       0.09       0.09   0.09
D4        1,000   1,000   1,000    100,000  24740  25.75  0.5        0.5        0.5    0.5
D5        1,000   100     10,000   100,000  8173   9.18   0.09       0.91       0.5    0.29
D6        1,000   10      100,000  100,000  965    1.97   0.01       0.99       0.5    0.10
positively associated in both data sets by producing a measure value of 0.91. However, lift and χ2 generate dramatically different measure values for D1 and D2 due to their sensitivity to ~m~c. In fact, in many real-world scenarios ~m~c is usually huge and unstable. For example, in a market basket database, the total number of transactions could fluctuate on a daily basis and overwhelmingly exceed the number of transactions containing any particular itemset. Therefore, a good interestingness measure should not be affected by transactions that do not contain the itemsets of interest; otherwise, it would generate unstable results, as illustrated in D1 and D2.

Similarly, in D3, the four new measures correctly show that m and c are strongly negatively associated because the mc to m ratio equals the mc to c ratio, that is, 100/1,100 = 9.1%. However, lift and χ2 both contradict this in an incorrect way: their values for D3 are between those for D1 and D2.

For data set D4, both lift and χ2 indicate a highly positive association between m and c, whereas the others indicate a "neutral" association because the mc : ~mc ratio equals the mc : m~c ratio, which is 1. This means that if a customer buys coffee (or milk), the probability that she will also purchase milk (or coffee) is exactly 50%.

"Why are lift and χ2 so poor at distinguishing pattern association relationships in the above transactional data sets?" To answer this, we have to consider the null-transactions. A null-transaction is a transaction that does not contain any of the itemsets being examined. In our example, ~m~c represents the number of null-transactions. Lift and χ2 have difficulty distinguishing interesting pattern association relationships because they are both strongly influenced by ~m~c. Typically, the number of null-transactions can outweigh the number of individual purchases, because many people may buy neither milk nor coffee. On the other hand, the other four measures are good indicators of interesting pattern associations because their definitions remove the influence of ~m~c (that is, they are not influenced by the number of null-transactions).

The above discussion shows that it is highly desirable to have a measure whose value is independent of the number of null-transactions. A measure is null-invariant if its value is free from the influence of null-transactions.
χ2 and lift do not perform well on those data sets, since they are sensitive to ~m~c
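The null-invariance argument can be demonstrated on D1 and D2 from the table, where only the ~m~c cell differs. A hedged sketch (function and variable names are illustrative):

```python
# Lift is sensitive to null-transactions (~m~c); Kulczynski is not.
def measures(mc, m_nc, nm_c, nm_nc):
    n = mc + m_nc + nm_c + nm_nc
    sup_m, sup_c = mc + m_nc, mc + nm_c
    lift = (mc / n) / ((sup_m / n) * (sup_c / n))
    kulc = 0.5 * (mc / sup_m + mc / sup_c)
    return lift, kulc

lift1, kulc1 = measures(10_000, 1_000, 1_000, 100_000)   # D1
lift2, kulc2 = measures(10_000, 1_000, 1_000, 100)       # D2: only ~m~c changes
```

Lift swings from 9.26 to 1 while Kulczynski stays at 0.91, reproducing the table's D1/D2 rows.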
Imbalance Ratio
• Assess the imbalance of two itemsets A and B in rule implications
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 98
IR(A, B) = |P(A) − P(B)| / (P(A) + P(B) − P(A ∪ B))
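The imbalance ratio can be computed directly from support counts, since the probabilities share a common denominator. A hedged sketch using the D1 and D5 rows of the comparison table (names are illustrative; sup_ab denotes the number of transactions containing both itemsets):

```python
# IR from support counts: |sup(A) - sup(B)| / (sup(A) + sup(B) - sup(AB)).
def imbalance_ratio(sup_a, sup_b, sup_ab):
    return abs(sup_a - sup_b) / (sup_a + sup_b - sup_ab)

ir_d1 = imbalance_ratio(11_000, 11_000, 10_000)  # D1: balanced supports -> 0
ir_d5 = imbalance_ratio(1_100, 11_000, 1_000)    # D5: strongly imbalanced
```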
Properties of Measures
• Symmetry: is M(A → B) = M(B → A)?
• Null-transaction dependence (null addition invariance): is ~A~B used in the measure?
• Inversion invariance: the value does not change if f11 and f10 are exchanged with f00 and f01
• Scaling invariance: does the measure remain unchanged if the contingency table [f11, f10, f01, f00] is changed to [k1k3f11, k2k3f10, k1k4f01, k2k4f00]?
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 99
Measuring 3 Random Variables
• 3 dimensional contingency table
• For a k-itemset {i1, i2, …, ik}, the condition for statistical independence is
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 100
Table 6.18. Example of a three-dimensional contingency table.

        c                           ~c
        b      ~b                   b      ~b
a       f111   f101   f1+1     a    f110   f100   f1+0
~a      f011   f001   f0+1     ~a   f010   f000   f0+0
        f+11   f+01   f++1          f+10   f+00   f++0
A marginal count such as f1+1 is the number of transactions that contain a and c, irrespective of whether b is present in the transaction.
Given a k-itemset {i1, i2, …, ik}, the condition for statistical independence can be stated as follows:

f_{i1 i2 … ik} = (f_{i1+…+} × f_{+i2…+} × … × f_{++…ik}) / N^(k−1).   (6.12)
With this definition, we can extend objective measures such as interest factor and PS, which are based on deviations from statistical independence, to more than two variables:

I = (N^(k−1) × f_{i1 i2 … ik}) / (f_{i1+…+} × f_{+i2…+} × … × f_{++…ik})

PS = f_{i1 i2 … ik} / N − (f_{i1+…+} × f_{+i2…+} × … × f_{++…ik}) / N^k
Another approach is to define the objective measure as the maximum, minimum, or average value for the associations between pairs of items in a pattern. For example, given a k-itemset X = {i1, i2, …, ik}, we may define the φ-coefficient for X as the average φ-coefficient between every pair of items (ip, iq) in X. However, because the measure considers only pairwise associations, it may not capture all the underlying relationships within a pattern.
Analysis of multidimensional contingency tables is more complicated because of the presence of partial associations in the data. For example, some associations may appear or disappear when conditioned upon the value of certain variables. This problem is known as Simpson's paradox and is described in the next section. More sophisticated statistical techniques are available to analyze such relationships, e.g., loglinear models, but these techniques are beyond the scope of this book.
f_{i1 i2 … ik} = (f_{i1+…+} · f_{+i2…+} · … · f_{++…ik}) / N^(k−1)
Measuring More Random Variables
• Some measures, such as lift and PS, which are based on statistical independence, can be extended
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 101
I = (N^(k−1) · f_{i1 i2 … ik}) / (f_{i1+…+} · f_{+i2…+} · … · f_{++…ik})

PS = f_{i1 i2 … ik} / N − (f_{i1+…+} · f_{+i2…+} · … · f_{++…ik}) / N^k
Simpson’s Paradox
• A trend that appears in different groups of data disappears when these groups are combined, and the reverse trend appears for the aggregate data – Also known as Yule-Simpson effect – Often encountered in social-science and
medical-science statistics – Particularly confounding when frequency data
are unduly given causal interpretations
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 102
Kidney Stone Treatment Example
• Which treatment, A or B, is better?
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 103
              Treatment A        Treatment B
Small stones  G1: 81/87 = 93%    G2: 234/270 = 87%
Large stones  G3: 192/263 = 73%  G4: 55/80 = 69%
Overall       273/350 = 78%      289/350 = 83%
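The reversal in this table is easy to verify programmatically. A hedged sketch using the counts above (names are illustrative):

```python
# Kidney-stone example: (successes, patients) per treatment and group.
a_small, a_large = (81, 87), (192, 263)    # Treatment A
b_small, b_large = (234, 270), (55, 80)    # Treatment B

def rate(successes, total):
    return successes / total

# A wins within each stone-size group ...
a_beats_b_small = rate(*a_small) > rate(*b_small)   # 93% vs 87%
a_beats_b_large = rate(*a_large) > rate(*b_large)   # 73% vs 69%
# ... yet B wins on the aggregate, because the group sizes differ:
a_overall = rate(81 + 192, 87 + 263)                # 273/350 = 78%
b_overall = rate(234 + 55, 270 + 80)                # 289/350 = 83%
```

Treatment A was given mostly to the harder (large-stone) cases, which drags its aggregate rate down: that is Simpson's paradox.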
Berkeley Gender Bias Case
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 104
         Applicants  Admitted
Men      8442        44%
Women    4321        35%

Department  Men: Applicants  Admitted  Women: Applicants  Admitted
A           825              62%       108                82%
B           560              63%       25                 68%
C           325              37%       593                34%
D           417              33%       375                35%
E           191              28%       393                24%
F           272              6%        341                7%
Fisher Exact Test
• Directly test whether a rule X → Y is productive by comparing its confidence with those of its generalizations W → Y, where W is a subset of X – Let X = W ∪ Z
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 105
W        Y       Not Y
Z        a       b       a + b
Not Z    c       d       c + d
         a + c   b + d   sup(W)

a = sup(WZY) = sup(XY), b = sup(WZ~Y) = sup(X~Y)
c = sup(W~ZY), d = sup(W~Z~Y)
Marginals
• Row marginals
• Column marginals
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 106
W        Y       Not Y
Z        a       b       a + b
Not Z    c       d       c + d
         a + c   b + d   sup(W)

a + b = sup(WZ) = sup(X), c + d = sup(W~Z)
a + c = sup(WY), b + d = sup(W~Y)
oddsratio = ( (a/(a+b)) / (b/(a+b)) ) / ( (c/(c+d)) / (d/(c+d)) ) = ad / bc
Hypothesis
• H0: Z and Y are independent given W – X → Y is not productive given W → Y
• If Z and Y are independent, then
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 107
a = (a + b)(a + c) / n,  b = (a + b)(b + d) / n
c = (c + d)(a + c) / n,  d = (c + d)(b + d) / n

W        Y       Not Y
Z        a       b       a + b
Not Z    c       d       c + d
         a + c   b + d   sup(W)

oddsratio = ad / bc = 1
Relation between a and b, c, d
• Assumption: the row and column marginals are fixed
• The value of a uniquely determines b, c, and d
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 108
W        Y       Not Y
Z        a       b       a + b
Not Z    c       d       c + d
         a + c   b + d   sup(W)
Probability Mass Function of a
• The probability mass function of observing the value of a in the contingency table is given by the hypergeometric distribution – the probability of choosing s successes in t
trials using sampling without replacement from a finite population of size T that has S successes in total
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 109
P(s | t, S, T) = ( C(S, s) · C(T − S, t − s) ) / C(T, t)
Probability Mass Function of a
• An occurrence of Z is a success • T = sup(W) = n • W always occurs, so the total number of
successes is sup(WZ) → S = a + b, t = a + c
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 110
P(a | a + c, a + b, n) = ( C(a+b, a) · C(n − (a+b), (a+c) − a) ) / C(n, a+c)
                       = ( C(a+b, a) · C(c+d, c) ) / C(n, a+c)
                       = ( (a+b)! (c+d)! (a+c)! (b+d)! ) / ( n! a! b! c! d! )
Calculating the p-value
• Assuming that the null hypothesis is true, the p-value is the probability of obtaining a test statistic at least as extreme as the one actually observed
• If p-value is very small (e.g., 0.01), the null hypothesis can be rejected
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 111
p-value(a) = Σ_{i=0}^{min(b,c)} P(a + i | a + c, a + b, n)
           = Σ_{i=0}^{min(b,c)} ( (a+b)! (c+d)! (a+c)! (b+d)! ) / ( n! (a+i)! (b−i)! (c−i)! (d+i)! )
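The hypergeometric probability and the one-sided p-value can be implemented directly from these formulas. A hedged sketch (the 2x2 counts are an illustrative example, not from the slides):

```python
from math import comb

# P(a | a+c, a+b, n): hypergeometric probability of the observed table.
def hypergeom_p(a, b, c, d):
    n = a + b + c + d
    return comb(a + b, a) * comb(c + d, c) / comb(n, a + c)

# One-sided Fisher p-value: sum over tables at least as extreme
# (a increased by i, with all marginals held fixed).
def fisher_p_value(a, b, c, d):
    return sum(hypergeom_p(a + i, b - i, c - i, d + i) for i in range(min(b, c) + 1))

p = fisher_p_value(8, 2, 1, 5)   # illustrative table [[8, 2], [1, 5]]
```

A quick sanity check: over all tables sharing the same marginals, the probabilities must sum to 1.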
Permutation (Randomization) Test
• Determine the distribution of a given test statistic by randomly modifying the observed data several times to obtain a random sample of the data sets – The modified data sets are used for significance
testing • Compute the empirical probability mass
function (EPMF) • Generate the empirical cumulative
distribution function Jian Pei: Big Data Analytics -- Frequent Pattern Mining 112
Compute p-value on Statistics
• The empirical cumulative distribution function
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 113
F(x) = P(Θ ≤ x) = (1/k) Σ_{i=1}^{k} I(θi ≤ x)

p-value(θ) = 1 − F(θ)
Swap Randomization
• In a permutation test, what characteristics should be preserved in permutation?
• Swap randomization keeps the column and row marginals invariant – The support of each item does not change – The length of each transaction does not change
• Swap two items in two transactions
• Conduct a certain number of swaps to make up a new data set
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 114
Bootstrap Sampling
• A transaction database is just a sample from a larger population – What is the frequency (or range of possible
frequencies) of X in the underlying population? • Given a test assessment statistic θ, how can
we infer the confidence interval for the possible values of θ at a desired confidence level α?
• Bootstrap sampling: sampling with replacement
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 115
Calculating Statistic Range
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 116
F(x) = P(Θ ≤ x) = (1/k) Σ_{i=1}^{k} I(θi ≤ x)

Let v_{(1−α)/2} = F^{−1}((1−α)/2) and v_{(1+α)/2} = F^{−1}((1+α)/2). Then

P(Θ ∈ [v_{(1−α)/2}, v_{(1+α)/2}]) = F((1+α)/2) − F((1−α)/2) = (1+α)/2 − (1−α)/2 = α

Thus, the α confidence interval for the test statistic Θ is [v_{(1−α)/2}, v_{(1+α)/2}]
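The bootstrap interval above can be sketched end to end: resample the database with replacement, recompute the statistic each time, and take the empirical quantiles. A hedged sketch on a toy database (all names and data are illustrative):

```python
import random

# Bootstrap confidence interval for the support of item 'a' (alpha = 0.90).
random.seed(42)
db = [{'a', 'b'}, {'a'}, {'b', 'c'}, {'a', 'c'}, {'a', 'b', 'c'}] * 20  # 100 transactions

def support(sample, item):
    return sum(item in t for t in sample) / len(sample)

k, alpha = 1000, 0.90
# k bootstrap resamples, each drawn with replacement, same size as db.
thetas = sorted(support(random.choices(db, k=len(db)), 'a') for _ in range(k))
lo = thetas[int(k * (1 - alpha) / 2)]        # empirical v_{(1-alpha)/2}
hi = thetas[int(k * (1 + alpha) / 2) - 1]    # empirical v_{(1+alpha)/2}
```

The point estimate sup('a') = 0.8 sits inside the resulting interval, which narrows as the database grows.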
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 117
From Itemsets to Sequences • Itemsets: combinations of items, no temporal order • Temporal order is important in many situations
– Time-series databases and sequence databases – Frequent patterns → (frequent) sequential patterns
• Applications of sequential pattern mining – Customer shopping sequences:
• First buy computer, then iPod, and then digital camera, within 3 months.
– Medical treatment, natural disasters, science and engineering processes, stocks and markets, telephone calling patterns, Web log clickthrough streams, DNA sequences and gene structures
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 118
What Is Sequential Pattern Mining?
• Given a set of sequences, find the complete set of frequent subsequences
A sequence database A sequence : < (ef) (ab) (df) c b >
An element may contain a set of items. Items within an element are unordered and we list them alphabetically. <a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>
Given support threshold min_sup =2, <(ab)c> is a sequential pattern
SID  sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 119
Challenges in Seq Pat Mining
• A huge number of possible sequential patterns are hidden in databases
• A mining algorithm should – Find the complete set of patterns satisfying the
minimum support (frequency) threshold – Be highly efficient, scalable, involving only a
small number of database scans – Be able to incorporate various kinds of user-
specific constraints
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 120
Apriori Property of Seq Patterns
• Apriori property in sequential patterns – If a sequence S is infrequent, then none of the
super-sequences of S is frequent – E.g., <hb> is infrequent → so are <hab> and
<(ah)b>
Given support threshold min_sup =2
Seq-id  Sequence
10      <(bd)cb(ac)>
20      <(bf)(ce)b(fg)>
30      <(ah)(bf)abf>
40      <(be)(ce)d>
50      <a(bd)bcb(ade)>
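Sequence containment (with elements as sets, in order) is the operation underlying this Apriori check. A hedged sketch in Python, not from the slides, using the sample database above (names are illustrative):

```python
# Greedy left-to-right matching: each pattern element must be a subset of
# some later element of the sequence, in order.
def is_subsequence(pattern, sequence):
    it = iter(sequence)
    return all(any(elem <= s for s in it) for elem in pattern)

db = [
    [{'b', 'd'}, {'c'}, {'b'}, {'a', 'c'}],
    [{'b', 'f'}, {'c', 'e'}, {'b'}, {'f', 'g'}],
    [{'a', 'h'}, {'b', 'f'}, {'a'}, {'b'}, {'f'}],
    [{'b', 'e'}, {'c', 'e'}, {'d'}],
    [{'a'}, {'b', 'd'}, {'b'}, {'c'}, {'b'}, {'a', 'd', 'e'}],
]

def sup(pattern):
    return sum(is_subsequence(pattern, s) for s in db)
```

On this database sup(<hb>) = 1 < min_sup = 2, so the Apriori property lets us skip every super-sequence such as <hab> and <(ah)b>.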
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 121
GSP
• GSP (Generalized Sequential Pattern) mining • Outline of the method
– Initially, every item in DB is a candidate of length-1 – For each level (i.e., sequences of length-k) do
• Scan database to collect support count for each candidate sequence
• Generate candidate length-(k+1) sequences from length-k frequent sequences using Apriori
– Repeat until no frequent sequence or no candidate can be found
• Major strength: Candidate pruning by Apriori
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 122
Finding Len-1 Seq Patterns
• Initial candidates – <a>, <b>, <c>, <d>, <e>, <f>, <g>,
<h> • Scan database once
– count support for candidates
min_sup =2
Cand  Sup
<a>   3
<b>   5
<c>   4
<d>   3
<e>   3
<f>   2
<g>   1
<h>   1

Seq-id  Sequence
10      <(bd)cb(ac)>
20      <(bf)(ce)b(fg)>
30      <(ah)(bf)abf>
40      <(be)(ce)d>
50      <a(bd)bcb(ade)>
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 123
Generating Length-2 Candidates

      <a>   <b>   <c>   <d>   <e>   <f>
<a>   <aa>  <ab>  <ac>  <ad>  <ae>  <af>
<b>   <ba>  <bb>  <bc>  <bd>  <be>  <bf>
<c>   <ca>  <cb>  <cc>  <cd>  <ce>  <cf>
<d>   <da>  <db>  <dc>  <dd>  <de>  <df>
<e>   <ea>  <eb>  <ec>  <ed>  <ee>  <ef>
<f>   <fa>  <fb>  <fc>  <fd>  <fe>  <ff>

      <a>   <b>     <c>     <d>     <e>     <f>
<a>         <(ab)>  <(ac)>  <(ad)>  <(ae)>  <(af)>
<b>                 <(bc)>  <(bd)>  <(be)>  <(bf)>
<c>                         <(cd)>  <(ce)>  <(cf)>
<d>                                 <(de)>  <(df)>
<e>                                         <(ef)>
<f>

51 length-2 candidates

Without the Apriori property, 8×8 + 8×7/2 = 92 candidates; Apriori prunes 44.57% of the candidates

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 124
Finding Len-2 Seq Patterns
• Scan database one more time, collect support count for each length-2 candidate
• There are 19 length-2 candidates which pass the minimum support threshold – They are length-2 sequential patterns
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 125
Generating Length-3 Candidates and Finding Length-3 Patterns • Generate Length-3 Candidates
– Self-join length-2 sequential patterns • <ab>, <aa> and <ba> are all length-2 sequential
patterns → <aba> is a length-3 candidate • <(bd)>, <bb> and <db> are all length-2 sequential
patterns → <(bd)b> is a length-3 candidate – 46 candidates are generated
• Find Length-3 Sequential Patterns – Scan database once more, collect support
counts for candidates – 19 out of 46 candidates pass support threshold
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 126
The GSP Mining Process
Candidate lattice by level:
<a> <b> <c> <d> <e> <f> <g> <h>
<aa> <ab> … <af> <ba> <bb> … <ff> <(ab)> … <(ef)>
<abb> <aab> <aba> <baa> <bab> …
<abba> <(bd)bc> …
<(bd)cba>

1st scan: 8 cand., 6 length-1 seq. pat.
2nd scan: 51 cand., 19 length-2 seq. pat., 10 cand. not in DB at all
3rd scan: 46 cand., 19 length-3 seq. pat., 20 cand. not in DB at all
4th scan: 8 cand., 6 length-4 seq. pat.
5th scan: 1 cand., 1 length-5 seq. pat.

Some candidates cannot pass the support threshold; others do not appear in the DB at all.

min_sup = 2

Seq-id  Sequence
10      <(bd)cb(ac)>
20      <(bf)(ce)b(fg)>
30      <(ah)(bf)abf>
40      <(be)(ce)d>
50      <a(bd)bcb(ade)>
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 127
The GSP Algorithm
• Take sequences of the form <x> as length-1 candidates
• Scan the database once, find F1, the set of length-1 sequential patterns
• Let k = 1; while Fk is not empty do – Form Ck+1, the set of length-(k+1) candidates, from Fk – If Ck+1 is not empty, scan the database once, find Fk+1, the
set of length-(k+1) sequential patterns – Let k = k + 1
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 128
Bottlenecks of GSP
• A huge set of candidates – 1,000 frequent length-1 sequences generate
1000 × 1000 + (1000 × 999)/2 = 1,499,500 length-2 candidates! • Multiple scans of the database in mining • Real challenge: mining long sequential
patterns – An exponential number of short candidates – A length-100 sequential pattern needs

Σ_{i=1}^{100} C(100, i) = 2^100 − 1 ≈ 10^30

candidate sequences!
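The two bottleneck figures follow from simple counting. A minimal sketch (names are illustrative):

```python
from math import comb

# 1,000 frequent single items: ordered pairs <xy> plus unordered
# same-element pairs <(xy)> give the length-2 candidate count.
n_items = 1000
len2_candidates = n_items * n_items + n_items * (n_items - 1) // 2

# A length-100 pattern has one candidate per non-empty subsequence
# of its items: sum of C(100, i) for i = 1..100, i.e. 2**100 - 1.
subseqs_of_len100 = sum(comb(100, i) for i in range(1, 101))
```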
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 129
FreeSpan: Freq Pat-projected Sequential Pattern Mining • The itemset of a seq pat must be frequent
– Recursively project a sequence database into a set of smaller databases based on the current set of frequent patterns
– Mine each projected database to find its patterns f_list: b:5, c:4, a:3, d:3, e:3, f:2
All seq. pat. can be divided into 6 subsets: • Seq. pat. containing item f • Those containing e but no f • Those containing d but no e nor f • Those containing a but no d, e or f • Those containing c but no a, d, e or f • Those containing only item b
Sequence Database SDB < (bd) c b (ac) > < (bf) (ce) b (fg) > < (ah) (bf) a b f > < (be) (ce) d > < a (bd) b c b (ade) >
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 130
From FreeSpan to PrefixSpan
• Freespan: – Projection-based: no candidate sequence needs
to be generated – But, projection can be performed at any point in
the sequence, and the projected sequences may not shrink much
• PrefixSpan – Projection-based – But only prefix-based projection: fewer
projections and quickly shrinking sequences
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 131
Prefix and Suffix (Projection)
• <a>, <aa>, <a(ab)> and <a(abc)> are prefixes of sequence <a(abc)(ac)d(cf)>
• Given sequence <a(abc)(ac)d(cf)>
Prefix  Suffix (Prefix-Based Projection)
<a>     <(abc)(ac)d(cf)>
<aa>    <(_bc)(ac)d(cf)>
<ab>    <(_c)(ac)d(cf)>
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 132
Mining Sequential Patterns by Prefix Projections • Step 1: find length-1 sequential patterns
– <a>, <b>, <c>, <d>, <e>, <f> • Step 2: divide search space. The complete
set of seq. pat. can be partitioned into 6 subsets: – The ones having prefix <a>; – The ones having prefix <b>; – … – The ones having prefix <f>
SID  sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 133
Finding Seq. Pat. with Prefix <a>
• Only need to consider projections w.r.t. <a> – <a>-projected database: <(abc)(ac)d(cf)>,
<(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc> • Find all the length-2 seq. pat. having prefix
<a>: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af> – Further partition into 6 subsets
• Having prefix <aa>; • … • Having prefix <af>
SID  sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>
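Prefix projection can be sketched for the simplified case of sequences of single items; the slides' "(_x)" notation for partially matched elements is deliberately omitted here, and the flattened strings are an illustrative stand-in for the real database:

```python
# Project a database onto a single-item prefix: keep, for each sequence
# containing the item, only the suffix after its first occurrence.
def project(db, prefix_item):
    projected = []
    for seq in db:
        if prefix_item in seq:
            projected.append(seq[seq.index(prefix_item) + 1:])
    return projected

db = ['abcacdcf', 'adcbcae', 'efabdfcb', 'egafcbc']  # elements flattened to letters
a_proj = project(db, 'a')
```

Each projected sequence is strictly shorter than its source, which is why PrefixSpan's projected databases shrink quickly.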
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 134
Completeness of PrefixSpan
SDB
SID  sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>

Length-1 sequential patterns: <a>, <b>, <c>, <d>, <e>, <f>

Having prefix <a>:
  <a>-projected database: <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
  Length-2 sequential patterns: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
    Having prefix <aa>: <aa>-proj. db … Having prefix <af>: <af>-proj. db
Having prefix <b>: <b>-projected database …
Having prefix <c>, …, <f>: …
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 135
Efficiency of PrefixSpan
• No candidate sequence needs to be generated
• Projected databases keep shrinking • Major cost of PrefixSpan: constructing
projected databases – Can be improved by bi-level projections
Effectiveness
• Redundancy due to anti-monotonicity – <abcd> alone leads to 15 sequential patterns of
the same support – Closed sequential patterns and sequential
generators • Constraints on sequential patterns
– Gap – Length – More sophisticated, application-oriented
constraints
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 136
Sequences and Partial Orders
Sequential patterns:
CHK → MMK → MORT → RESP
CHK → MMK → MORT → BROK
CHK → RRSP → MORT → RESP
CHK → RRSP → MORT → BROK
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 137
Why Frequent Orders?
• Frequent orders capture more thorough information than sequential patterns
• Many important applications – Bioinformatics: order-preserving clustering of microarray
data – Web mining and market basket analysis: modeling
customer purchase behaviors – Network management and intrusion detection: frequent
routing paths, signatures for intrusions – Preference-based services: partial orders from ranking
data
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 138
Why Mining Orders Difficult?
• Use sequential patterns to assemble frequent partial orders? – One frequent closed partial order may
summarize a few sequential patterns – Assembling can be costly
Sequential patterns:
CHK → MMK → MORT → RESP
CHK → MMK → MORT → BROK
CHK → RRSP → MORT → RESP
CHK → RRSP → MORT → BROK
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 139
Model • A sequence s induces a full order R1; if R1 ⊇ R2,
where R2 is a partial order, then R1 is said to support R2
• The support of a partial order R in a sequence database is the number of sequences supporting R in the database
• An order R is closed if there exists no order R' ⊃ R with sup(R) = sup(R')
• Given a minimum support threshold, order R is a frequent closed partial order if it is closed and passes the support threshold
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 140
Ideas
• Depth-first search to generate frequent closed partial orders in transitive reduction – Transitive reduction is a succinct representation
of partial orders • Pruning infrequent items, edges and partial
orders • Pruning forbidden edges • Extracting transitive reductions of frequent
partial orders directly Jian Pei: Big Data Analytics -- Frequent Pattern Mining 141
Interesting Orders
Jian Pei: Big Data Analytics -- Frequent Pattern Mining 142