An Efficient Polynomial Delay Algorithm for

Pseudo Frequent Itemset Mining

2/Oct/2007 Discovery Science 2007

Takeaki Uno (National Institute of Informatics)

Hiroki Arimura (Hokkaido University)

Frequent Pattern Mining

• The problem of finding all frequently appearing patterns in a (large-scale) database

  database: transaction, tree, string, graph, vector
  pattern: subset, tree, path, sequence, graph, geograph…

[Figure: two example databases and the patterns mined from them. Experiment results (ex1 to ex4, each marked ● or ▲) yield patterns such as {ex1●, ex3▲}, {ex2●, ex4●}, {ex2●, ex3▲, ex4●}, {ex2▲, ex3▲}, …; genome information (DNA sequences) yields patterns such as ATGCAT, CCCGGGTAA, GGCGTTA, ATAAGGG, …]

This Research

• We address transaction databases

  transaction database: each record (transaction) T of the database D is a subset of the itemset E, i.e., ∀T ∈ D, T ⊆ E

  frequent itemset: a subset of E included in at least σ transactions (σ is the minimum support threshold)

• Problems

  - there are too many patterns to pick out the valuable ones
  - inclusion is strict, so it cannot deal with errors

  "Patterns ambiguously included in many transactions" are important

We introduce an ambiguous inclusion, and propose an efficient mining algorithm

Related Works

• Frequent itemset mining with such ambiguity goes by several names:
  fault-tolerant pattern, degenerate pattern, soft occurrence

  - ambiguity for inclusion: "a pattern is included if the ratio of included items is more than the threshold"
  - another approach: find combinations of an itemset and a transaction set such that few pairs of item and transaction violate the inclusion relation
  - similarity is used for string matching and homology search

• There is little "enumeration type" research with completeness

We look at practical models and algorithms from the viewpoint of algorithm theory

Notations for F.I.M.

• For an itemset K,

  occurrence of K: a transaction of D including K
  Occ(K): occurrence set of K, the set of occurrences of K
  frq(K): frequency of K, the size of Occ(K)

  D = { {1,2,5,6,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,7,9}, {2,7,9}, {2} }

  Occ({1,2}) = { {1,2,5,6,7,9}, {1,2,7,8,9} }
  Occ({2,7,9}) = { {1,2,5,6,7,9}, {1,2,7,8,9}, {2,7,9} }
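The notation above can be checked mechanically. The following is a minimal Python sketch, not the authors' code: the function names `occ` and `frq` simply mirror the slide's Occ and frq, and the database is held as a plain list of sets.

```python
# Example database D from the slide, one set per transaction.
D = [
    {1, 2, 5, 6, 7, 9},
    {2, 3, 4, 5},
    {1, 2, 7, 8, 9},
    {1, 7, 9},
    {2, 7, 9},
    {2},
]

def occ(itemset, database):
    """Occ(K): the transactions of the database that include every item of K."""
    return [t for t in database if itemset <= t]

def frq(itemset, database):
    """frq(K): the number of occurrences of K."""
    return len(occ(itemset, database))

print(occ({1, 2}, D))     # the two transactions containing both 1 and 2
print(frq({2, 7, 9}, D))  # size of Occ({2,7,9})
```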

Frequent Itemset

• Frequent itemset: an itemset with frequency no less than σ
  (σ is called the minimum support (threshold))

Ex.) D = { {1,2,5,6,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,7,9}, {2,7,9}, {2} }

  Itemsets included in no less than 3 transactions:
  {1} {2} {7} {9} {1,7} {1,9} {2,7} {2,9} {7,9} {1,7,9} {2,7,9}

Frequent itemset mining: the problem of enumerating all frequent itemsets for a given database D and minimum support σ
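As a sanity check of the example, the non-empty frequent itemsets can be listed by brute force. This is only an illustrative sketch (the helper name `frequent_itemsets` is made up here); it tries every candidate and is exponential in the number of items, so it is not the mining algorithm discussed in the talk.

```python
from itertools import combinations

D = [
    {1, 2, 5, 6, 7, 9},
    {2, 3, 4, 5},
    {1, 2, 7, 8, 9},
    {1, 7, 9},
    {2, 7, 9},
    {2},
]
SIGMA = 3  # minimum support

def frequent_itemsets(database, sigma):
    """Brute force: test every non-empty itemset over the items of the database."""
    items = sorted(set().union(*database))
    result = []
    for size in range(1, len(items) + 1):
        for cand in combinations(items, size):
            c = set(cand)
            if sum(1 for t in database if c <= t) >= sigma:
                result.append(cand)
    return result

FREQUENT = frequent_itemsets(D, SIGMA)
print(FREQUENT)
```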

Inclusion with Ambiguity

• Ambiguous inclusion relation for an itemset P and a transaction T

• Popular definition: |P∩T| / |P| ≥ θ for a threshold θ < 1

  - monotonicity of frequent itemsets is lost
  - there can be a frequent itemset such that all of its subsets are infrequent
  - computation is costly

  For θ = 0.6:
  {1,2,3} is included in {1,2,4,5} (ratio 2/3)
  {1,2,3,4,5,6,7} is included in {1,3,5,6,7} (ratio 5/7)
  {1,2,3} is not included in {1,4,5} (ratio 1/3)

  Example of lost monotonicity (θ = 0.6): for the transactions {1,2}, {2,3}, {1,3}, the itemset {1,2,3} is included in all three, but none of its non-empty proper subsets is
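The loss of monotonicity under the ratio-based definition can be reproduced in a few lines. A small sketch under the definition |P∩T| / |P| ≥ θ; the names `ratio_included` and `ratio_frq` are illustrative only.

```python
def ratio_included(P, T, theta):
    """Ratio-based ambiguous inclusion: at least a fraction theta of P lies in T."""
    return len(P & T) / len(P) >= theta

def ratio_frq(P, database, theta):
    """Number of transactions that include P under the ratio-based definition."""
    return sum(1 for t in database if ratio_included(P, t, theta))

THETA = 0.6
D3 = [{1, 2}, {2, 3}, {1, 3}]

print(ratio_frq({1, 2, 3}, D3, THETA))  # the full itemset
print(ratio_frq({1, 2}, D3, THETA))     # a 2-item subset
print(ratio_frq({1}, D3, THETA))        # a 1-item subset
```

With a minimum support of 3, {1,2,3} is frequent in this toy database while every non-empty proper subset is not, which is why a plain backtrack search cannot prune on infrequent subsets here.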

k-pseudo Inclusion

• Use a threshold on the number of non-included items:

  k-pseudo inclusion: |P \ T| ≤ k for a threshold k ≥ 0
  (k-pseudo occurrence, k-pseudo occurrence set and k-pseudo frequency are defined accordingly)

  - monotonicity is kept
  - able to find characterizations such as "many transactions include at least 3 items of P"

  For k = 1:
  {1,2,3} is included in {1,2,4,5} (one item missing)
  {1,2,3,4,5,6,7} is not included in {1,3,5,6,7} (two items missing)
  {1,2,3} is not included in {1,4,5} (two items missing)
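The k-pseudo inclusion test is a one-line set difference. The sketch below (function names are illustrative, not from the talk) checks the three examples for k = 1 and spot-checks monotonicity: removing an item from P can only shrink P \ T, so the k-pseudo frequency never decreases.

```python
from itertools import combinations

def pseudo_included(P, T, k):
    """k-pseudo inclusion: at most k items of P are missing from T."""
    return len(P - T) <= k

def pseudo_frq(P, database, k):
    """k-pseudo frequency: number of transactions that k-pseudo include P."""
    return sum(1 for t in database if pseudo_included(P, t, k))

D = [{1, 2, 5, 6, 7, 9}, {2, 3, 4, 5}, {1, 2, 7, 8, 9}, {1, 7, 9}, {2, 7, 9}, {2}]

def is_monotone_on(P, database, k):
    """Check frq_k(Q) >= frq_k(P) for every subset Q of P."""
    base = pseudo_frq(P, database, k)
    return all(
        pseudo_frq(set(q), database, k) >= base
        for size in range(len(P) + 1)
        for q in combinations(sorted(P), size)
    )

print(pseudo_included({1, 2, 3}, {1, 2, 4, 5}, 1))
print(is_monotone_on({1, 2, 7, 9}, D, 1))
```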

k-Pseudo Frequent Itemset

• k-pseudo frequent itemset: an itemset k-pseudo included in at least σ transactions of D

  D = { {1,2,5,6,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,7,9}, {2,7,9}, {2} }

  1-pseudo frequent itemsets for σ = 3 (sizes 3 and 4 shown):
  {1,2,3} {1,2,4} {1,2,5} {1,2,7} {1,2,9} {1,3,7} {1,3,9} {1,4,7} {1,4,9} {1,5,7} {1,5,9} {1,6,7} {1,6,9} {1,7,8} {1,7,9} {1,8,9}
  {2,3,7} {2,3,9} {2,4,7} {2,4,9} {2,5,7} {2,5,8} {2,5,9} {2,6,7} {2,6,9} {2,7,8} {2,7,9} {2,8,9}
  {3,7,9} {4,7,9} {5,7,9} {6,7,9} {7,8,9}
  {1,2,7,9} {1,3,7,9} {1,4,7,9} {1,5,7,9} {1,6,7,9} {1,7,8,9} {2,3,7,9} {2,4,7,9} {2,5,7,9} {2,6,7,9} {2,7,8,9}

Many trivial patterns. How to enumerate them efficiently?

Enumeration using Monotonicity

• Pseudo frequent itemsets have the monotone property, so a simple backtrack algorithm works

• For each k-pseudo frequent itemset P, compute the k-pseudo frequency of each P+e

• If the k-pseudo frequency of P+e is no less than σ, make a recursive call to enumerate the k-pseudo frequent itemsets including P+e

[Figure: itemset lattice over the items 1 to 4, from φ (000…0) at the bottom to {1,2,3,4} (111…1) at the top, with the frequent region marked]

Polynomial time enumeration

How to compute efficiently?
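The backtrack scheme described above can be sketched as follows. This is a naive illustration rather than the authors' implementation: it recomputes each k-pseudo frequency by scanning the whole database, and it adds items in increasing order (tail extension) so that each itemset is generated once. The names `backtrack` and `enumerate_pseudo_frequent` are assumptions made for this example.

```python
def pseudo_frq(P, database, k):
    """k-pseudo frequency: transactions missing at most k items of P."""
    return sum(1 for t in database if len(P - t) <= k)

def backtrack(P, start, items, database, k, sigma, out):
    """Output P, then try to extend it with each item after position `start`."""
    out.append(frozenset(P))
    for i in range(start, len(items)):
        Q = P | {items[i]}
        # Monotonicity: if Q is infrequent, no superset of Q can be frequent.
        if pseudo_frq(Q, database, k) >= sigma:
            backtrack(Q, i + 1, items, database, k, sigma, out)

def enumerate_pseudo_frequent(database, k, sigma):
    items = sorted(set().union(*database))
    out = []
    if pseudo_frq(set(), database, k) >= sigma:
        backtrack(set(), 0, items, database, k, sigma, out)
    return out

D = [{1, 2, 5, 6, 7, 9}, {2, 3, 4, 5}, {1, 2, 7, 8, 9}, {1, 7, 9}, {2, 7, 9}, {2}]
RESULT = enumerate_pseudo_frequent(D, k=1, sigma=3)
print(sorted(sorted(p) for p in RESULT if len(p) >= 3))
```

The itemsets of size 3 and 4 printed here can be compared with the list on the k-pseudo frequent itemset slide; smaller itemsets are also produced, which is the "many trivial patterns" issue.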

Computing k-Pseudo Occurrences

• Define Occ_{=h}(P) = { T ∈ D : |P \ T| = h }, the set of transactions missing exactly h items of P

  Occ_{≤k}(P) = ∪_{h≤k} Occ_{=h}(P)

• Occ_{=h}(P ∪ e) = (Occ_{=h}(P) ∩ Occ(e)) ∪ (Occ_{=h-1}(P) \ Occ(e))

  the pseudo occurrence sets are updated by taking intersections

• Compute Occ_{=h}(P) ∩ Occ(e) for all pairs of e and h

[Figure: the transactions of a database grouped into the sets Occ0, Occ1 and Occ2 of transactions missing 0, 1 and 2 items of a pattern P]
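The update rule for the sets of transactions missing exactly h items can be written directly with set operations. A minimal sketch, assuming transactions are referred to by their index in the database and that the added item e is not already in P; `occ_exact` and `update` are names chosen for this example.

```python
def occ_exact(P, database, h):
    """Ids of the transactions missing exactly h items of P."""
    return {i for i, t in enumerate(database) if len(P - t) == h}

def update(buckets, occ_e, k):
    """Given buckets[h] for P (h = 0..k) and Occ(e), return the buckets for P + e.

    A transaction stays in bucket h if it contains e, and moves from
    bucket h-1 to bucket h if it does not. Assumes e is not in P.
    """
    new = []
    for h in range(k + 1):
        stay = buckets[h] & occ_e
        moved = buckets[h - 1] - occ_e if h > 0 else set()
        new.append(stay | moved)
    return new

D = [{1, 2, 5, 6, 7, 9}, {2, 3, 4, 5}, {1, 2, 7, 8, 9}, {1, 7, 9}, {2, 7, 9}, {2}]
K = 2
P, e = {1, 2}, 7
buckets = [occ_exact(P, D, h) for h in range(K + 1)]
updated = update(buckets, occ_exact({e}, D, 0), K)
print(updated)
```

Transactions that would move past bucket k are simply dropped, which matches the fact that they no longer k-pseudo include the extended itemset.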

Taking Intersections Efficiently

• Occ_{=h}(P ∪ e) = (Occ_{=h}(P) ∩ Occ(e)) ∪ (Occ_{=h-1}(P) \ Occ(e))

  - this has the same properties as usual occurrences
  - many existing techniques for updating occurrence sets can be used (down project, delivery, bitmap…)

• Database reduction (FP-tree) is also available

• In deeper levels of the recursion, few transactions need to be scanned, so the computation is fast

  Transactions:
  A: 1,2,5,6,7,9
  B: 2,3,4,5
  C: 1,2,7,8,9
  D: 1,7,9
  E: 2,7,9
  F: 2

  Occurrences of each item:
  1: A,C,D
  2: A,B,C,E,F
  3: B
  4: B
  5: A,B
  6: A
  7: A,C,D,E
  8: C
  9: A,C,D,E
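The second table is the first one transposed: for each item, the list of transactions containing it. A short sketch of that transposition, with the transaction labels A to F taken from the slide and the function name `item_occurrences` made up for illustration:

```python
from collections import defaultdict

DB = {
    "A": {1, 2, 5, 6, 7, 9},
    "B": {2, 3, 4, 5},
    "C": {1, 2, 7, 8, 9},
    "D": {1, 7, 9},
    "E": {2, 7, 9},
    "F": {2},
}

def item_occurrences(db):
    """Map each item to the sorted list of ids of the transactions containing it."""
    table = defaultdict(list)
    for tid in sorted(db):
        for item in sorted(db[tid]):
            table[item].append(tid)
    return dict(table)

TABLE = item_occurrences(DB)
for item in sorted(TABLE):
    print(item, ":", ",".join(TABLE[item]))
```

With this layout, Occ(e) for a single item is a table lookup, so intersecting it with a current occurrence set only touches the transactions already in that set.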

Using Bottom-wideness

• Backtrack (depth-first search) generates several recursive calls in each iteration

  - the computation tree spreads exponentially as we go down
  - the computation time is dominated by the bottom-level iterations of the recursion tree

Amortized computation time is reduced to that of the bottom levels

[Figure: recursion tree; iterations near the root take a long time and iterations near the bottom take a short time, since few occurrences have to be computed in the lower levels]

For Large Minimum Support

• When σ is large, we access many transactions even on the bottom levels

  - the improvement by bottom-wideness is not drastic

• Reduce the database to speed up the bottom levels

  (1) delete the items less than the maximum item in P
  (2) delete the items that are infrequent in the occurrence set database
      (since they are never added in the recursive calls)
  (3) unify identical transactions

• In practice, the database size is constant in the bottom levels

  - no big difference from the case of small σ

Example: P = {1,3}, k = 1, σ = 4

  1 3 5
  1 2 3 4 6
  1 7
  2 3 4 6 7
  3 4 5 6 7
  2 3 4 6 7
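One possible reading of the three reduction steps, applied to the example with P = {1,3}, k = 1, σ = 4, is sketched below. Several details are assumptions of this sketch rather than statements from the slides: each remaining transaction carries the number of items of P it already misses, an item e counts as frequent if P+e would still be k-pseudo frequent, and only transactions with the same missing count are unified. The function name `reduce_database` is hypothetical.

```python
from collections import Counter

def reduce_database(database, P, k, sigma):
    """Return {(remaining items, items of P missed): multiplicity} for pattern P."""
    max_item = max(P)
    # Keep only k-pseudo occurrences of P; step (1): drop items <= max(P).
    rows = []
    for t in database:
        missing = len(P - t)
        if missing <= k:
            rows.append((frozenset(x for x in t if x > max_item), missing))
    # Step (2): P+e is k-pseudo included by every row with budget left
    # (missing < k) and by rows with missing == k only if they contain e.
    slack = sum(1 for _, m in rows if m < k)
    tight = Counter(e for items, m in rows if m == k for e in items)
    present = set().union(*(items for items, _ in rows))
    keep = {e for e in present if slack + tight[e] >= sigma}
    # Step (3): unify rows that became identical (same items, same missing count).
    return dict(Counter((items & keep, m) for items, m in rows))

DB = [{1, 3, 5}, {1, 2, 3, 4, 6}, {1, 7}, {2, 3, 4, 6, 7}, {3, 4, 5, 6, 7}, {2, 3, 4, 6, 7}]
REDUCED = reduce_database(DB, P={1, 3}, k=1, sigma=4)
for (items, missing), count in REDUCED.items():
    print(sorted(items), "missing:", missing, "x", count)
```

In this example item 5 is dropped in step (2), after which the last three transactions coincide and collapse into a single row with multiplicity 3.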

Small & Trivial Patterns

• Under the k-pseudo inclusion, any itemset of size no more than k is included in any transaction

• Itemsets of size a bit greater than k are also included in many transactions

  - many small and trivial frequent itemsets

• We want to ignore these itemsets in practice

  - consider the problem of directly finding pseudo frequent itemsets of size l

Directly Finding Large Itemset

• Searching all itemsets of size l needs exponential time

  - pruning unnecessary search is crucial
  - take candidates according to partial structure

• Let P be a k-pseudo frequent itemset of size l

• WLOG, P = {1,…,l}, with the items sorted in increasing order of |Occ_{=k}(P) \ Occ({e})|

• Consider the (k-1)-pseudo frequency of the itemset {1,…,y}

• Any transaction in Occ_{=k}(P) \ Occ({e}), e > y, (k-1)-pseudo includes {1,…,y}

Search Route to Itemset of Size l

• Any transaction in Occ_{=k}(P) \ Occ({e}), e > y, (k-1)-pseudo includes {1,…,y}

  |Occ_{≤k-1}({1,…,y})| ≥ |∪_{e=y+1,…,|P|} (Occ_{=k}(P) \ Occ({e}))|

• The average of |Occ_{=k}(P) \ Occ({e})| over e ∈ P is no less than (k / |P|) |Occ_{=k}(P)|

• The items 1,…,|P| are sorted in increasing order of |Occ_{=k}(P) \ Occ({e})|, so

  |Occ_{≤k-1}({1,…,y})| ≥ |Occ_{=k}(P)| × (|P|-y) / |P|     (partial frequency condition)

There is a sequence of itemsets from the empty set to P composed only of itemsets satisfying the partial frequency condition
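The chain of inequalities behind the partial frequency condition can be spelled out as follows. This is a reconstruction from the statements on the slide, assuming k ≥ 1 and the items of P sorted in increasing order of |Occ_{=k}(P) \ Occ({e})|; the intermediate 1/k step is not shown on the slide and is filled in here.

```latex
% A transaction in Occ_{=k}(P) \setminus Occ(\{e\}) with e > y misses k items of P,
% one of which (e) lies outside \{1,\dots,y\}, so it misses at most k-1 items of \{1,\dots,y\}.
\begin{align*}
\bigl|\mathrm{Occ}_{\le k-1}(\{1,\dots,y\})\bigr|
  &\ge \Bigl|\bigcup_{e=y+1}^{|P|}
        \bigl(\mathrm{Occ}_{=k}(P)\setminus \mathrm{Occ}(\{e\})\bigr)\Bigr| \\
  % each transaction of Occ_{=k}(P) misses exactly k items, so it lies in at most k of the sets
  &\ge \frac{1}{k}\sum_{e=y+1}^{|P|}
        \bigl|\mathrm{Occ}_{=k}(P)\setminus \mathrm{Occ}(\{e\})\bigr| \\
  % the |P|-y largest terms sum to at least (|P|-y) times the average, which is (k/|P|)|Occ_{=k}(P)|
  &\ge \frac{1}{k}\,(|P|-y)\,\frac{k}{|P|}\,\bigl|\mathrm{Occ}_{=k}(P)\bigr|
   \;=\; \bigl|\mathrm{Occ}_{=k}(P)\bigr|\cdot\frac{|P|-y}{|P|}.
\end{align*}
```

The average used in the last step follows from counting: every transaction of Occ_{=k}(P) contributes to exactly k of the sets Occ_{=k}(P) \ Occ({e}), so the |P| terms sum to k |Occ_{=k}(P)|.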

Example for Partial Frequency Condition

• Itemsets satisfying the partial frequency condition, for k=1, σ=3, l=3

  D = { {1,2,5,6,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,7,9}, {2,7,9}, {2} }

  1-pseudo frequent itemsets satisfying the partial frequency condition:
  {1} {2} {5} {7} {9}
  {1,2} {1,5} {1,6} {1,7} {1,8} {1,9} {2,3} {2,4} {2,5} {2,6} {2,7} {2,8} {2,9} {3,5} {4,5} {5,6} {5,7} {5,9} {6,7} {6,9} {7,8} {7,9} {8,9}

The number of frequent itemsets to be searched decreases, so an efficient search is expected
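The example can be reproduced with a short level-wise search. The sketch assumes one reading of the partial frequency condition, namely that an intermediate itemset Q qualifies when frq_{k-1}(Q) ≥ σ × (l − |Q|) / l, and that an itemset is only reached by adding one item to a qualifying itemset. Both the threshold form and the names `pfc` and `pfc_levels` are assumptions of this sketch.

```python
def pseudo_frq(Q, database, k):
    """k-pseudo frequency of Q (k = 0 gives the ordinary frequency)."""
    return sum(1 for t in database if len(Q - t) <= k)

def pfc(Q, database, k, sigma, l):
    """Assumed partial frequency condition: frq_{k-1}(Q) >= sigma * (l - |Q|) / l."""
    return pseudo_frq(Q, database, k - 1) * l >= sigma * (l - len(Q))

def pfc_levels(database, k, sigma, l):
    """levels[s] = itemsets of size s reachable through qualifying itemsets only."""
    items = sorted(set().union(*database))
    levels = [{frozenset()}]
    for _ in range(1, l):
        nxt = {
            Q | {e}
            for Q in levels[-1]
            for e in items
            if e not in Q and pfc(Q | {e}, database, k, sigma, l)
        }
        levels.append(nxt)
    return levels

D = [{1, 2, 5, 6, 7, 9}, {2, 3, 4, 5}, {1, 2, 7, 8, 9}, {1, 7, 9}, {2, 7, 9}, {2}]
LEVELS = pfc_levels(D, k=1, sigma=3, l=3)
print(sorted(sorted(q) for q in LEVELS[1]))
print(sorted(sorted(q) for q in LEVELS[2]))
```

Under this reading, {3,4} occurs in one transaction but is not reached, because neither {3} nor {4} qualifies at size 1, whereas {3,5} is reached through {5}.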

Restricted Search Route by P.F.C.

• Any k-pseudo frequent itemset of size l can be found by passing only through itemsets satisfying the partial frequency condition

  Let's do a backtrack search

• There always exists an item whose removal yields an itemset satisfying the condition

• Tail extension is not available
  (removal of the tail may violate the condition)

• Simple hill climbing generates duplicates

• So, use a generation rule to avoid duplicates (reverse search)

Reverse Search for P.F.C.

• Rule: generate an itemset P from the P \ {e} that maximizes |Occ_{≤k-1}(P \ {e})|
  (ties are broken by choosing the minimum index)

ReverseSearch (P)
1. if |P| = l then output P; return;
2. for each e ∉ P do
     if P+e is a k-pseudo frequent itemset satisfying P.F.C. then
       if e maximizes |Occ_{≤k-1}((P+e) \ {e'})| among e' ∈ P+e then ReverseSearch (P+e)
3. end for

• |Occ_{≤k-1}(P \ {e})| can be efficiently computed by existing methods

O(|P| × ||D||) time for one iteration
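The generation rule assigns every itemset exactly one parent, which is what lets reverse search avoid duplicates. The sketch below implements only that rule, not the full search; it assumes the parent of P is P \ {e} for the item e maximizing the (k-1)-pseudo frequency of P \ {e}, with ties broken toward the smallest item. The names `parent` and `is_child` are illustrative.

```python
def pseudo_frq(Q, database, k):
    """k-pseudo frequency of Q (k = 0 gives the ordinary frequency)."""
    return sum(1 for t in database if len(Q - t) <= k)

def parent(P, database, k):
    """Return (P minus e, e) for the item e maximizing frq_{k-1}(P minus e).

    max() keeps the first maximum, so scanning the items in increasing
    order breaks ties toward the minimum item.
    """
    P = frozenset(P)
    best = max(sorted(P), key=lambda e: pseudo_frq(P - {e}, database, k - 1))
    return P - {best}, best

def is_child(P, e, database, k):
    """True if P + e would be generated from P under the rule (e not in P)."""
    return parent(frozenset(P) | {e}, database, k) == (frozenset(P), e)

D = [{1, 2, 5, 6, 7, 9}, {2, 3, 4, 5}, {1, 2, 7, 8, 9}, {1, 7, 9}, {2, 7, 9}, {2}]
print(parent({1, 7, 9}, D, k=1))
print(is_child({2, 5}, 8, D, k=1))
```

A search that recurses from P to P+e only when `is_child(P, e, ...)` holds visits each itemset at most once, since each itemset has a single parent.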

Conclusion

• Introduced an ambiguous inclusion relation under which at most k items of the pattern are not included

• Pseudo frequent itemset mining under this inclusion (monotonicity, intersection, many small and trivial patterns)

• Reverse search for directly finding frequent itemsets of a fixed size

Future works

• implementation and experiments

• extension of the technique to other pattern mining

• approach to inclusion with "ratio r %"
