An Efficient Polynomial Delay Algorithm for
Pseudo Frequent Itemset Mining
2/Oct/2007 Discovery Science 2007
Takeaki Uno (National Institute of Informatics)
Hiroki Arimura (Hokkaido University)
Frequent Pattern Mining
• Problem of finding all frequently appearing patterns in a (large-scale) database
database: transaction, tree, string, graph, vector
pattern: subset, tree, path, sequence, graph, geograph…
[Figure: two example databases (results of experiments ex1 to ex4 marked with ● and ▲, and genome sequences) with patterns found in them, e.g. {ex1●, ex3▲}, {ex2●, ex4●}, {ex2●, ex3▲, ex4●}, {ex2▲, ex3▲}, ... and ATGCAT, CCCGGGTAA, GGCGTTA, ATAAGGG, ...]
This Research
• We address transaction databases
transaction database: a database D in which each record (transaction) T is a subset of the itemset E, i.e., ∀T ∈ D, T ⊆ E
frequent itemset: a subset of E included in at least σ transactions (σ: minimum support threshold)
• Problems
- too many patterns are found, which makes it hard to find the valuable ones
- inclusion is too strict to deal with errors
"Patterns ambiguously included in many transactions" are important
We introduce an ambiguous inclusion, and propose an efficient mining algorithm
Related Works
• Frequent itemset mining with such ambiguity has been studied under the names
fault-tolerant pattern, degenerate pattern, soft occurrence
- ambiguity for inclusion: "a pattern is included if the ratio of included items is more than the threshold"
- another approach: find combinations of an itemset and a transaction set such that few pairs of item and transaction violate the inclusion relation
- similarity is used for string matching and homology search
• Little "enumeration type" research with completeness
We look at practical models and algorithms from the viewpoint of algorithm theory
Notations for F.I.M.
• For an itemset K,
occurrence of K: a transaction of D including K
Occ(K) (occurrence set of K): the set of occurrences of K
frq(K) (frequency of K): the size of Occ(K)
Ex.) D = { {1,2,5,6,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,7,9}, {2,7,9}, {2} }
Occ( {1,2} ) = { {1,2,5,6,7,9}, {1,2,7,8,9} }
Occ( {2,7,9} ) = { {1,2,5,6,7,9}, {1,2,7,8,9}, {2,7,9} }
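The notation above can be tried out directly on the example database. The following is a minimal Python sketch; the function names `occ` and `frq` simply mirror the slide's Occ and frq, and representing transactions as Python sets is an assumption made for illustration.

```python
# Example database D from the slide; each transaction is a set of items.
D = [
    {1, 2, 5, 6, 7, 9},
    {2, 3, 4, 5},
    {1, 2, 7, 8, 9},
    {1, 7, 9},
    {2, 7, 9},
    {2},
]

def occ(K, database):
    """Occ(K): the transactions of the database that include every item of K."""
    K = set(K)
    return [T for T in database if K <= T]

def frq(K, database):
    """frq(K): the number of occurrences of K."""
    return len(occ(K, database))

print(len(occ({1, 2}, D)))  # 2
print(frq({2, 7, 9}, D))    # 3
```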
Frequent Itemset
• Frequent itemset: an itemset with frequency no less than σ
( σ is called the minimum support (threshold) )
Ex.) D = { {1,2,5,6,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,7,9}, {2,7,9}, {2} }
Itemsets included in no less than 3 transactions:
{1} {2} {7} {9} {1,7} {1,9} {2,7} {2,9} {7,9} {1,7,9} {2,7,9}
Frequent itemset mining: the problem of enumerating all frequent itemsets for a given database D and minimum support σ
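As a sanity check of the example, the frequent itemsets for σ = 3 can be listed by brute force. This sketch tests every nonempty subset of the items, which is exponential and only meant for this tiny database; it is not the mining algorithm of the talk, and the function name `frequent_itemsets` is chosen here for illustration.

```python
from itertools import combinations

D = [
    {1, 2, 5, 6, 7, 9},
    {2, 3, 4, 5},
    {1, 2, 7, 8, 9},
    {1, 7, 9},
    {2, 7, 9},
    {2},
]

def frequent_itemsets(database, sigma):
    """All nonempty itemsets included in at least sigma transactions (brute force)."""
    items = sorted(set().union(*database))
    result = []
    for size in range(1, len(items) + 1):
        for K in combinations(items, size):
            # Frequency of K: number of transactions that contain all of K.
            if sum(1 for T in database if set(K) <= T) >= sigma:
                result.append(K)
    return result

print(frequent_itemsets(D, 3))
# [(1,), (2,), (7,), (9,), (1, 7), (1, 9), (2, 7), (2, 9), (7, 9), (1, 7, 9), (2, 7, 9)]
```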
Inclusion with Ambiguity
• Ambiguous inclusion relation for an itemset P and a transaction T
• Popular definition: |P∩T| / |P| ≧ θ for a threshold θ < 1
- monotonicity of frequent itemsets is lost:
there can be a frequent itemset none of whose nonempty proper subsets is frequent
- computation is costly
Ex.) for θ = 0.6:
{1,2,3} ⊆ {1,2,4,5}  (2/3 ≧ 0.6)
{1,2,3,4,5,6,7} ⊆ {1,3,5,6,7}  (5/7 ≧ 0.6)
{1,2,3} ⊄ {1,4,5}  (1/3 < 0.6)
Ex.) θ = 0.6, D = { {1,2}, {2,3}, {1,3} }: {1,2,3} is included in all three transactions, but none of its nonempty proper subsets is
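The loss of monotonicity under the ratio-based definition can be checked numerically on the three-transaction example. This is a small Python sketch; the names `ratio_included` and `ratio_frequency` are hypothetical helpers introduced for this illustration.

```python
from itertools import combinations

def ratio_included(P, T, theta):
    """Ratio-based inclusion: |P ∩ T| / |P| >= theta (P must be nonempty)."""
    P, T = set(P), set(T)
    return len(P & T) / len(P) >= theta

def ratio_frequency(P, database, theta):
    """Number of transactions that include P under the ratio-based definition."""
    return sum(1 for T in database if ratio_included(P, T, theta))

D_small = [{1, 2}, {2, 3}, {1, 3}]
theta = 0.6

print(ratio_frequency({1, 2, 3}, D_small, theta))  # 3
# Every nonempty proper subset has a strictly smaller frequency.
for size in (1, 2):
    for Q in combinations((1, 2, 3), size):
        print(Q, ratio_frequency(Q, D_small, theta))
```

With σ = 3 the itemset {1,2,3} is frequent while each of its nonempty proper subsets is not, so a search that prunes on infrequent subsets would miss it.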
k-pseudo Inclusion
• Use a threshold on the number of non-included items:
k-pseudo inclusion: |P \ T| ≦ k for a threshold k ≧ 0
( likewise k-pseudo occurrence, k-pseudo occurrence set, k-pseudo frequency )
- monotonicity is kept
- able to find characterizations such as
"many transactions include at least 3 items of P"
Ex.) for k = 1:
{1,2,3} ⊆ {1,2,4,5}  (1 item missing)
{1,2,3,4,5,6,7} ⊄ {1,3,5,6,7}  (2 items missing)
{1,2,3} ⊄ {1,4,5}  (2 items missing)
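The k-pseudo inclusion test is a one-line set difference. The sketch below (function name `pseudo_included` is an illustrative choice) evaluates the three examples for k = 1.

```python
def pseudo_included(P, T, k):
    """k-pseudo inclusion: at most k items of P are missing from T."""
    return len(set(P) - set(T)) <= k

print(pseudo_included({1, 2, 3}, {1, 2, 4, 5}, 1))                 # True
print(pseudo_included({1, 2, 3, 4, 5, 6, 7}, {1, 3, 5, 6, 7}, 1))  # False
print(pseudo_included({1, 2, 3}, {1, 4, 5}, 1))                    # False
```

Monotonicity follows because removing items from P can only shrink P \ T: if Q ⊆ P then |Q \ T| ≦ |P \ T| ≦ k.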
k-Pseudo Frequent Itemset
• k-pseudo frequent itemset: an itemset k-pseudo included in at least σ transactions of D
Ex.) D = { {1,2,5,6,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,7,9}, {2,7,9}, {2} }
1-pseudo frequent itemsets of size 3 or more for σ=3:
{1,2,3} {1,2,4} {1,2,5} {1,2,7} {1,2,9} {1,3,7} {1,3,9} {1,4,7} {1,4,9} {1,5,7} {1,5,9} {1,6,7} {1,6,9} {1,7,8} {1,7,9} {1,8,9} {2,3,7} {2,3,9} {2,4,7} {2,4,9} {2,5,7} {2,5,8} {2,5,9} {2,6,7} {2,6,9} {2,7,8} {2,7,9} {2,8,9} {3,7,9} {4,7,9} {5,7,9} {6,7,9} {7,8,9} {1,2,7,9} {1,3,7,9} {1,4,7,9} {1,5,7,9} {1,6,7,9} {1,7,8,9} {2,3,7,9} {2,4,7,9} {2,5,7,9} {2,6,7,9} {2,7,8,9}
Many trivial patterns. How to enumerate them efficiently?
Enumeration using Monotonicity
• Pseudo frequent itemsets have the monotone property, thereby a simple backtrack algorithm works
• For each k-pseudo frequent itemset P, compute the k-pseudo frequency of each P+e
• If the k-pseudo frequency of P+e is no less than σ, make a recursive call to enumerate the k-pseudo frequent itemsets including P+e
[Figure: lattice of the subsets of {1,2,3,4} between φ (000…0) and {1,2,3,4} (111…1), with the frequent region marked "freq"]
Polynomial time enumeration
How to compute efficiently?
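The backtrack enumeration described above can be sketched in a few lines of Python. Two points are assumptions of this sketch rather than statements from the slides: items are positive integers, and duplicates are avoided by the usual tail extension (only items larger than the current maximum are added), which is valid here because monotonicity guarantees that removing the largest item of a pseudo frequent itemset leaves a pseudo frequent itemset. Frequencies are recomputed from scratch for clarity, not speed.

```python
D = [
    {1, 2, 5, 6, 7, 9},
    {2, 3, 4, 5},
    {1, 2, 7, 8, 9},
    {1, 7, 9},
    {2, 7, 9},
    {2},
]

def pseudo_frequency(P, database, k):
    """Number of transactions that miss at most k items of P."""
    return sum(1 for T in database if len(P - T) <= k)

def backtrack(P, items, database, k, sigma, out):
    """Report P, then extend it by every item larger than its maximum item."""
    out.append(P)
    tail = max(P, default=0)  # assumes items are positive integers
    for e in items:
        if e > tail:
            Q = P | {e}
            if pseudo_frequency(Q, database, k) >= sigma:
                backtrack(Q, items, database, k, sigma, out)

items = sorted(set().union(*D))
found = []
backtrack(frozenset(), items, D, k=1, sigma=3, out=found)
large = [P for P in found if len(P) >= 3]
print(len(large))  # 44
```

The 44 itemsets of size 3 or more are the ones listed in the 1-pseudo frequent itemset example for σ = 3; the remaining entries of `found` are the small itemsets of size at most 2.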
Computing k-Pseudo Occurrences
• Define Occ_{=h}(P) = { T∈D | |P \ T| = h }: the set of transactions missing exactly h items of P
Occ_{≦k}(P) = ∪_{h≦k} Occ_{=h}(P)
• Occ_{=h}(P∪e) = ( Occ_{=h}(P) ∩ Occ(e) ) ∪ ( Occ_{=h-1}(P) \ Occ(e) )
the pseudo occurrence set is updated by taking intersections
• Compute Occ_{=h}(P) ∩ Occ(e) for all pairs of e and h
[Figure: example transactions over items A to G grouped into the sets Occ0, Occ1, Occ2 of an itemset P]
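The update rule for the sets of transactions missing exactly h items can be written as a small Python function. This is a sketch under stated assumptions: the bucket list `buckets` (entry h holds the transactions missing exactly h items of P) and the function name `extend_buckets` are introduced here for illustration, and the added item e is assumed not to be in P.

```python
D = [
    {1, 2, 5, 6, 7, 9},
    {2, 3, 4, 5},
    {1, 2, 7, 8, 9},
    {1, 7, 9},
    {2, 7, 9},
    {2},
]

def extend_buckets(buckets, e, k):
    """Given buckets[h] for P (h = 0..k), return the buckets for P ∪ {e}."""
    new = []
    for h in range(k + 1):
        # Transactions that already missed h items and contain e stay in bucket h.
        kept = [T for T in buckets[h] if e in T]
        # Transactions that missed h-1 items and lack e move up into bucket h.
        moved = [T for T in buckets[h - 1] if e not in T] if h > 0 else []
        new.append(kept + moved)
    return new

k = 1
buckets = [list(D)] + [[] for _ in range(k)]  # P = empty set: nothing is missing
for e in (2, 7, 9):
    buckets = extend_buckets(buckets, e, k)

print([len(b) for b in buckets])    # [3, 1]
print(sum(len(b) for b in buckets)) # 4, the 1-pseudo frequency of {2,7,9}
```

Transactions that would miss more than k items simply drop out, so the buckets shrink as the recursion goes deeper.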
Taking Intersections Efficiently
• Occ_{=h}(P∪e) = ( Occ_{=h}(P) ∩ Occ(e) ) ∪ ( Occ_{=h-1}(P) \ Occ(e) )
has the same properties as usual occurrence sets
many existing techniques for updating occurrence sets can be used
(down project, delivery, bitmap…)
• Database reduction (FP-tree) is also available
• In deeper levels of the recursion, few transactions have to be scanned, thereby the computation is fast
Ex.) a database and its item-wise representation
A: 1,2,5,6,7,9; B: 2,3,4,5; C: 1,2,7,8,9; D: 1,7,9; E: 2,7,9; F: 2
1: A,C,D; 2: A,B,C,E,F; 3: B; 4: B; 5: A,B; 6: A; 7: A,C,D,E; 8: C; 9: A,C,D,E
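The item-wise representation in the example is obtained by scanning each transaction once and appending its name to the list of every item it contains. The sketch below shows this transposition in Python; the function name `transpose` is an illustrative choice and is not claimed to be the exact "delivery" procedure of any existing implementation.

```python
from collections import defaultdict

DB = {
    "A": {1, 2, 5, 6, 7, 9},
    "B": {2, 3, 4, 5},
    "C": {1, 2, 7, 8, 9},
    "D": {1, 7, 9},
    "E": {2, 7, 9},
    "F": {2},
}

def transpose(database):
    """Map each item e to Occ(e), the names of the transactions containing e."""
    index = defaultdict(list)
    for name in sorted(database):
        for item in sorted(database[name]):
            index[item].append(name)
    return dict(index)

index = transpose(DB)
print(index[7])  # ['A', 'C', 'D', 'E']
print(index[5])  # ['A', 'B']
```

With this index, Occ(e) for every item e is available after a single pass over the database, which is what the intersection-based update needs.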
Using Bottom-wideness
• Backtrack (depth-first search) generates several recursive calls in each iteration
The computation tree spreads exponentially toward the bottom
The computation time is dominated by the bottom-level iterations of the recursion tree
Since few occurrences have to be computed in the lower levels, an iteration takes a long time near the root and a short time near the bottom
Amortized computation time is reduced to that of the bottom levels
For Large Minimum Support
• When σ is large, we access many transactions in the bottom levels
The improvement by bottom-wideness is not drastic
• Reduce the database to speed up the bottom levels
(1) Delete items less than the maximum item in P
(2) Delete items that are infrequent on the occurrence set database
(since they are never added in the recursive calls)
(3) Unify identical transactions
• In practice, the database size is constant in the bottom levels
No big difference from small σ
Ex.) P={1,3}, k=1, σ=4, database:
1 3 5
1 2 3 4 6
1 7
2 3 4 6 7
3 4 5 6 7
2 3 4 6 7
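The three reduction steps can be sketched in Python on the figure's data, reading its six rows as transactions. Several points are assumptions of this sketch: each reduced row keeps its miss count h = |P \ T| so that pseudo frequencies can still be computed; items up to and including the maximum item of P are dropped, which presumes that only larger items are added later; and "infrequent" in step (2) is read as "P+e is not k-pseudo frequent". The function name `reduce_database` is hypothetical.

```python
from collections import Counter

def reduce_database(database, P, k, sigma):
    """Return {(h, remaining items): multiplicity} over the k-pseudo occurrences of P."""
    P = set(P)  # assumed nonempty
    top = max(P)
    # Keep the k-pseudo occurrences with their miss count h; (1) drop items <= max(P).
    rows = [(len(P - T), {e for e in T if e > top})
            for T in database if len(P - T) <= k]
    # (2) P+e is k-pseudo included in every row with h < k,
    #     and in the rows with h == k that contain e.
    base = sum(1 for h, _ in rows if h < k)
    support = Counter(e for h, items in rows if h == k for e in items)
    keep = {e for _, items in rows for e in items if base + support[e] >= sigma}
    # (3) Unify rows that became identical, remembering how many were merged.
    return Counter((h, frozenset(items & keep)) for h, items in rows)

DB = [{1, 3, 5}, {1, 2, 3, 4, 6}, {1, 7}, {2, 3, 4, 6, 7}, {3, 4, 5, 6, 7}, {2, 3, 4, 6, 7}]
reduced = reduce_database(DB, {1, 3}, k=1, sigma=4)
print(len(reduced))                          # 4 distinct rows remain
print(reduced[(1, frozenset({4, 6, 7}))])    # 3 transactions were unified
```

In this example item 5 is removed in step (2), after which three of the six transactions become identical and are merged into one weighted row.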
Small & Trivial Patterns
• Under the k-pseudo inclusion, any itemset of size no more than k is included in every transaction
• Itemsets of size slightly greater than k are also included in many transactions
Many small and trivial frequent itemsets
• We want to ignore these itemsets in practice
Consider the problem of directly finding pseudo frequent itemsets of size l
Directly Finding Large Itemset
• Exponential time is needed if we search all itemsets of size l
Pruning unnecessary search is crucial
Take candidates according to partial structure
• Let P be a k-pseudo frequent itemset of size l
• WLOG, P={1,…,l}, with the items sorted in increasing order of |Occ_{=k}(P) \ Occ({e})|
• Consider the (k-1)-pseudo frequency of the itemset {1,…,y}
• Any transaction in Occ_{=k}(P) \ Occ({e}), e>y, (k-1)-pseudo includes {1,…,y}
Search Route to Itemset of Size l
• Any transaction in Occ_{=k}(P) \ Occ({e}), e>y, (k-1)-pseudo includes {1,…,y}
|Occ_{≦k-1}({1,…,y})| ≧ |∪_{e=y+1,…,|P|} (Occ_{=k}(P) \ Occ({e}))|
• The average of |Occ_{=k}(P) \ Occ({e})| over e∈P is no less than (k / |P|) |Occ_{=k}(P)|
• The items of P are sorted in increasing order of |Occ_{=k}(P) \ Occ({e})|
|Occ_{≦k-1}({1,…,y})| ≧ |Occ_{=k}(P)|×(|P|-y)/|P|
There is a sequence of itemsets from the empty set to P composed only of itemsets satisfying the partial frequency condition
Partial frequency condition
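The lower bound on the (k-1)-pseudo frequency of {1,…,y} stated on this slide can be derived in a few steps. The following LaTeX sketch spells out one such derivation; the shorthand a_e is introduced here and is not notation from the slides, and k ≧ 1 is assumed.

```latex
% a_e := |Occ_{=k}(P) \setminus Occ(\{e\})|, items sorted so that a_1 \le \dots \le a_{|P|}
\begin{align*}
\sum_{e \in P} a_e &= k\,|Occ_{=k}(P)|
  && \text{(each } T \in Occ_{=k}(P) \text{ misses exactly } k \text{ items of } P\text{)} \\
\sum_{e > y} a_e &\ge \frac{|P|-y}{|P|}\, k\,|Occ_{=k}(P)|
  && \text{(the } |P|-y \text{ largest terms average at least the overall mean)} \\
\Bigl|\bigcup_{e > y} \bigl(Occ_{=k}(P) \setminus Occ(\{e\})\bigr)\Bigr|
  &\ge \frac{1}{k} \sum_{e > y} a_e
  && \text{(each } T \text{ lies in at most } k \text{ of these sets)} \\
|Occ_{\le k-1}(\{1,\dots,y\})| &\ge \frac{|P|-y}{|P|}\,|Occ_{=k}(P)|
  && \text{(every } T \text{ in the union } (k-1)\text{-pseudo includes } \{1,\dots,y\}\text{)}
\end{align*}
```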
Example for Partial Frequency Condition
Ex.) D = { {1,2,5,6,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,7,9}, {2,7,9}, {2} }
• 1-pseudo frequent itemsets satisfying the partial frequency condition, for k=1, σ=3, l=3:
{1} {2} {5} {7} {9} {1,2} {1,5} {1,6} {1,7} {1,8} {1,9} {2,3} {2,4} {2,5} {2,6} {2,7} {2,8} {2,9} {3,5} {4,5} {5,6} {5,7} {5,9} {6,7} {6,9} {7,8} {7,9} {8,9}
The number of frequent itemsets to be searched is decreased, so an efficient search is expected
Restricted Search Route by P.F.C.
• Any k-pseudo frequent itemset of size l can be found by passing only through itemsets satisfying the partial frequency condition
Let's do backtrack search
• There always exists an item whose removal yields an itemset satisfying the condition
• Tail extension is not available
(removal of the tail may violate the condition)
• Simple hill climbing generates duplications
• So, use a generation rule to avoid duplication (reverse search)
Reverse Search for P.F.C.
• Rule: generate itemset P from P \ {e}, where e is the item of P maximizing |Occ_{≦k-1}(P \ {e})|
(Ties are broken by choosing the minimum index)
ReverseSearch (P)
1. if |P| = l then output P; return;
2. for each e∉P do
     if P+e is a k-pseudo frequent itemset satisfying P.F.C. then
       if e maximizes |Occ_{≦k-1}((P+e) \ {e'})| among e'∈P+e then ReverseSearch (P+e)
3. end for
• |Occ_{≦k-1}(P \ {e})| can be efficiently computed by existing methods
O(|P|×||D||) time for one iteration
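The generation rule can be illustrated with a short Python sketch of reverse search. An important caveat: the transcript does not preserve the formula of the partial frequency condition, so this sketch replaces the test "k-pseudo frequent and satisfying P.F.C." by plain k-pseudo frequency. That substitution keeps the enumeration complete (by monotonicity every ancestor of a pseudo frequent itemset is pseudo frequent) but gives up the pruning that the condition is meant to provide. The function names and the naive frequency counting are illustrative assumptions.

```python
D = [
    {1, 2, 5, 6, 7, 9},
    {2, 3, 4, 5},
    {1, 2, 7, 8, 9},
    {1, 7, 9},
    {2, 7, 9},
    {2},
]

def occ_le(P, database, h):
    """Number of transactions missing at most h items of P."""
    return sum(1 for T in database if len(P - T) <= h)

def parent(P, database, k):
    """Remove the item e maximizing the (k-1)-pseudo frequency of P minus {e}.

    max() returns the first maximal element, so iterating in sorted order
    breaks ties by the smallest item.
    """
    e = max(sorted(P), key=lambda x: occ_le(P - {x}, database, k - 1))
    return P - {e}

def reverse_search(P, items, database, k, sigma, size, out):
    if len(P) == size:
        out.append(P)
        return
    for e in items:
        if e in P:
            continue
        Q = P | {e}
        # Recurse only if Q is pseudo frequent and P is the unique parent of Q.
        if occ_le(Q, database, k) >= sigma and parent(Q, database, k) == P:
            reverse_search(Q, items, database, k, sigma, size, out)

items = sorted(set().union(*D))
found = []
reverse_search(frozenset(), items, D, k=1, sigma=3, size=3, out=found)
print(len(found), len(set(found)))  # 33 33
```

Because every itemset has exactly one parent, each itemset of the target size is reached along exactly one route, so no duplicate check is needed.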
Conclusion
• Introduced an ambiguous inclusion relation in which at most k items of the pattern are not included
• Pseudo frequent itemset mining under this inclusion (monotonicity, intersection, many small and trivial patterns)
• Reverse search for directly finding frequent itemsets of a fixed size
Future works
• implementation and experiments
• extension of the technique to other pattern mining
• approach to inclusion with "ratio r %"