Post on 21-Dec-2015
Cartesian Contour: A Concise Representation for a Collection of Frequent Sets
Ruoming Jin Kent State University
Joint work with Yang Xiang and Lin Liu (KSU)
Frequent Pattern Mining
• Summarizing the underlying datasets, providing key insights
• Key building block for data mining toolbox
  – Association rule mining
  – Classification
  – Clustering
  – Change Detection
  – etc…
• Application Domains
  – Business, biology, chemistry, WWW, computer/networking security, software engineering, …
The Problem
• The number of patterns is too large
• Attempts
  – Maximal Frequent Itemsets
  – Closed Frequent Itemsets
  – Non-Derivable Itemsets
  – Compressed or Top-k Patterns
  – …
• Tradeoff
  – Significant Information Loss
  – Large Size
Pattern Summarization
• Using a small number of itemsets to best represent the entire collection of frequent itemsets
  – The Spanning Set Approach [Afrati-Gionis-Mannila, KDD04]
  – Exact Description = Maximal Frequent Itemsets
• Our problem:
  – Can we find a concise representation which allows both exact and approximate summarization of a collection of frequent itemsets?
Basic Idea
{A,B,G,H}, {A,B,I,J}, {A,B,K,L}
{C,D,G,H}, {C,D,I,J}, {C,D,K,L}
{E,F,G,H}, {E,F,I,J}, {E,F,K,L}
9 itemsets, 36 items.
Picturing
Cartesian Product: {{A,B},{C,D},{E,F}} × {{G,H},{I,J},{K,L}}
1 biclique, 6 itemsets, 12 items
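The counts on this slide can be checked with a small sketch. The function names `cartesian_product` and `representation_cost` are illustrative and not from the talk; the cost is assumed to be the triple (bicliques, itemsets, items) used on the slide.

```python
from itertools import product


def cartesian_product(xs, ys):
    """Expand X x Y into the itemsets x ∪ y for every x in X and y in Y."""
    return {frozenset(x) | frozenset(y) for x, y in product(xs, ys)}


def representation_cost(xs, ys):
    """Return (bicliques, itemsets, items) for one Cartesian product X x Y."""
    itemsets = len(xs) + len(ys)
    items = sum(len(s) for s in xs) + sum(len(s) for s in ys)
    return 1, itemsets, items


X = [{"A", "B"}, {"C", "D"}, {"E", "F"}]
Y = [{"G", "H"}, {"I", "J"}, {"K", "L"}]

expanded = cartesian_product(X, Y)               # the 9 explicit itemsets
explicit_items = sum(len(s) for s in expanded)   # items needed to list them one by one
print(len(expanded), explicit_items, representation_cost(X, Y))
```

Listing the itemsets explicitly takes 9 itemsets and 36 items, while the product form needs only 6 itemsets and 12 items.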
Covering
[Figure: Cartesian covering; label: non-frequent itemsets]
Problem Formulation
• Cartesian product
  – e.g. $\{\{A,B\},\{E\}\} \times \{\{C,D\}\} = \{\{A,B,C,D\},\{C,D,E\}\}$
• Cost of a Cartesian product
  – e.g. 1 biclique, 3 itemsets, and 5 items
• Covering
  – e.g. $C(\{\{A,B\},\{E\}\} \times \{\{C,D\}\}) = 2^{\{A,B,C,D\}} \cup 2^{\{C,D,E\}}$
    $= \{\{A\},\{B\},\{C\},\{D\},\{A,B\},\{A,C\},\{A,D\},\{B,C\},\{B,D\},\{C,D\},\{A,B,C\},\{A,B,D\},\{A,C,D\},\{B,C,D\},\{A,B,C,D\}\}$
    $\cup\ \{\{E\},\{C\},\{D\},\{E,C\},\{E,D\},\{C,D\},\{E,C,D\}\}$
How can we use Cartesian products to concisely represent a collection of frequent itemsets?
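As a rough illustration of the covering notion, the sketch below computes the set of nonempty itemsets covered by a single Cartesian product, using the example {{A,B},{E}} × {{C,D}}. The helper names `nonempty_subsets` and `cover` are assumptions made for this sketch, and the empty itemset is left out.

```python
from itertools import combinations, product


def nonempty_subsets(itemset):
    """All nonempty subsets of an itemset, as frozensets."""
    items = sorted(itemset)
    return {
        frozenset(combo)
        for size in range(1, len(items) + 1)
        for combo in combinations(items, size)
    }


def cover(xs, ys):
    """Itemsets covered by X x Y: every nonempty subset of every x ∪ y."""
    covered = set()
    for x, y in product(xs, ys):
        covered |= nonempty_subsets(set(x) | set(y))
    return covered


covered = cover([{"A", "B"}, {"E"}], [{"C", "D"}])
print(len(covered))
```

The two products {A,B,C,D} and {C,D,E} share the subsets {C}, {D} and {C,D}, so the cover holds 15 + 7 - 3 = 19 itemsets.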
Exact and Approximate Covering
Exact Representation
$\{\{G\}\} \times \{\{A,B\}\}$ and $\{\emptyset\} \times \{\{C,D\}\}$
Cost: 2 bicliques, 4 itemsets, 6 items
False positives: none
Approximate Representation
$\{\{G\}\} \times \{\{A,B\},\{C,D\}\}$
Cost: 1 biclique, 3 itemsets, 5 items
False positives: {G,C}, {G,D}, {G,C,D}
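The false positives on this slide can be reproduced with a short sketch. It assumes that the maximal frequent itemsets behind the example are {A,B,G} and {C,D}; this is inferred from the slide rather than stated on it, and the function names are illustrative.

```python
from itertools import combinations, product


def nonempty_subsets(itemset):
    """All nonempty subsets of an itemset, as frozensets."""
    items = sorted(itemset)
    return {
        frozenset(combo)
        for size in range(1, len(items) + 1)
        for combo in combinations(items, size)
    }


def downward_closure(itemsets):
    """Union of the nonempty subsets of every given itemset."""
    closure = set()
    for itemset in itemsets:
        closure |= nonempty_subsets(itemset)
    return closure


def false_positives(bicliques, maximal_frequent):
    """Covered itemsets that are not actually frequent."""
    covered = set()
    for xs, ys in bicliques:
        covered |= downward_closure(set(x) | set(y) for x, y in product(xs, ys))
    return covered - downward_closure(maximal_frequent)


# Assumed maximal frequent itemsets for the slide's example.
mfis = [{"A", "B", "G"}, {"C", "D"}]

exact = [([{"G"}], [{"A", "B"}]), ([set()], [{"C", "D"}])]
approximate = [([{"G"}], [{"A", "B"}, {"C", "D"}])]

exact_fp = false_positives(exact, mfis)
approx_fp = false_positives(approximate, mfis)
print(sorted(sorted(s) for s in approx_fp))
```

The approximate representation saves one biclique at the price of three itemsets that are covered but not frequent.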
Covering Maximal Frequent Itemsets
{{ABC}, {CDE}}
{{GHI}, {JKL}}
{{MNO}, {PQR}}
{{STU}, {VWX}}
ABCSTU
ABCGHI
CDESTU
CDEGHI
CDEVWX
CDEJKL
MNOVWX
MNOGHI
PQRJKL
Problem Reformulation
Given Maximal Frequent Itemsets:
$\{\{A,B,C,I\},\{A,B,D,I\},\{C,E,F,J\},\{D,E,F,J\},\{D,G,H,K\},\{E,G,H,K\}\}$
[Figure: the frequent itemsets with an exact representation and an approximate representation; each panel is labeled C1 and C2]
Minimal Biclique Set Cover Problem
Ground Set: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
1, 2,3,4,6,7,8,9
5,10,11
NP-hardness
• By reducing Minimal Biclique Set Cover to our problem, we can prove that both Problem 1 (exact) and Problem 2 (approximate) are NP-hard.
• Minimal Biclique Set Cover is a variant of the classical Set Cover problem
Can we use the standard set-cover greedy algorithm?
Naïve greedy algorithm
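• Greedy algorithm:
  – Each time choose a biclique $C_i = X_i \times Y_i$ with the lowest price $\gamma(C_i) = \dfrac{|X_i| + |Y_i|}{|S_i|}$, where $S_i$ is the set of not-yet-covered elements $e$ that $C_i$ covers.
  – $|X_i| + |Y_i|$ is the cost.
  – This method has a logarithmic approximation bound.
• The problem?
  – The number of candidate bicliques is $2^{|X|+|Y|}$!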
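A minimal sketch of the standard price-based greedy for weighted set cover, which is the scheme this slide builds on. The candidates are given as explicit (cost, covered elements) pairs; the costs 4, 3 and 6 are hypothetical values chosen for illustration, and enumerating real biclique candidates is exactly the exponential step the slide warns about.

```python
from fractions import Fraction


def greedy_cover(ground, candidates):
    """Pick candidates by lowest price = cost / newly covered elements.

    candidates: list of (cost, covered_elements) pairs.
    Returns the indices of the chosen candidates, in selection order.
    """
    uncovered = set(ground)
    chosen = []
    while uncovered:
        best_index, best_price = None, None
        for index, (cost, covers) in enumerate(candidates):
            gain = len(covers & uncovered)
            if gain == 0:
                continue  # covers nothing new, so its price is undefined
            price = Fraction(cost, gain)
            if best_price is None or price < best_price:
                best_index, best_price = index, price
        if best_index is None:
            raise ValueError("the candidates cannot cover the ground set")
        chosen.append(best_index)
        uncovered -= candidates[best_index][1]
    return chosen


ground = set(range(1, 12))
candidates = [
    (4, {1, 2, 3, 4, 6, 7, 8, 9}),  # hypothetical cost
    (3, {5, 10, 11}),               # hypothetical cost
    (6, set(range(1, 12))),         # hypothetical cost
]
chosen = greedy_cover(ground, candidates)
print(chosen)
```

In this run the greedy choice pays 4 + 3 = 7 although the single candidate of cost 6 would suffice, which shows why the method only carries an approximation bound.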
Candidate Reduction
• Assume one side of the biclique candidate is known, how to choose the other side?
Greedy Algorithm
Fixed!
Split and sort
Covering 4 Covering 3 Covering 3
Add 1st single Y-vertex Biclique
Cost = 1;
Add 2nd single Y-vertex Biclique
Cost = 5/7;
Add 3rd single Y-vertex Biclique
Cost = 6/8 > 5/7
Cheapest sub-biclique!
Biclique Candidate
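The walk-through above can be sketched as follows, under stated assumptions: the X side is fixed with 3 itemsets, each single Y-vertex biclique is described by the set of ground elements it covers, and the third vertex overlaps the second so that it adds only one new element. The name `cheapest_sub_biclique` and the element numbering are made up for this sketch, which scans every greedy prefix rather than stopping at the first price increase.

```python
from fractions import Fraction


def cheapest_sub_biclique(x_size, y_coverage, uncovered):
    """Greedily add Y-vertices by marginal coverage; return the cheapest prefix.

    y_coverage maps each Y-vertex to the ground elements covered by X x {y}.
    Returns (price, chosen Y-vertices), where
    price = (|X| + |chosen Y|) / (number of uncovered elements now covered).
    """
    remaining = dict(y_coverage)
    covered, chosen = set(), []
    best_price, best_ys = None, []
    while remaining:
        # Y-vertex that covers the most still-uncovered elements.
        y = max(sorted(remaining),
                key=lambda v: len((remaining[v] & uncovered) - covered))
        gain = (remaining.pop(y) & uncovered) - covered
        if not gain:
            break
        covered |= gain
        chosen.append(y)
        price = Fraction(x_size + len(chosen), len(covered))
        if best_price is None or price < best_price:
            best_price, best_ys = price, list(chosen)
    return best_price, best_ys


y_coverage = {"y1": {1, 2, 3, 4}, "y2": {5, 6, 7}, "y3": {6, 7, 8}}
price, ys = cheapest_sub_biclique(3, y_coverage, set(range(1, 9)))
print(price, ys)
```

The prices along the way are 4/4 = 1, 5/7 and 6/8, so the procedure keeps the first two Y-vertices.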
Approximation Bound of the Greedy Algorithm
The greedy SubBiclique procedure finds a sub-biclique whose price is at most e/(e-1) times the price of the optimal (cheapest) sub-biclique!
Further Reduction
• Using only IDEA 1, the time complexity is still exponential: $2^{|X|}|Y|$.
• How to reduce this further?
  – Are all the $2^{|X|}$ combinations equally important?
  – No, because some are more likely to connect to the Y side.
  – Our solution: Frequent itemset mining!
Using Frequent Itemset Mining
Overall Algorithm
• Step 1: Use the Frequent Itemset Mining tool to find all the (one-side maximal) biclique candidates;
• Step 2: Calculate the cheapest sub-biclique for each candidate using the greedy procedure;
• Step 3: Compare all the sub-bicliques, choose the cheapest one;
• Step 4: If the MFIs are totally covered, done; otherwise go to Step 2.
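Steps 2 to 4 can be sketched as below for the exact case. This is not the authors' implementation: Step 1 is skipped, so the X-side candidates and Y-vertices are passed in by hand instead of coming from a frequent itemset miner, and a biclique is assumed to cover an MFI when some x ∪ y equals it while every x ∪ y stays frequent. All function names are hypothetical.

```python
from fractions import Fraction


def sub_biclique(xs, ys, mfis, uncovered):
    """Step 2: cheapest sub-biclique for a fixed X side (exact covering)."""
    coverage = {}
    for y in ys:
        unions = [x | y for x in xs]
        # Keep y only if every x ∪ y is frequent, so no false positives arise.
        if all(any(u <= m for m in mfis) for u in unions):
            coverage[y] = {u for u in unions if u in uncovered}
    chosen, covered, best = [], set(), None
    while coverage:
        y = max(sorted(coverage, key=sorted),
                key=lambda v: len(coverage[v] - covered))
        gain = coverage.pop(y) - covered
        if not gain:
            break
        covered |= gain
        chosen.append(y)
        price = Fraction(len(xs) + len(chosen), len(covered))
        if best is None or price < best[0]:
            best = (price, list(chosen), set(covered))
    return best


def cartesian_contour(mfis, x_candidates, y_vertices):
    """Steps 2-4: repeat until every maximal frequent itemset is covered."""
    mfis = {frozenset(m) for m in mfis}
    x_candidates = [[frozenset(x) for x in xs] for xs in x_candidates]
    y_vertices = [frozenset(y) for y in y_vertices]
    uncovered, result = set(mfis), []
    while uncovered:
        options = []
        for xs in x_candidates:
            found = sub_biclique(xs, y_vertices, mfis, uncovered)
            if found is not None:
                options.append((found[0], xs, found[1], found[2]))
        if not options:
            raise ValueError("the candidates cannot cover all MFIs")
        _, xs, ys, covered = min(options, key=lambda o: o[0])  # Step 3
        result.append((xs, ys))
        uncovered -= covered
    return result


X = [{"A", "B"}, {"C", "D"}, {"E", "F"}]
Y = [{"G", "H"}, {"I", "J"}, {"K", "L"}]
mfis = [x | y for x in X for y in Y]
result = cartesian_contour(mfis, [X, [{"A", "B"}]], Y)
print(len(result))
```

On the nine itemsets from the Basic Idea slide, the three-itemset X side reaches price 6/9 and wins over the single-itemset X side, so one biclique covers everything.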
Approximation Bound
Our algorithm has an e/(e-1) (ln(n)+1) approximation ratio with respect to the candidate set (all the sub-bicliques with one side coming from frequent itemset mining).
Speed-up techniques (1)
• Using Closed itemsets for X and Y
  – Initially X and Y contain all the FI, respectively.
  – Using $X \times Y = \{X_1 \cup Y_1, \ldots, X_1 \cup Y_m, \ldots, X_l \cup Y_1, \ldots, X_l \cup Y_m\}$ to cover MFI is similar to factorizing MFI;
  – MFI’s maximal factor itemsets are closed itemsets, whose number is much smaller!
Speed-up techniques (2)
[Figure: example graph with node labels A,B; C,D; E,F; A,G; K,L; G,H; B,G; A,I / G; I; J; M; A; B / G,I; M,G; B,I. Legend: frequent itemset, supporting transaction.]
• Sparse Graph: the number of frequent itemsets is small; valuable biclique candidates are not fully used!
• Dense Graph: the number of frequent itemsets is big; handling those candidates is too slow!
• TRADEOFF
Speed-up techniques (3)
• Iterative procedure
  – There is a large number of closed itemsets;
  – Covering the MFIs in one pass can produce a huge number of biclique candidates;
  – So cover the MFIs in several passes;
  – The support level is reduced gradually!
Experiments
• Data sets:
Conclusion
• We propose an interesting summarization problem which considers the interaction between frequent patterns
• We transform this problem into a generalized minimal biclique covering problem and design an approximation algorithm with a bound
• The experimental results demonstrate the effectiveness and efficiency of our approach
Thank you !!!
Reference
[Bayardo98] Roberto J. Bayardo Jr. Efficiently mining long patterns from databases. SIGMOD98.
[Pasquier99] Nicolas Pasquier, Yves Bastide, Rafik Taouil, and Lotfi Lakhal. Discovering frequent closed itemsets for association rules. ICDT99.
[Calder07] Toon Calders and Bart Goethals. Non-derivable itemset mining. Data Min. Knowl. Discov. 07.
[Han02] Jiawei Han, Jianyong Wang, Ying Lu, and Petre Tzvetkov. Mining top-k frequent closed patterns without minimum support. ICDM02.
[Xin06] Dong Xin, Hong Cheng, Xifeng Yan, and Jiawei Han. Extracting redundancy-aware top-k patterns. KDD06.
[Xin05] Dong Xin, Jiawei Han, Xifeng Yan, and Hong Cheng. Mining compressed frequent-pattern sets. VLDB05.
[Afrati04] Foto Afrati, Aristides Gionis, and Heikki Mannila. Approximating a collection of frequent sets. KDD04.
[Yan05] Xifeng Yan, Hong Cheng, Jiawei Han, and Dong Xin. Summarizing itemset patterns: a profile-based approach. KDD05.
[Wang06] Chao Wang and Srinivasan Parthasarathy. Summarizing itemset patterns using probabilistic models. KDD06.
[Jin08] Ruoming Jin, Muad Abu-Ata, Yang Xiang, and Ning Ruan. Effective and efficient itemset pattern summarization: regression-based approaches. KDD08.
[Xiang08] Yang Xiang, Ruoming Jin, David Fuhry, and Feodor F. Dragan. Succinct summarization of transactional databases: an overlapped hyperrectangle scheme. KDD08.
Related Work
• K-itemset approximation: [Afrati04].
  – Difference:
    • their work is a special case of our work;
    • their work is expensive for exact description;
    • our work uses set cover and max-k cover methods.
• Restoring the frequency of frequent itemsets: [Yan05, Wang06, Jin08].
• Hyperrectangle covering problem: [Xiang08].