
Cartesian Contour: A Concise Representation for a Collection of Frequent Sets

Ruoming Jin Kent State University

Joint work with Yang Xiang and Lin Liu (KSU)

Frequent Pattern Mining

• Summarizing the underlying datasets, providing key insights

• Key building block for the data mining toolbox
– Association rule mining
– Classification
– Clustering
– Change detection
– etc.

• Application domains
– Business, biology, chemistry, WWW, computer/networking security, software engineering, …

The Problem

• The number of patterns is too large

• Attempts
– Maximal Frequent Itemsets
– Closed Frequent Itemsets
– Non-Derivable Itemsets
– Compressed or Top-k Patterns
– …

• Tradeoff
– Significant information loss
– Large size

Pattern Summarization

• Using a small number of itemsets to best represent the entire collection of frequent itemsets
– The Spanning Set Approach [Afrati-Gionis-Mannila, KDD04]
– Exact Description = Maximal Frequent Itemsets

• Our problem:
– Can we find a concise representation that allows both exact and approximate summarization of a collection of frequent itemsets?

Basic Idea

{A,B,G,H}, {A,B,I,J}, {A,B,K,L}

{C,D,G,H}, {C,D,I,J}, {C,D,K,L}

{E,F,G,H}, {E,F,I,J}, {E,F,K,L}

9 itemsets, 36 items.

Picturing as a Cartesian Product:

$\{\{A,B\},\{C,D\},\{E,F\}\} \times \{\{G,H\},\{I,J\},\{K,L\}\}$

1 biclique, 6 itemsets, 12 items
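A minimal sketch of this idea, assuming itemsets are modeled as Python `frozenset`s; the helper name `cartesian_product` is ours, not from the slides. It expands the two factor collections back into the nine itemsets and compares the two sizes:

```python
from itertools import product

def cartesian_product(xs, ys):
    """Union every itemset in xs with every itemset in ys."""
    return [x | y for x, y in product(xs, ys)]

X = [frozenset("AB"), frozenset("CD"), frozenset("EF")]
Y = [frozenset("GH"), frozenset("IJ"), frozenset("KL")]

itemsets = cartesian_product(X, Y)

# Size of the explicit listing versus the factored form.
explicit_items = sum(len(s) for s in itemsets)  # items in the 9 itemsets
factored_items = sum(len(s) for s in X + Y)     # items in the 6 factor itemsets

print(len(itemsets), explicit_items)   # explicit: itemsets, items
print(len(X + Y), factored_items)      # factored: itemsets, items
```

The factored form stores only the two sides of the product, which is why it needs 12 items instead of 36.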

Covering

[Figure: Cartesian covering, with the non-frequent itemsets marked]

Problem Formulation

• Cartesian product
– e.g. $\{\{A,B\},\{E\}\} \times \{\{C,D\}\} = \{\{A,B,C,D\},\{E,C,D\}\}$

• Cost of a Cartesian product
– e.g. 1 biclique, 3 itemsets, and 5 items

• Covering
– e.g. $C(\{\{A,B\},\{E\}\} \times \{\{C,D\}\}) = 2^{\{A,B,C,D\}} \cup 2^{\{E,C,D\}}$, where

$2^{\{A,B,C,D\}} = \{\{A\},\{B\},\{C\},\{D\},\{A,B\},\{A,C\},\{A,D\},\{B,C\},\{B,D\},\{C,D\},\{A,B,C\},\{A,B,D\},\{A,C,D\},\{B,C,D\},\{A,B,C,D\}\}$

$2^{\{E,C,D\}} = \{\{E\},\{C\},\{D\},\{E,C\},\{E,D\},\{C,D\},\{E,C,D\}\}$
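The covering and cost definitions can be sketched in a few lines. This assumes the slide's example is the product {{A,B},{E}} × {{C,D}} (our reading, consistent with the stated cost of 1 biclique, 3 itemsets, and 5 items) and that a covering contains only non-empty subsets. The function names `nonempty_subsets`, `covering`, and `cost` are hypothetical.

```python
from itertools import combinations, product

def nonempty_subsets(itemset):
    """All non-empty subsets of an itemset, as frozensets."""
    items = sorted(itemset)
    return {frozenset(c)
            for r in range(1, len(items) + 1)
            for c in combinations(items, r)}

def covering(xs, ys):
    """Every non-empty subset of some itemset x ∪ y in xs × ys."""
    covered = set()
    for x, y in product(xs, ys):
        covered |= nonempty_subsets(x | y)
    return covered

def cost(xs, ys):
    """Cost of one Cartesian product: (bicliques, itemsets, items)."""
    return 1, len(xs) + len(ys), sum(len(s) for s in xs + ys)

X = [frozenset("AB"), frozenset("E")]
Y = [frozenset("CD")]

covered = covering(X, Y)
print(cost(X, Y))
print(len(covered))
```

Subsets of {C,D} belong to both power sets, so the union is smaller than the sum of the two sizes.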

How can we use Cartesian products to concisely represent a collection of frequent itemsets?

Exact and Approximate Covering

Exact Representation

$\{\{G\}\} \times \{\{A,B\}\}$ and $\{\emptyset\} \times \{\{C,D\}\}$

Cost: 2 bicliques, 4 itemsets, 6 items

False positives: none

Approximate Representation

$\{\{G\}\} \times \{\{A,B\},\{C,D\}\}$

Cost: 1 biclique, 3 itemsets, 5 items

False positives: {G,C}, {G,D}, {G,C,D}
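The trade-off between the two representations can be checked mechanically. The sketch below assumes the frequent collection is every non-empty subset of {A,B,G} and of {C,D}, which is our reading of the example; it is the collection that yields exactly the false positives listed. All function and variable names are hypothetical.

```python
from itertools import combinations, product

def nonempty_subsets(itemset):
    """All non-empty subsets of an itemset, as frozensets."""
    items = sorted(itemset)
    return {frozenset(c)
            for r in range(1, len(items) + 1)
            for c in combinations(items, r)}

def covering(bicliques):
    """Union of the coverings of several Cartesian products (xs, ys)."""
    covered = set()
    for xs, ys in bicliques:
        for x, y in product(xs, ys):
            covered |= nonempty_subsets(x | y)
    return covered

# Assumed frequent collection (not stated explicitly on the slide).
frequent = nonempty_subsets("ABG") | nonempty_subsets("CD")

approx = [([frozenset("G")], [frozenset("AB"), frozenset("CD")])]
exact = [([frozenset("G")], [frozenset("AB")]),
         ([frozenset()], [frozenset("CD")])]   # empty itemset on the X side

def false_positives(bicliques):
    """Covered itemsets that are not frequent."""
    return covering(bicliques) - frequent

print(sorted("".join(sorted(s)) for s in false_positives(approx)))
print(false_positives(exact))
```

The approximate representation saves one biclique, one itemset, and one item at the price of three false positives.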

Covering Maximal Frequent Itemsets

{{ABC}, {CDE}}

{{GHI}, {JKL}}

{{MNO}, {PQR}}

{{STU}, {VWX}}

ABCSTU

ABCGHI

CDESTU

CDEGHI

CDEVWX

CDEJKL

MNOVWX

MNOGHI

PQRJKL

Problem Reformulation

Given Maximal Frequent Itemsets:

$\{\{ABCI\},\{ABDI\},\{CEFJ\},\{DEFJ\},\{DGHK\},\{EGHK\}\}$

[Figure: the frequent itemsets covered by Cartesian products C1 and C2, shown for the exact representation and for the approximate representation]

Minimal Biclique Set Cover Problem

Ground Set: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11

1, 2, 3, 4, 6, 7, 8, 9

5, 10, 11

NP-hardness

• By reducing the Minimal Biclique Set Cover problem to our problem, we can prove that Problem 1 (exact) and Problem 2 (approximate) are both NP-hard.

• Minimal Biclique Set Cover is a variant of the classical Set Cover problem

Can we use the standard set-cover greedy algorithm?

Naïve greedy algorithm

• Greedy algorithm:
– Each time, choose the biclique $C_i = X_i \times Y_i$ with the lowest price:

$\text{price}(C_i) = \dfrac{|X_i| + |Y_i|}{|\{e : e \in X_i \times Y_i,\ e \text{ not yet covered}\}|}$

– $|X_i| + |Y_i|$ is the cost.
– This method has a logarithmic approximation bound.

• The problem?
– The number of candidate bicliques is $2^{|X|+|Y|}$!
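The naive greedy rule is the standard weighted set-cover loop: repeatedly take the candidate with the lowest price, where price is cost divided by the number of newly covered elements. In the sketch below, the candidate names, costs, and the third candidate are hypothetical, and each candidate is given directly as the set of ground elements it covers rather than as an explicit biclique.

```python
from fractions import Fraction

def greedy_cover(ground, candidates):
    """candidates: list of (name, cost, covered_elements) triples."""
    uncovered = set(ground)
    chosen = []
    while uncovered:
        best = None
        for name, cost, elems in candidates:
            gain = len(elems & uncovered)   # newly covered elements
            if gain == 0:
                continue
            price = Fraction(cost, gain)
            if best is None or price < best[0]:
                best = (price, name, elems)
        if best is None:
            raise ValueError("candidates cannot cover the ground set")
        chosen.append(best[1])
        uncovered -= best[2]
    return chosen

ground = range(1, 12)
candidates = [
    ("C1", 4, {1, 2, 3, 4, 6, 7, 8, 9}),   # hypothetical cost
    ("C2", 3, {5, 10, 11}),                # hypothetical cost
    ("C3", 5, {1, 2, 5}),                  # hypothetical extra candidate
]
chosen = greedy_cover(ground, candidates)
print(chosen)
```

The loop itself is cheap; the difficulty is that the inner `for` would have to range over every possible biclique, which is what makes the naive version impractical.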

Candidate Reduction

• Assuming one side of the biclique candidate is known, how do we choose the other side?

Greedy Algorithm

Fixed!

Split and sort

Covering 4 Covering 3 Covering 3

Add 1st single Y-vertex Biclique

Cost = 1;

Add 2nd single Y-vertex Biclique

Cost = 5/7;

Add 3rd single Y-vertex Biclique

Cost = 6/8 > 5/7

Cheapest sub-biclique!

Biclique Candidate
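A minimal sketch of this split-and-sort step, under stated assumptions: the X side is fixed and contributes `x_size` to the cost, each Y vertex adds 1, and the price of a prefix is its cost divided by the number of elements it covers. The coverage sets below are hypothetical; they are chosen so that the three single Y-vertex bicliques cover 4, 3, and 3 elements and the running prices come out as 1, 5/7, and 6/8. The function name `cheapest_sub_biclique` is ours.

```python
from fractions import Fraction

def cheapest_sub_biclique(x_size, y_cover):
    """y_cover maps each Y vertex to the ground elements covered by
    the single-Y-vertex biclique X × {y}. Returns (price, chosen Ys)."""
    # Split into single Y-vertex bicliques and sort by coverage, largest first.
    order = sorted(y_cover, key=lambda y: len(y_cover[y]), reverse=True)
    covered, best = set(), None
    for k, y in enumerate(order, start=1):
        covered |= y_cover[y]
        if not covered:
            continue
        price = Fraction(x_size + k, len(covered))
        if best is None or price < best[0]:
            best = (price, order[:k])
    return best

y_cover = {
    "y1": {1, 2, 3, 4},   # covers 4
    "y2": {5, 6, 7},      # covers 3
    "y3": {6, 7, 8},      # covers 3, but only one element is new
}
price, ys = cheapest_sub_biclique(3, y_cover)
print(price, ys)
```

Adding the third vertex raises the price from 5/7 to 6/8, so the cheapest prefix keeps only the first two.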

Approximation Bound of the Greedy Algorithm

The greedy SubBiclique procedure finds a sub-biclique whose price is at most e/(e-1) times the price of the optimal (cheapest) sub-biclique!

Further Reduction

• Using only Idea 1, the time complexity is still exponential: $2^{|X|}\,|Y|$.

• How to reduce this further?
– Are all the $2^{|X|}$ combinations equally important?
– No, because some are more likely to connect to the Y side.
– Our solution: frequent itemset mining!

Using Frequent Itemset Mining

Overall Algorithm

• Step 1: Use a frequent itemset mining tool to find all the (one-side maximal) biclique candidates;

• Step 2: Calculate the cheapest sub-biclique for each candidate using the greedy procedure;

• Step 3: Compare all the sub-bicliques and choose the cheapest one;

• Step 4: If the MFI are totally covered, done; otherwise go to Step 2.
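Steps 2 to 4 can be sketched as one loop. This is an illustration under assumptions, not the authors' implementation: Step 1 is taken as already done, so each candidate arrives as a fixed X side of a given size plus, for every Y vertex, the set of MFI identifiers that X × {y} covers. All names and the example data are hypothetical.

```python
from fractions import Fraction

def cheapest_sub_biclique(x_size, y_cover, uncovered):
    """Step 2: cheapest prefix of Y vertices, counting only uncovered MFI."""
    gains = {y: elems & uncovered for y, elems in y_cover.items()}
    order = sorted((y for y in gains if gains[y]),
                   key=lambda y: len(gains[y]), reverse=True)
    covered, best = set(), None
    for k, y in enumerate(order, start=1):
        covered |= gains[y]
        price = Fraction(x_size + k, len(covered))
        if best is None or price < best[0]:
            best = (price, tuple(order[:k]), frozenset(covered))
    return best

def cover_mfi(mfi_ids, candidates):
    """candidates: {name: (x_size, {y: covered MFI ids})}."""
    uncovered = set(mfi_ids)
    chosen = []
    while uncovered:                       # Step 4: repeat until covered
        best = None
        for name, (x_size, y_cover) in candidates.items():
            sub = cheapest_sub_biclique(x_size, y_cover, uncovered)
            # Step 3: keep the cheapest sub-biclique over all candidates.
            if sub is not None and (best is None or sub[0] < best[1][0]):
                best = (name, sub)
        if best is None:
            raise ValueError("candidates cannot cover all MFI")
        name, (price, ys, covered) = best
        chosen.append((name, ys))
        uncovered -= covered
    return chosen

candidates = {
    "X1": (3, {"y1": {1, 2, 3, 4}, "y2": {5, 6, 7}, "y3": {6, 7, 8}}),
    "X2": (1, {"y4": {8}, "y5": {1}}),
}
result = cover_mfi(range(1, 9), candidates)
print(result)
```

Prices are recomputed against the still-uncovered MFI in every round, which is why the loop returns to Step 2 rather than reusing the earlier sub-bicliques.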

Approximation Bound

Our algorithm has an (e/(e-1)) (ln(n) + 1) approximation ratio with respect to the candidate set (all the sub-bicliques with one side coming from frequent itemset mining).

Speed-up techniques (1)

• Using closed itemsets for X and Y
– Initially, X and Y each contain all the frequent itemsets (FI).
– Using $X \times Y = \{X_1 \cup Y_1, \ldots, X_1 \cup Y_m, \ldots, X_l \cup Y_1, \ldots, X_l \cup Y_m\}$ to cover the MFI is similar to factorizing the MFI;
– The MFI's maximal factor itemsets are closed itemsets, whose number is much smaller!

Speed-up techniques (2)

[Figure: Sparse Graph vs. Dense Graph; labels: Frequent Itemset, Supporting Transaction]

• Sparse graph: the number of frequent itemsets is small; valuable biclique candidates are not fully used!

• Dense graph: the number of frequent itemsets is big; handling those candidates is too slow!

• TRADEOFF

Speed-up techniques (3)

• Iterative procedure
– There is a large number of closed itemsets;
– Covering the MFI in a single pass can produce a huge number of biclique candidates;
– So the MFI are covered over several passes;
– The support level is reduced gradually!

Experiments

• Data sets:

Conclusion

• We propose an interesting summarization problem that considers the interaction between frequent patterns

• We transform this problem into a generalized minimal biclique covering problem and design an approximation algorithm with a bound

• The experimental results demonstrate the effectiveness and efficiency of our approach

Thank you !!!

References

[Bayardo98] Roberto J. Bayardo Jr. Efficiently mining long patterns from databases. SIGMOD98.
[Pasquier99] Nicolas Pasquier, Yves Bastide, Rafik Taouil, and Lotfi Lakhal. Discovering frequent closed itemsets for association rules. ICDT99.
[Calder07] Toon Calders and Bart Goethals. Non-derivable itemset mining. Data Min. Knowl. Discov. 07.
[Han02] Jiawei Han, Jianyong Wang, Ying Lu, and Petre Tzvetkov. Mining top-k frequent closed patterns without minimum support. ICDM02.
[Xin06] Dong Xin, Hong Cheng, Xifeng Yan, and Jiawei Han. Extracting redundancy-aware top-k patterns. KDD06.
[Xin05] Dong Xin, Jiawei Han, Xifeng Yan, and Hong Cheng. Mining compressed frequent-pattern sets. VLDB05.
[Afrati04] Foto Afrati, Aristides Gionis, and Heikki Mannila. Approximating a collection of frequent sets. KDD04.
[Yan05] Xifeng Yan, Hong Cheng, Jiawei Han, and Dong Xin. Summarizing itemset patterns: a profile-based approach. KDD05.
[Wang06] Chao Wang and Srinivasan Parthasarathy. Summarizing itemset patterns using probabilistic models. KDD06.
[Jin08] Ruoming Jin, Muad Abu-Ata, Yang Xiang, and Ning Ruan. Effective and efficient itemset pattern summarization: regression-based approaches. KDD08.
[Xiang08] Yang Xiang, Ruoming Jin, David Fuhry, and Feodor F. Dragan. Succinct summarization of transactional databases: an overlapped hyperrectangle scheme. KDD08.

Related Work

• K-itemset approximation: [Afrati04].
– Difference:
  • their work is a special case of our work;
  • their work is expensive for exact description;
  • our work uses set cover and max-k cover methods.

• Restoring the frequency of frequent itemsets: [Yan05, Wang06, Jin08].

• Hyperrectangle covering problem: [Xiang08].
