close graph
CloseGraph (Improvements of gSpan)
Sayeed Mahmud
gSpan – where we left off
• Two graph pattern mining techniques
  – Apriori based
  – Pattern growth based
• gSpan is based on the pattern growth philosophy
• It reduces two main bottlenecks of Apriori-based techniques
  – Joining two k-edge graphs to generate a (k+1)-edge graph
  – Checking the frequency of each generated candidate separately
Pattern Growth before gSpan
• gSpan improved several other aspects of previous pattern growth approaches (like NaiveGraph).
• An n-edge graph can be extended in n ways
• There are also isomorphisms
• These extended graphs were duplicate graphs
• With naïve approaches it was necessary to check for these duplicate graphs
gSpan over Naive
• Reduced the generation of duplicate graphs
  – Only extends along the rightmost path
• No need to search for duplicates
  – Can tell from the DFS codes
• Eliminates duplicate graphs and still guarantees that no frequent candidate is left out
  – DFS Code Covering property
Drawbacks of gSpan
• Competitive performance edge over previous pattern growth algorithms
• Performance degrades when the graphs are large
• A graph with 64 edges may have 2^64 frequent sub-graphs.
• gSpan may have to extend a lot of them.
• Implementation bottlenecks (DFS-based implementation)
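The 2^64 figure can be checked with a little arithmetic: every sub-graph of a frequent graph is itself frequent, so each subset of the 64 edges is a potential frequent sub-graph. The sketch below counts edge subsets for a small, made-up 4-edge path and then prints the bound for 64 edges. It is an upper bound only, since some subsets are disconnected or isomorphic to each other; the graph and the helper name are assumptions for illustration.

```python
from itertools import combinations

def edge_subsets(edges):
    """Yield every non-empty subset of the edge list (connected or not)."""
    for k in range(1, len(edges) + 1):
        yield from combinations(edges, k)

# Hypothetical 4-edge path a-b-c-d-e, used only for illustration.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]
subsets = list(edge_subsets(edges))

print(len(subsets))   # 2**4 - 1 = 15 non-empty edge subsets
print(2 ** 64)        # upper bound on edge subsets of a 64-edge graph
```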
Drawbacks of gSpan
• Most of the generated frequent candidates contain no helpful information for analysis
  – They just share a common frequency
  – Example: AIDS Antiviral screen dataset
• 1,000,000 frequent sub-graphs at 5% minimum support
• Only 422 are meaningful
• gSpan has no terminating mechanism and will generate all of these frequent candidates.
• Analysis is difficult with such a huge number of frequent candidates.
Closed Frequent Itemset
• Instead of mining all frequent sub-graphs, we can get away with mining only the closed frequent graphs.
• A graph is closed frequent if it is frequent and none of its proper super-graphs has the same frequency.
Figure: a sub-graph with frequency 2 is not closed, because its super-graph also has frequency 2. The super-graph is closed in this case, since no super-graph of it has the same frequency.
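The closedness test above can be sketched in a few lines of Python. This is a minimal illustration, not the CloseGraph implementation: graphs are tiny vertex-labelled undirected graphs, containment is checked by brute force over all vertex mappings, and the dataset, the patterns and the helper names (`graph`, `contains`, `support`, `is_closed`) are all assumptions made up for the example.

```python
from itertools import permutations

def graph(labels, edges):
    """Build a tiny undirected graph: vertex-label dict plus a set of edges."""
    return labels, {frozenset(e) for e in edges}

def contains(big, small):
    """True if `small` maps into `big` (brute force; only for tiny graphs)."""
    s_labels, s_edges = small
    b_labels, b_edges = big
    nodes = list(s_labels)
    for image in permutations(b_labels, len(nodes)):
        m = dict(zip(nodes, image))
        labels_ok = all(s_labels[v] == b_labels[m[v]] for v in nodes)
        edges_ok = all(frozenset(m[v] for v in e) in b_edges for e in s_edges)
        if labels_ok and edges_ok:
            return True
    return False

def support(pattern, db):
    """Number of dataset graphs that contain the pattern."""
    return sum(contains(g, pattern) for g in db)

def is_closed(pattern, candidates, db, min_sup):
    """Frequent, and no larger candidate containing it has the same support."""
    sup = support(pattern, db)
    if sup < min_sup:
        return False
    for other in candidates:
        larger = len(other[1]) > len(pattern[1])
        if larger and contains(other, pattern) and support(other, db) == sup:
            return False
    return True

# Hypothetical dataset: the paths A-B-C and A-B-C-D.
db = [
    graph({1: "A", 2: "B", 3: "C"}, [(1, 2), (2, 3)]),
    graph({1: "A", 2: "B", 3: "C", 4: "D"}, [(1, 2), (2, 3), (3, 4)]),
]
ab = graph({"a": "A", "b": "B"}, [("a", "b")])
abc = graph({"a": "A", "b": "B", "c": "C"}, [("a", "b"), ("b", "c")])

print(support(ab, db), is_closed(ab, [ab, abc], db, 2))     # 2 False
print(support(abc, db), is_closed(abc, [ab, abc], db, 2))   # 2 True
```

Here A-B is frequent but not closed, because its super-graph A-B-C has the same support of 2, mirroring the figure above.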
Closed Frequent Itemset
• A frequent sub-graph is non-closed when some super-graph has the same frequency: the sub-graph occurs in exactly the graphs where that super-graph occurs.
• So we do not need to extend such sub-graphs again, which would only lead to meaningless candidates
• And since the sub-graph occurs whenever the super-graph occurs, we are not ruling them out either.
• This reduces the number of patterns generated.
  – Among 1,000,000 frequent candidates only 2000 are closed in the AIDS antiviral screen dataset.
CloseGraph
• Mines all closed frequent patterns in a given graph dataset.
• Pattern growth approach
• Built on top of gSpan
• Features added to gSpan
  – Equivalent occurrence and extended frequency counting
  – Early termination
• Outperforms gSpan by a factor of 4 to 10.
Equivalent Occurrence
• The target is to return from the recursive call as early as possible – early termination
• Equivalent occurrence is the condition for early termination.
• If we find a super-graph (or an isomorphic super-graph) that has the same support count as the sub-graph, we return.
• A super-graph with the same support count means that wherever the small graph occurred, the large graph also occurred (Equivalent Occurrence)
Equivalent Occurrence
• CloseGraph introduces one more parameter than gSpan
  – Extended frequency count
  – L(g, g', G)
• The number of occurrences of the super-graph g' (or its isomorphisms) in G that contain g, or one of its isomorphic copies, as a sub-graph.
  – The super-graph g' must be a one-edge extension of g.
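One way to make the extended frequency count concrete is to count embeddings. The sketch below assumes that an occurrence of g in G is a label-preserving injective vertex mapping, writes I(g, G) for the number of such occurrences, and takes L(g, g', G) to be the number of them that can be extended to an occurrence of g'. The notation I, the helper names and the toy graphs are assumptions for illustration, not taken from the slides.

```python
from itertools import permutations

def graph(labels, edges):
    """Build a tiny undirected graph: vertex-label dict plus a set of edges."""
    return labels, {frozenset(e) for e in edges}

def embeddings(small, big):
    """All label-preserving injective vertex maps that keep every edge."""
    s_labels, s_edges = small
    b_labels, b_edges = big
    nodes = list(s_labels)
    found = []
    for image in permutations(b_labels, len(nodes)):
        m = dict(zip(nodes, image))
        labels_ok = all(s_labels[v] == b_labels[m[v]] for v in nodes)
        edges_ok = all(frozenset(m[v] for v in e) in b_edges for e in s_edges)
        if labels_ok and edges_ok:
            found.append(m)
    return found

def occurrences(g, G):
    """I(g, G): number of occurrences (embeddings) of g in G."""
    return len(embeddings(g, G))

def extended_occurrences(g, g_ext, G):
    """L(g, g', G): occurrences of g in G that extend to an occurrence of g'."""
    rhos = embeddings(g, g_ext)   # ways g sits inside g'
    exts = embeddings(g_ext, G)   # occurrences of g' in G
    count = 0
    for f in embeddings(g, G):
        if any(all(f2[rho[v]] == f[v] for v in f) for rho in rhos for f2 in exts):
            count += 1
    return count

# Hypothetical patterns: g = A-B and its one-edge extension g' = A-B-C.
g = graph({"a": "A", "b": "B"}, [("a", "b")])
g_ext = graph({"a": "A", "b": "B", "c": "C"}, [("a", "b"), ("b", "c")])

# G1 is the path A-B-C; G2 is the path B-A-B-C (vertices 5-1-2-3).
G1 = graph({1: "A", 2: "B", 3: "C"}, [(1, 2), (2, 3)])
G2 = graph({1: "A", 2: "B", 3: "C", 5: "B"}, [(5, 1), (1, 2), (2, 3)])

print(occurrences(g, G1), extended_occurrences(g, g_ext, G1))   # 1 1
print(occurrences(g, G2), extended_occurrences(g, g_ext, G2))   # 2 1
```

In G1 every occurrence of g extends to g', so the two counts agree. In G2 the occurrence that uses vertex 5 cannot be extended, so the counts differ and g and g' do not occur equivalently there.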
Early termination using Equivalent Occurrence
Figure: graph dataset D with the patterns g1, g2 and g3. Each pattern shown is labelled "Frequency 2 + 1 + 0 = 3"; the extended patterns are labelled "Extended from g1" and "Extended from g2", and the final pattern "Frequency apparently 3".
1. We did not extend g1, and yet its extension is generated with the appropriate frequency count.
2. This is because of equivalent occurrence: we do not need to extend g1, we can extend the super-graph g2 instead.
Early Termination Failure
Support Count Blue-Red = 2
Support Count Blue-Red-Yellow = 2
False assumption: equivalent occurrence, so extend Blue-Red-Yellow instead of Blue-Red.
We shall miss this frequent pattern, which can only be extended from Blue-Red.
Detecting Failure
• Passively – do not explicitly look for it
• Assume there is no failure and continue
• After generation look for certain characteristics
Figure labels:
  – Frequent: extend
  – Frequent and equivalent: extend this instead of g1
  – Both non-frequent and have a common edge
Performance Comparison
• All comparisons between gSpan and CloseGraph were done with the following configuration
  – Intel Pentium IV 1.7GHz
  – 1GB RAM
  – RedHat Linux
  – G++ compiler with STL support
  – AIDS Antiviral screen compound dataset
Figure: runtime (sec, axis from 10 to 10000) versus minimum support (0 to 0.08) for CloseGraph and gSpan.
Figure: number of patterns (x1k, axis from 10 to 10000000) versus minimum support (0.02 to 0.08) for closed frequent graphs and frequent graphs.
Thank You