Close Graph (Improvements of gSpan) Sayeed Mahmud

Post on 03-Aug-2015


Page 1: Close Graph

Close Graph (Improvements of gSpan)

Sayeed Mahmud

Page 2: Close Graph

gSpan – where we left off

• Two graph pattern mining techniques
  – Apriori based
  – Pattern growth based

• gSpan is based on the pattern growth philosophy

• Reduced two main bottlenecks of Apriori-based techniques
  – Joining two k-edge graphs to generate a (k+1)-edge graph
  – Checking the frequency of each generated candidate separately

Page 3: Close Graph

Pattern Growth before gSpan

• gSpan improved several other aspects of previous pattern growth approaches (like NaiveGraph).

• An n-edge graph can be extended in n ways

• Some of the extended graphs are isomorphic to each other

• These extended graphs are duplicate graphs

• With naïve approaches it was necessary to check these duplicate graphs

Page 4: Close Graph

gSpan over Naive

• Reduced the generation of duplicate graphs
  – Only extends along the rightmost path

• No need to search for duplicates
  – They can be detected from the DFS codes

• Eliminates duplicate graphs and still guarantees that no frequent candidate is left out
  – DFS Code Covering property
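The rightmost-only extension can be sketched as follows. This is a minimal illustration, not gSpan's actual implementation: the DFS code is assumed to be a list of (i, j) pairs of DFS discovery indices, labels are left out because they do not affect the rightmost path, and the names rightmost_path and extension_sites are made up for this example.

```python
from typing import List, Tuple

# One DFS-code entry: (i, j) are the DFS discovery indices of the endpoints.
# i < j is a forward edge (discovers vertex j), i > j is a backward edge.
Edge = Tuple[int, int]


def rightmost_path(dfs_code: List[Edge]) -> List[int]:
    """Vertices from the root (0) down to the rightmost vertex."""
    if not dfs_code:
        return []
    path: List[int] = []
    target = max(j for i, j in dfs_code if i < j)  # last discovered vertex
    # Walk the code backwards, following forward edges up to the root.
    for i, j in reversed(dfs_code):
        if i < j and j == target:
            path.append(j)
            target = i
    path.append(0)
    return path[::-1]


def extension_sites(dfs_code: List[Edge]) -> Tuple[List[Edge], List[Edge]]:
    """Places where a single new edge may be attached (hypothetical helper)."""
    path = rightmost_path(dfs_code)
    rightmost = path[-1]
    existing = {frozenset(edge) for edge in dfs_code}
    # Backward edges: from the rightmost vertex to another vertex on the path,
    # skipping edges the pattern already has.
    backward = [(rightmost, v) for v in path[:-1]
                if frozenset((rightmost, v)) not in existing]
    # Forward edges: from any vertex on the path to a brand-new vertex.
    new_vertex = rightmost + 1
    forward = [(v, new_vertex) for v in reversed(path)]
    return backward, forward


code = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 1), (1, 4)]
print(rightmost_path(code))   # [0, 1, 4]
print(extension_sites(code))  # ([(4, 0)], [(4, 5), (1, 5), (0, 5)])
```

Vertices 2 and 3 are off the rightmost path, so no new edge is attached to them; that restriction is what cuts down the duplicate candidates.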

Page 5: Close Graph

Drawbacks of gSpan

• Competitive performance edge over previous pattern growth algorithms

• Performance degrades when the graphs are large

• A graph with 64 edges may have 2^64 frequent sub-graphs.

• gSpan may have to extend a lot of them.

• Implementation bottlenecks (DFS-based implementation)
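To get a feel for the size of that bound, here is the arithmetic, assuming the count is taken over every subset of the 64 edges (so it is an upper bound that also includes the empty and the disconnected subsets):

```python
edges = 64

# Each edge is either kept or dropped, giving 2**edges edge subsets.
upper_bound = 2 ** edges
print(f"{upper_bound:,}")  # 18,446,744,073,709,551,616
```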

Page 6: Close Graph

Drawbacks of gSpan

• Most of the generated frequent candidates contain no helpful information for analysis
  – They just share a common frequency
  – Example: AIDS Antiviral screen dataset
    • 1,000,000 frequent sub-graphs at 5% minimum support
    • Only 422 are meaningful

• gSpan has no terminating mechanism, so it will generate all of these frequent candidates.

• Such a huge number of frequent candidates is difficult to analyse.

Page 7: Close Graph

Closed Frequent Itemset

• Instead of mining all frequent sub-graphs, we can get away with mining only the closed frequent graphs.

• A graph is closed frequent if it is frequent and none of its proper super-graphs has the same frequency.

[Figure: two cases. A graph with frequency 2 whose super-graph also has frequency 2 is not closed. A graph with no super-graph of the same frequency is closed.]
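The definition can be tried out on a toy example. The sketch below makes a strong simplifying assumption that is not in the slides: every vertex label is unique, so a pattern can be stored as a set of labelled edges and "is a sub-graph of" becomes a plain subset test (no isomorphism check). The function name closed_patterns and the support values are made up for illustration.

```python
from typing import Dict, FrozenSet, Tuple

LabelledEdge = Tuple[str, str]          # (vertex label, vertex label)
Pattern = FrozenSet[LabelledEdge]       # assumption: labels are unique


def closed_patterns(support: Dict[Pattern, int],
                    min_support: int) -> Dict[Pattern, int]:
    """Keep frequent patterns with no proper super-pattern of equal support."""
    frequent = {p: s for p, s in support.items() if s >= min_support}
    closed = {}
    for p, s in frequent.items():
        # p < q is the proper-subset test on frozensets.
        has_equal_super = any(p < q and s == t for q, t in frequent.items())
        if not has_equal_super:
            closed[p] = s
    return closed


ab = frozenset({("a", "b")})
bc = frozenset({("b", "c")})
abc = frozenset({("a", "b"), ("b", "c")})

support = {ab: 2, bc: 3, abc: 2}
result = closed_patterns(support, min_support=2)
print(ab in result, bc in result, abc in result)  # False True True
```

Here a-b is not closed because its super-graph a-b-c has the same frequency (2), while b-c is closed because its only super-graph is less frequent.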

Page 8: Close Graph

Closed Frequent Itemset

• A frequent graph is non-closed when some super-graph has the same frequency: the sub-graph occurs wherever the super-graph occurs, and no more often.

• So we don’t need to extend such a sub-graph again, which would only lead to meaningless candidates

• And because the sub-graph occurs wherever the super-graph occurs, we are not ruling it out either.

• This reduces the number of patterns generated.
  – Among 1,000,000 frequent candidates, only 2000 are closed in the AIDS antiviral screen dataset.
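That nothing is ruled out can be checked on a small example: the frequency of a non-closed frequent pattern equals the largest frequency among the closed patterns that contain it. The sketch below again assumes unique vertex labels, so patterns are sets of labelled edges and containment is a subset test; support_from_closed and the numbers are illustrative only.

```python
from typing import Dict, FrozenSet, Optional, Tuple

Pattern = FrozenSet[Tuple[str, str]]    # assumption: labels are unique


def support_from_closed(pattern: Pattern,
                        closed: Dict[Pattern, int]) -> Optional[int]:
    """Recover a pattern's support from the closed patterns alone.

    Returns None when no closed pattern contains it (it is not frequent).
    """
    supports = [s for q, s in closed.items() if pattern <= q]
    return max(supports) if supports else None


# Closed frequent patterns of a toy dataset (hypothetical values).
closed = {
    frozenset({("a", "b"), ("b", "c")}): 2,
    frozenset({("b", "c")}): 3,
}

print(support_from_closed(frozenset({("a", "b")}), closed))  # 2
print(support_from_closed(frozenset({("b", "c")}), closed))  # 3
print(support_from_closed(frozenset({("c", "d")}), closed))  # None
```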

Page 9: Close Graph

CloseGraph

• Mines all closed frequent patterns in a given graph dataset.

• Pattern growth approach

• Built on top of gSpan

• Features added to gSpan
  – Equivalent Occurrence and Extended frequency counting
  – Early termination

• Outperforms gSpan by a factor of 4 to 10.

Page 10: Close Graph

Equivalent Occurrence

• The goal is to return from the recursive call as early as possible (early termination)

• Equivalent Occurrence is the condition for early termination.

• If we find a super-graph (or an isomorphic super-graph) that has the same support count as the sub-graph, we return.

• A super-graph with the same support count means that wherever the large graph occurred the small graph also occurred, and nowhere else (Equivalent Occurrence)

Page 11: Close Graph

Equivalent Occurrence

• CloseGraph introduces one more parameter than gSpan
  – Extended frequency count
  – L(g, g’, G)

• Counts the occurrences in G of the super-graph g’ (or a graph isomorphic to it) that contain g (or a graph isomorphic to it) as a sub-graph.
  – The super-graph g’ must be g extended by one edge.
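A brute-force sketch of this count on toy graphs is given below. It rests on assumptions the slides do not spell out: an occurrence is taken to be an embedding (a label- and edge-preserving injective vertex mapping), L(g, g’, G) is read as the number of embeddings of g in G that can be extended to an embedding of g’, and equivalent occurrence is read as "every occurrence of g in the dataset can be extended". All function names are hypothetical, and the enumeration is only practical for very small graphs.

```python
from itertools import permutations
from typing import List, Set, Tuple

# (vertex labels, undirected edges given as index pairs)
Graph = Tuple[List[str], Set[Tuple[int, int]]]


def has_edge(graph: Graph, u: int, v: int) -> bool:
    return (u, v) in graph[1] or (v, u) in graph[1]


def embeddings(pattern: Graph, graph: Graph) -> List[Tuple[int, ...]]:
    """All injective mappings of pattern vertices that keep labels and edges."""
    p_labels, p_edges = pattern
    g_labels, _ = graph
    found = []
    for image in permutations(range(len(g_labels)), len(p_labels)):
        labels_ok = all(p_labels[i] == g_labels[image[i]]
                        for i in range(len(p_labels)))
        edges_ok = all(has_edge(graph, image[u], image[v]) for u, v in p_edges)
        if labels_ok and edges_ok:
            found.append(image)
    return found


def occurrences(g: Graph, graph: Graph) -> int:
    return len(embeddings(g, graph))


def extended_occurrences(g: Graph, g_ext: Graph, graph: Graph) -> int:
    """L(g, g', G) under the reading above.

    Assumes the vertices of g are the first len(g) vertices of g_ext.
    """
    k = len(g[0])
    prefixes = {image[:k] for image in embeddings(g_ext, graph)}
    return sum(1 for image in embeddings(g, graph) if image in prefixes)


def equivalent_occurrence(g: Graph, g_ext: Graph, dataset: List[Graph]) -> bool:
    total = sum(occurrences(g, graph) for graph in dataset)
    extended = sum(extended_occurrences(g, g_ext, graph) for graph in dataset)
    return total == extended


g = (["a", "b"], {(0, 1)})                   # a-b
g_ext = (["a", "b", "c"], {(0, 1), (1, 2)})  # a-b-c, one edge added to g

G1 = (["a", "b", "c"], {(0, 1), (1, 2)})
G2 = (["a", "b", "c", "b"], {(0, 1), (1, 2), (0, 3)})

print(equivalent_occurrence(g, g_ext, [G1]))      # True
print(equivalent_occurrence(g, g_ext, [G1, G2]))  # False
```

In G2 the edge a-b occurs twice but only one of those occurrences continues to c, so under this reading the two patterns do not occur equivalently there.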

Page 12: Close Graph

Early termination using Equivalent Occurrence

Graph Dataset D:

[Figure: graph dataset D and the patterns g1, g2 and g3, with extensions marked "Extended from g1" and "Extended from g2". The patterns are labelled "Frequency 2 + 1 + 0 = 3"; the last one is labelled "Frequency apparently 3".]

1. We didn’t extend g1, and yet the pattern is generated with the appropriate frequency count.

2. This is because of equivalent occurrence: we don’t need to extend g1, we can extend the super-graph g2 instead.

Page 13: Close Graph

Early Termination Failure

Support Count Blue-Red = 2

Support Count Blue-Red-Yellow = 2

False assumption of equivalent occurrence: extend Blue-Red-Yellow instead of Blue-Red!

We would then miss this frequent pattern, which can only be extended from Blue-Red.

Page 14: Close Graph

Detecting Failure

• Passively – do not explicitly look for it

• Assume there is no failure and continue

• After generation, look for certain characteristics

[Figure labels: "Frequent: extend"; "Frequent and equivalent: extend this instead of g1"; "Both non-frequent and have a common edge".]

Page 15: Close Graph

Performance Comparison

• All comparisons between gSpan and CloseGraph were done in the following configuration
  – Intel Pentium IV 1.7GHz
  – 1GB RAM
  – RedHat Linux
  – G++ compiler, implementation with STL support
  – AIDS Antiviral screen compound dataset

Page 16: Close Graph

[Figure: runtime (sec) versus minimum support, for CloseGraph and gSpan.]

Page 17: Close Graph

[Figure: number of patterns (x1k) versus minimum support, for closed frequent graphs and frequent graphs.]

Page 18: Close Graph

Thank You