Lei Shi, Department of Computer Science and Engineering, State University of New York at Buffalo
University at Buffalo, The State University of New York
Lei Shi
Department of Computer Science and Engineering
State University of New York at Buffalo
Frequent Subgraph / Substructure Mining
Seminar 2009
Outline
Introduction
Apriori-based Subgraph Mining
Pattern Growth Subgraph Mining
Summary
Graphs are everywhere
Graph Mining Problems
Graph Pattern Mining
• Frequent subgraph pattern mining
• Pattern summarization
• Optimal graph patterns
• Graph patterns with constraints
• Approximate graph patterns
Graph Classification
• Graph clustering
• Important node identification
• Bridge and hub identification
Other Important Topics
• Graph compression
• Graph model
• Social network analysis
Subgraph Pattern Mining
Frequent subgraph
• A (sub)graph is frequent if its support (occurrence frequency) in a given dataset is no less than a minimum support threshold
Applications of subgraph pattern mining
• Mining biochemical structures
• Program control flow analysis
• Mining XML structures or Web communities
• Building blocks for graph classification, clustering, compression, comparison, and correlation analysis
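The support definition above can be made concrete with a small brute-force sketch. It is only an illustration: the graph encoding (a dict of vertex labels plus a set of undirected edges), the helper names `contains`, `support`, and `edges`, and the toy dataset are all assumptions made here, and the exhaustive matching is exponential, so it only suits very small graphs.

```python
from itertools import permutations

# Assumed encoding: a graph is (labels, edges), where labels maps
# vertex id -> label and edges is a set of frozenset({u, v}) pairs.

def contains(graph, pattern):
    """Return True if `pattern` occurs in `graph` as a subgraph."""
    g_labels, g_edges = graph
    p_labels, p_edges = pattern
    p_nodes = list(p_labels)
    # Try every injective mapping of pattern vertices onto graph vertices.
    for image in permutations(g_labels, len(p_nodes)):
        m = dict(zip(p_nodes, image))
        labels_ok = all(p_labels[v] == g_labels[m[v]] for v in p_nodes)
        edges_ok = all(frozenset(m[v] for v in e) in g_edges for e in p_edges)
        if labels_ok and edges_ok:
            return True
    return False

def support(dataset, pattern):
    """Number of graphs in the dataset that contain the pattern."""
    return sum(contains(g, pattern) for g in dataset)

def edges(*pairs):
    return {frozenset(p) for p in pairs}

# Hypothetical toy dataset of three labeled graphs.
dataset = [
    ({1: "A", 2: "B", 3: "C"}, edges((1, 2), (2, 3), (1, 3))),
    ({1: "A", 2: "B", 3: "C"}, edges((1, 2), (1, 3))),
    ({1: "A", 2: "A", 3: "B"}, edges((1, 2), (2, 3))),
]
pattern_ab = ({"p": "A", "q": "B"}, edges(("p", "q")))
pattern_bc = ({"p": "B", "q": "C"}, edges(("p", "q")))

min_sup = 2
print(support(dataset, pattern_ab) >= min_sup)  # A-B edge: support 3, frequent
print(support(dataset, pattern_bc) >= min_sup)  # B-C edge: support 1, not frequent
```

With a minimum support of 2, the A-B edge is frequent because it appears in all three graphs, while the B-C edge appears only in the first graph.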
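Frequent Subgraph Example
[Figure: a graph dataset of three labeled graphs (1), (2), (3) with vertex labels A, B, C, followed by candidate subgraphs and a row of support counts (extracted as "331"). The drawings did not survive extraction.]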
Key Challenges in Subgraph Mining
Graph isomorphism
• Detect whether two graphs are identical in structure
Graph representation (canonical labeling)
• A canonical label is a unique code for a given graph.
• The canonical label should be the same no matter how a graph is represented, as long as the graphs have the same topological structure and the same labeling of edges and vertices.
Subgraph candidate generation
• Generate candidate frequent subgraphs from the dataset
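As a rough illustration of canonical labeling, the sketch below tries every vertex ordering, flattens the vertex labels and the upper triangle of the adjacency matrix into a string, and keeps the lexicographically smallest string. The function name `canonical_label`, the single-character labels, the "0" marker for a missing edge, and the choice of the smallest (rather than largest) string are assumptions of this sketch, not details taken from the slides. The search is factorial in the number of vertices.

```python
from itertools import permutations

def canonical_label(labels, edges):
    """labels: list where labels[i] is the label of vertex i.
    edges: dict mapping (i, j) vertex pairs to an edge label.
    Assumes single-character labels; "0" marks a missing edge."""
    n = len(labels)
    sym = {}
    for (i, j), lab in edges.items():
        sym[(i, j)] = sym[(j, i)] = lab
    best = None
    for order in permutations(range(n)):
        vertex_part = "".join(labels[v] for v in order)
        # Upper triangle of the adjacency matrix, read column by column.
        edge_part = "".join(
            sym.get((order[r], order[c]), "0")
            for c in range(1, n)
            for r in range(c)
        )
        code = vertex_part + " " + edge_part
        if best is None or code < best:
            best = code
    return best

# The same path A -x- B -y- C under two different vertex numberings.
g1 = canonical_label(["A", "B", "C"], {(0, 1): "x", (1, 2): "y"})
g2 = canonical_label(["C", "A", "B"], {(1, 2): "x", (2, 0): "y"})
# A different graph: the edge labels are swapped.
g3 = canonical_label(["A", "B", "C"], {(0, 1): "y", (1, 2): "x"})

print(g1)        # ABC x0y
print(g1 == g2)  # True: isomorphic graphs share one label
print(g1 == g3)  # False
```

Once every graph carries such a label, an isomorphism test reduces to a string comparison, which is why the choice of representation matters so much for mining cost.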
Subgraph Mining Approaches
Apriori-based
• AGM/AcGM: Inokuchi et al. (PKDD’00)
• FSG: Kuramochi and Karypis (ICDM’01)
M. Kuramochi and G. Karypis. Frequent subgraph discovery. In ICDM’01, pages 313-320, Nov. 2001.
• PATH#: Vanetik and Gudes (ICDM’02, ICDM’04)
• FFSM: Huan et al. (ICDM’03) and SPIN: Huan et al. (KDD’04)
• FTOSM: Horvath et al. (KDD’06)
Pattern growth based
• Subdue: Holder et al. (KDD’94)
• MoFa: Borgelt and Berthold (ICDM’02)
• gSpan: Yan and Han (ICDM’02)
Yan, X. and Han, J. 2002. gSpan: Graph-Based Substructure Pattern Mining. In Proceedings of the 2002 IEEE International Conference on Data Mining (ICDM’02), December 9-12, 2002. IEEE Computer Society, Washington, DC, 721.
• Gaston: Nijssen and Kok (KDD’04)
• CMTreeMiner: Chi et al. (TKDE’05)
• LEAP: Yan et al. (SIGMOD’08)
Outline
Introduction and Background
Apriori-based Subgraph Mining
Pattern Growth Subgraph Mining
Summary
Apriori-based Approach
FSG: M. Kuramochi and G. Karypis. Frequent subgraph discovery. In ICDM’01, Nov. 2001.
Flattened representation as canonical labeling
Apriori-based method to generate subgraph candidates
Graph Representation in FSG
Flattened Representation
Graph Representation in FSG
Flattened Representation
Lexicographic order or dictionary order
Apriori-based method
Apriori Property
• If a graph is frequent, all of its subgraphs are frequent.
Candidate Generation
• Create a set of candidates of size k+1
- from two given frequent k-subgraphs
- containing the same (k-1)-subgraph
- which can result in several candidates of size k+1
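The join-and-prune idea above can be sketched in a few lines under one strong simplifying assumption: every vertex has a unique label, so a subgraph is fully described by its set of edges and no isomorphism test is needed. FSG itself has to handle repeated labels through canonical labels, which this sketch deliberately skips. The names `is_connected` and `generate_candidates` are hypothetical, and size is counted in edges.

```python
from itertools import combinations

def is_connected(edge_set):
    """True if the edges form a single connected component."""
    edge_list = list(edge_set)
    if not edge_list:
        return False
    seen = set(edge_list[0])
    grew = True
    while grew:
        grew = False
        for e in edge_list:
            if seen & e and not e <= seen:
                seen |= e
                grew = True
    return all(e <= seen for e in edge_list)

def generate_candidates(frequent_k):
    """Join pairs of frequent k-edge subgraphs that share a (k-1)-edge core."""
    frequent_k = set(frequent_k)
    candidates = set()
    for g1, g2 in combinations(frequent_k, 2):
        if len(g1 & g2) != len(g1) - 1:
            continue  # the two subgraphs do not share a (k-1)-edge core
        cand = g1 | g2
        if not is_connected(cand):
            continue
        # Apriori pruning: every connected k-edge subgraph must be frequent.
        subs = (cand - {e} for e in cand)
        if all(s in frequent_k for s in subs if is_connected(s)):
            candidates.add(cand)
    return candidates

# Edges over uniquely labeled vertices A, B, C, D.
ab, bc, cd, ca = (frozenset(p) for p in ("AB", "BC", "CD", "CA"))

frequent_1 = {frozenset({ab}), frozenset({bc}), frozenset({cd})}
level_2 = generate_candidates(frequent_1)
# For the demo, pretend every level-2 candidate passed the support count.
level_3 = generate_candidates(level_2)
print(len(level_2), len(level_3))
```

In a full miner each candidate set would be counted against the dataset before the next level is generated; only the candidates that meet the minimum support are passed on.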
Apriori-based method
Candidate generation example
Apriori-based method
Flowchart
Apriori-based method
Experimental result: chemical compound dataset containing 340 compounds and 24 different atoms (vertices)
Outline
Introduction
Apriori-based Subgraph Mining
Pattern Growth Subgraph Mining
Summary
Motivation of gSpan
Weakness of the Apriori-based approach
• Generating size-(k+1) subgraph candidates from size-k frequent subgraphs is complicated and costly.
• Pruning false positives is costly, because subgraph isomorphism is an NP-complete problem.
gSpan: Graph-Based Substructure Pattern Mining
• Changes the way a graph is represented (DFS: Depth First Search)
• Uses pattern growth to generate new subgraph candidates
gSpan: Graph-Based Substructure Pattern Mining
DFS (Depth First Search) Code
• First Step: traverse the graph depth-first and use the edges, in traversal order, to represent the graph.
• Second Step: DFS Lexicographic Order
Pattern Growth subgraph generation
DFS code
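An edge is represented by a 5-tuple $(i, j, l_i, l_{(i,j)}, l_j)$, where $i$ and $j$ are the DFS discovery indices of its endpoints, $l_i$ and $l_j$ are their vertex labels, and $l_{(i,j)}$ is the edge label. Example: $(0, 1, X, a, Y)$.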
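A minimal sketch of how one DFS traversal yields a DFS code, with each edge written as (i, j, l_i, l_(i,j), l_j). The function name `dfs_code`, the input encoding, and the neighbor ordering used to break ties are assumptions of this sketch. It produces the code for one particular traversal only; gSpan additionally selects the minimum code under its DFS lexicographic order, which is not implemented here.

```python
def dfs_code(vlabels, elabels, start):
    """vlabels: dict vertex -> label; elabels: dict (u, v) -> edge label.
    Returns the list of 5-tuples for one DFS traversal from `start`."""
    adj = {v: {} for v in vlabels}
    for (u, v), lab in elabels.items():
        adj[u][v] = lab
        adj[v][u] = lab

    index = {start: 0}   # vertex -> DFS discovery index
    used = set()         # edges already written to the code
    code = []

    def visit(v):
        # Backward edges: from v to already discovered vertices,
        # smallest discovery index first.
        back = sorted((index[w], w) for w in adj[v]
                      if w in index and frozenset((v, w)) not in used)
        for iw, w in back:
            used.add(frozenset((v, w)))
            code.append((index[v], iw, vlabels[v], adj[v][w], vlabels[w]))
        # Forward edges: discover new vertices (tie-break order is an assumption).
        for w in sorted(adj[v], key=lambda w: (adj[v][w], vlabels[w], w)):
            if w not in index:
                index[w] = len(index)
                used.add(frozenset((v, w)))
                code.append((index[v], index[w], vlabels[v], adj[v][w], vlabels[w]))
                visit(w)

    visit(start)
    return code

# Hypothetical triangle with vertex labels X, Y, X and edge labels a, b, a.
vlabels = {"u": "X", "v": "Y", "w": "X"}
elabels = {("u", "v"): "a", ("v", "w"): "b", ("w", "u"): "a"}
code = dfs_code(vlabels, elabels, "u")
for edge in code:
    print(edge)
```

For this triangle the traversal writes two forward edges, (0, 1, X, a, X) and (1, 2, X, b, Y), followed by the backward edge (2, 0, Y, a, X) that closes the cycle.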
DFS code
Second Step: DFS Lexicographic Order
Pattern Growth Approach
Pattern Growth (free extension)
Pattern Growth Approach
Duplicate Graphs
Pattern Growth Approach
Free extension
Pattern Growth Approach
Rightmost extension
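In gSpan, rightmost extension restricts growth to the rightmost path of the DFS tree: backward edges may only start at the rightmost vertex and end on the rightmost path, and forward edges may only start from a vertex on the rightmost path. The sketch below lists the allowed positions as index pairs only; the function names are hypothetical, labels are left out, and simple graphs (no parallel edges) are assumed.

```python
def rightmost_path(code):
    """Return DFS indices from the root to the rightmost vertex."""
    # Forward edges have i < j; each one records the DFS-tree parent of j.
    parent = {j: i for (i, j, *_rest) in code if i < j}
    v = max(parent, default=0)  # rightmost vertex = last discovered vertex
    path = [v]
    while v in parent:
        v = parent[v]
        path.append(v)
    return path[::-1]

def rightmost_extensions(code):
    """Return (backward, forward) lists of allowed (i, j) index pairs."""
    path = rightmost_path(code)
    rm = path[-1]
    existing = {frozenset((i, j)) for (i, j, *_rest) in code}
    # Backward: from the rightmost vertex to an earlier vertex on the path.
    backward = [(rm, v) for v in path[:-1]
                if frozenset((rm, v)) not in existing]
    # Forward: from any vertex on the rightmost path to a new vertex.
    new_vertex = rm + 1
    forward = [(v, new_vertex) for v in reversed(path)]
    return backward, forward

# A path 0 - 1 - 2: every vertex lies on the rightmost path.
chain = [(0, 1, "X", "a", "Y"), (1, 2, "Y", "b", "X")]
print(rightmost_extensions(chain))

# A branch at vertex 0: vertices 1 and 2 are off the rightmost path.
branch = [(0, 1, "X", "a", "Y"), (1, 2, "Y", "b", "X"), (0, 3, "X", "a", "Z")]
print(rightmost_extensions(branch))
```

In the second example vertices 1 and 2 receive no extensions at all, which is how the restriction cuts down the duplicate graphs that free extension would generate.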
Pattern Growth Approach
Examples (cont.)
gSpan
gSpan
Pattern Growth Approach
Experimental result using chemical data
• 340 molecules, with 66 atom types and 4 bond types as labels
• On average, each graph has only 27 vertices and 28 edges
Summary
Graph representation: flattened representation vs. DFS code
Generation of candidate patterns: Apriori vs. pattern growth
Pattern-Growth Approach
Frequent Graph Pattern
Given a graph dataset D, find every subgraph g such that freq(g) is no less than the minimum support threshold, where freq(g) is the percentage of graphs in D that contain g.
Problem 1: Exponential Pattern Set
Problem 2: Threshold Setting
Difference between frequent itemset and frequent subgraph discovery
Frequent itemset discovery
Framework of Subgraph Mining Algorithms
Search order
• breadth vs. depth
• complete vs. incomplete
Generation of candidate patterns
• apriori vs. pattern growth
Discovery order of patterns
• DFS order
• path, tree, graph
Elimination of duplicate subgraphs
• passive vs. active
Support calculation
• embedding store or not
Frequent Subgraph
Examples:
Example (cont.)
Outline
Introduction and Background
Apriori-based Subgraph Mining
Pattern Growth Subgraph Mining
Summary
DFS code
Yan, X. and Han, J. 2002. gSpan: Graph-Based Substructure Pattern Mining. In Proceedings of the 2002 IEEE International Conference on Data Mining (ICDM’02), December 9-12, 2002. IEEE Computer Society, Washington, DC, 721.
Pattern Growth Approach