survey on frequent pattern mining on graph data - slides
Sriskandarajah Suhothayan
Kasun Gajasinghe
Isuru Loku Narangoda
Subash Chaturanga
Outline
– Introduction
– Basic principles
– Solution patterns
Introduction
Graphs can be seen everywhere.
In computer science, a graph is viewed as an abstract data structure that represents relationships among data.
Graph based data mining
Graph based data mining finds useful and understandable patterns in graph representations of data.
The main subject area of graph based data mining is identifying frequently occurring subgraph patterns.
Approaches
In the recent past, significant work has been done in this subject area to develop algorithms that mine graph data efficiently.
In this paper we discuss several well-known algorithms under the following categories.
– Mathematical Graph Theory Based Approaches
– Greedy Search Based Approaches
– Inductive Logic Programming Approach
– Inductive Database Based Approaches
Applications
– Bioinformatics: mining biochemical structures, finding biologically conserved subnetworks
– Chemical compound analysis
– Web browsing pattern analysis
– Intrusion network analysis
– Mining communication networks
Basic Principles
Subgraph categories
– general subgraphs
– induced subgraphs
– connected subgraphs
Subgraph Isomorphism Problem
This asks whether there exists a one-to-one mapping from the vertices of one graph to vertices of another graph that preserves the edges.
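The definition above can be made concrete with a brute-force check. This is only a minimal sketch, assuming undirected, unlabelled graphs given as vertex and edge lists and the "general subgraph" notion (extra edges in the larger graph are allowed). The function name `is_subgraph_isomorphic` is a hypothetical helper, and the exhaustive enumeration is exponential, which is why the algorithms in this survey work hard to avoid it.

```python
from itertools import permutations

def is_subgraph_isomorphic(pattern_vertices, pattern_edges, graph_vertices, graph_edges):
    """Return True if the pattern maps one-to-one into the graph with every pattern edge preserved."""
    graph_edge_set = {frozenset(edge) for edge in graph_edges}
    pattern_vertices = list(pattern_vertices)
    # Try every injective assignment of pattern vertices to graph vertices.
    for image in permutations(graph_vertices, len(pattern_vertices)):
        mapping = dict(zip(pattern_vertices, image))
        if all(frozenset((mapping[u], mapping[v])) in graph_edge_set
               for u, v in pattern_edges):
            return True
    return False
```

For example, a triangle pattern is found inside a triangle with one extra pendant vertex, but not inside a three-vertex path.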
Basic Principles
Graph Invariants
Quantities that characterize the topological structure of a graph
– number of vertices
– degree of each vertex (the number of edges connected to the vertex)
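As a small illustration of the invariants listed above, the sketch below computes the vertex count, edge count, and sorted degree sequence of an undirected graph. The function name `graph_invariants` and the dictionary keys are assumptions made for this example. Two graphs whose invariants differ cannot be isomorphic, so such quantities serve as a cheap filter before any expensive matching.

```python
from collections import Counter

def graph_invariants(vertices, edges):
    """Compute simple invariants of an undirected graph given as vertex and edge lists."""
    degree = Counter({v: 0 for v in vertices})
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return {
        "num_vertices": len(vertices),
        "num_edges": len(edges),
        # Sorted so the result does not depend on vertex naming.
        "degree_sequence": sorted(degree.values(), reverse=True),
    }
```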
Solution Approaches
Categorization
Completeness
– complete search
– heuristic search
Subgraph isomorphism matching problem
– direct
– indirect (solves the subgraph similarity problem)
Solution Approaches
– Greedy search
– Inductive logic programming (ILP)
– Inductive database
– Complete level-wise search
– Support Vector Machine (SVM)
Greedy search
The conventional solution
Categorized into Depth-First Search (DFS) and Breadth-First Search (BFS)
Beam search
The disadvantage: as the search proceeds, it prunes branches that exceed the maximum branch number limit, so some solutions can be missed
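The pruning behaviour described above can be sketched with a generic beam search. This is an illustrative sketch under stated assumptions, not the search used by any particular system: `expand` (generates successor states), `score` (heuristic value, higher is better), `beam_width`, and `max_depth` are all hypothetical parameters supplied by the caller.

```python
import heapq

def beam_search(start, expand, score, beam_width, max_depth):
    """Breadth-first search that keeps only the beam_width best states per level."""
    frontier = [start]
    best = start
    for _ in range(max_depth):
        candidates = [child for state in frontier for child in expand(state)]
        if not candidates:
            break
        # Branches beyond the beam width are pruned here and never revisited.
        frontier = heapq.nlargest(beam_width, candidates, key=score)
        if score(frontier[0]) > score(best):
            best = frontier[0]
    return best
```

With a narrow beam, a state that looks weak early is discarded even if its successors would have scored best, which is exactly the disadvantage noted on the slide.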
Inductive logic programming (ILP)
Induction?
A combination of 'abduction' (guessing), which selects some hypotheses, and 'justification', which seeks the hypotheses that justify the observed facts.
Inductive logic programming (ILP)
positive examples + negative examples + background knowledge => hypothesis
Background knowledge
– controls the search process (prunes some search paths)
– introduces predetermined subgraph patterns
ILP can be in any of four categories
Inductive database
Subgraphs and relations among subgraphs are pre-generated and stored in an inductive database
Advantage: fast operation, as the basic patterns are pre-generated
Disadvantage: large amount of computation and memory utilization
Complete level-wise search
It is complete and direct
Here the data are not sets of items, but graphs that combine a vertex set V(G) and an edge set E(G), which include topological information.
An extended version of the Apriori algorithm is used
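To show the level-wise idea in isolation, the sketch below makes a strong simplifying assumption: vertex labels are unique within every graph, so each graph is just a set of labelled edges and "is a subgraph of" reduces to a subset test. Real graph miners cannot assume this and need subgraph isomorphism tests instead. The function name and the edge-tuple encoding are hypothetical.

```python
from itertools import combinations

def levelwise_frequent_edge_sets(graphs, min_support):
    """Apriori-style search; each graph is a set of edges over uniquely labelled vertices."""
    def support(pattern):
        return sum(1 for g in graphs if pattern <= g)

    all_edges = set().union(*graphs)
    level = {frozenset([e]) for e in all_edges}
    level = {p for p in level if support(p) >= min_support}
    frequent = {}
    while level:
        for pattern in level:
            frequent[pattern] = support(pattern)
        # Join step: merge two size-k patterns that differ in exactly one edge.
        candidates = {a | b for a, b in combinations(level, 2)
                      if len(a | b) == len(a) + 1}
        # Prune step: every size-k subset must itself be frequent (downward closure).
        level = {c for c in candidates
                 if all(c - {e} in level for e in c) and support(c) >= min_support}
    return frequent
```

Each pass builds size-(k+1) candidates only from frequent size-k patterns, which is the Apriori property the graph algorithms extend.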
Support Vector Machine (SVM)
Used for classification and regression analysis
A non-probabilistic binary linear classifier
SVM is a heuristic search and an indirect method in terms of the subgraph isomorphism problem.
Categorization
– Mathematical Graph Theory Based Approaches
– Greedy Search Based Approaches
– Inductive Logic Programming Approach
– Inductive Database Based Approaches
– Kernel Function Based Approaches
Greedy Search Based Approaches
Use heuristics to evaluate the solution.
Two major works: SUBDUE and GBI
Graph Based Induction (GBI)
Has two methods: one for chunking and the other for extracting patterns.
Can arrive at local minimum solutions, using pairwise chunking at each step with an opportunistic beam search.
Able to reconstruct the original graph as and when needed
Advantage: GBI can handle both directed and undirected labelled graphs, even with closed paths, including closed edges.
Uses an empirical graph size definition that limits continuous compression, so the graph never becomes a single vertex.
Extracts substructures and constructs a classifier.
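A single pairwise chunking step might look roughly like the sketch below. It is an assumption-laden illustration, not GBI itself: it picks the most frequent pair of adjacent vertex labels, greedily merges non-overlapping occurrences into chunk vertices, and assumes an undirected graph without self-loops. The function name `chunk_once` and the "A+B" chunk label format are hypothetical.

```python
from collections import Counter

def chunk_once(labels, edges):
    """Merge non-overlapping occurrences of the most frequent adjacent label pair."""
    pair_of = lambda u, v: tuple(sorted((labels[u], labels[v])))
    pair_counts = Counter(pair_of(u, v) for u, v in edges)
    if not pair_counts:
        return labels, edges
    best_pair, _ = pair_counts.most_common(1)[0]
    merged = {}  # original vertex -> chunk vertex (the tuple of its two members)
    for u, v in edges:
        if u in merged or v in merged:
            continue  # each vertex may join at most one chunk per step
        if pair_of(u, v) == best_pair:
            merged[u] = merged[v] = (u, v)
    new_labels = {}
    for vertex, label in labels.items():
        target = merged.get(vertex, vertex)
        new_labels[target] = "+".join(best_pair) if vertex in merged else label
    # Edges inside a chunk disappear; all other edges are rewired to the chunks.
    new_edges = [(merged.get(u, u), merged.get(v, v)) for u, v in edges
                 if merged.get(u, u) != merged.get(v, v)]
    return new_labels, new_edges
```

Because each chunk vertex keeps the pair of vertices it replaced, the original graph can be recovered by expanding the chunks again.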
SUBDUE
A graph-based relational learning system
Compresses the graphs based on the Minimum Description Length (MDL) principle
Avoids high computational complexity (uses a computationally constrained beam search)
May miss some optimal subgraphs
Reports a small number of highly interesting patterns, rather than generating a large number of patterns from which the interesting ones must then be identified.
Runtime is much larger than that of gSpan and FSG, and non-linear in the dataset size (because of how it implements the graph isomorphism test)
Mathematical Approaches
Apriori-based methods
– AGM
– FSG
Pattern Growth methods
– gSpan
Apriori-based Approach: AGM
– Used to mine “frequent induced subgraphs”
– Works with both directed and undirected graphs
– Importantly, this algorithm is not limited to connected graphs; it also supports isolated graphs.
AGM
Breadth-first search.
Creates new candidates for level k+1 by joining two graphs at level k.
AGM generates new graphs by adding a new node,
and then proceeds as per Apriori...
FSG
– FSG works better on graph data sets with more edge and vertex labels
– This is an optimized version of AGM, with added techniques for efficiency.
– FSG increases the efficiency of generating frequent subgraph candidates by introducing the Transaction ID (TID) method.
– Efficient candidate subgraph generation algorithms.
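The sketch below shows how TID lists are commonly used in Apriori-style miners, on the assumption that FSG's TID method works along these lines: each frequent subgraph remembers the IDs of the transactions (graphs) that contain it, and a candidate only has to be tested against the intersection of its parents' lists. The function name and the `contains` callback (standing in for a subgraph isomorphism test) are hypothetical.

```python
def count_support_with_tids(candidate, parent_tid_lists, graphs, contains, min_support):
    """Return the TID set of `candidate`, or None if it cannot be frequent."""
    # A graph can contain the candidate only if it contains every parent pattern.
    possible = set.intersection(*parent_tid_lists)
    if len(possible) < min_support:
        return None  # pruned without any subgraph isomorphism test
    tids = {tid for tid in possible if contains(graphs[tid], candidate)}
    return tids if len(tids) >= min_support else None
```

The expensive `contains` test runs only on graphs in the intersection, and not at all when the intersection is already too small.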
FSG
– FSG is Apriori-based and therefore uses a level-wise algorithm
– Faces two challenges:
candidate generation: the generation of size-(k+1) subgraph candidates is more complicated and costly
pruning false positives: the subgraph isomorphism test is an NP-complete problem
gSpan
– Uses Depth-First Search (DFS)
– Can be used to find frequent subgraphs one by one, from small to large ones.
– Advantages
• No candidate generation and no false-positive test
• Better space savings through DFS.
Pattern growth method
[Figure: GRAPH DATASET, panels (A), (B), (C); FREQUENT PATTERNS (MIN SUPPORT IS 2), panels (1), (2)]
Three other approaches to mining graph based data.
– Inductive Logic Programming approach
– Inductive database approach
– Kernel function based approach
ILP approach.
ILP systems construct a predictive model for a given data set by searching a large space of candidate hypotheses.
WARMR – proposed in 1998. Combines an Apriori-like level-wise search with the ILP method, but has high computational complexity.
FARMER – proposed in 2001. Runs two orders of magnitude faster than WARMR.
Inductive DB approach.
Databases that are capable of handling patterns within data. Quite different from typical databases.
Uses an interactive querying process to mine data in these databases.
MolFea is an effort in this area. It mines linear fragments in chemical compounds with better computational efficiency.
It also performs a complete search of the paths in graph data.
Kernel Function based approach
This “kernel” function basically defines the similarity between two graphs
The paper covers two efforts based on this approach, which classify graphs into binary classes using an SVM (Support Vector Machine).
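To give a feel for what a graph kernel computes, here is a deliberately simple one: the inner product of the two graphs' vertex-label histograms. This is an illustrative assumption and not one of the kernels from the efforts mentioned above; it ignores edges entirely, and the function name `label_histogram_kernel` is hypothetical.

```python
from collections import Counter

def label_histogram_kernel(labels_a, labels_b):
    """Similarity of two graphs as the inner product of their vertex-label counts."""
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Labels missing from one graph contribute zero to the sum.
    return sum(counts_a[label] * counts_b[label] for label in counts_a)
```

Because the value is an inner product of explicit feature vectors, it is symmetric and positive semi-definite, which is what an SVM requires of a kernel.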