Survey on Frequent Pattern Mining on Graph Data - Slides

Sriskandarajah Suhothayan
Kasun Gajasinghe
Isuru Loku Narangoda
Subash Chaturanga

Outline

– Introduction
– Basic principles
– Solution patterns

Introduction

Graphs can be seen everywhere. In computer science, a graph is viewed as an abstract data structure that represents relationships among data.

Graph based data mining

Graph based data mining is the task of finding useful and understandable patterns in a graph representation of data.

Its main subject area is identifying frequently occurring subgraph patterns.

Approaches

In the recent past, a significant amount of work has been done in this subject area to develop algorithms that mine graph data efficiently.

In this paper we discuss several such well-known algorithms under the following categories:

– Mathematical Graph Theory Based Approaches
– Greedy Search Based Approaches
– Inductive Logic Programming Approach
– Inductive Database Based Approaches

Applications

– Bioinformatics
  • mining biochemical structures
  • finding biologically conserved subnetworks
– Chemical compound analysis
– Web browsing pattern analysis
– Intrusion network analysis
– Mining communication networks

Basic Principles

Subgraph categories
– general subgraphs
– induced subgraphs
– connected subgraphs
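
The three subgraph categories can be told apart with a small sketch. It assumes undirected graphs stored as plain edge lists; the helper names `induced_subgraph` and `is_connected` are illustrative and do not come from the slides.

```python
def induced_subgraph(vertex_subset, edges):
    """Edges of the induced subgraph: every graph edge whose endpoints are both chosen."""
    chosen = set(vertex_subset)
    return {frozenset(e) for e in edges if set(e) <= chosen}


def is_connected(vertices, edges):
    """Check connectivity with a simple stack-based traversal."""
    vertices = set(vertices)
    if not vertices:
        return True
    adjacency = {v: set() for v in vertices}
    for e in edges:
        u, v = tuple(e)
        adjacency[u].add(v)
        adjacency[v].add(u)
    seen, stack = set(), [next(iter(vertices))]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(adjacency[node] - seen)
    return seen == vertices


# Toy graph (an assumption for illustration): triangle a-b-c plus a pendant vertex d.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]

# Induced subgraph on {a, b, c}: must keep all three triangle edges.
induced = induced_subgraph({"a", "b", "c"}, edges)

# A general subgraph on the same vertices may drop edges, here a-c.
general = {frozenset(("a", "b")), frozenset(("b", "c"))}

# The induced subgraph on {a, b, d} has only the edge a-b, so it is not connected.
disconnected = induced_subgraph({"a", "b", "d"}, edges)
```

A general subgraph may pick any subset of edges among its vertices, an induced subgraph must keep all of them, and a connected subgraph additionally requires every vertex to be reachable from every other.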

Subgraph Isomorphism Problem

This asks whether there exists a one-to-one mapping from the vertices of one graph to vertices of another graph that preserves the edges.
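
As a minimal sketch of the problem statement, the brute-force check below tries every one-to-one vertex mapping. It assumes unlabeled, undirected graphs and tests for a general (non-induced) subgraph; the function name and the toy graphs are illustrative only, and real miners avoid this exhaustive enumeration.

```python
from itertools import permutations


def is_subgraph_isomorphic(pattern_vertices, pattern_edges, graph_vertices, graph_edges):
    """Return True if the pattern maps one-to-one into the graph, preserving edges.

    Brute force over all injective mappings, so only usable on tiny graphs.
    """
    graph_edge_set = {frozenset(e) for e in graph_edges}
    pattern_vertices = list(pattern_vertices)
    for image in permutations(graph_vertices, len(pattern_vertices)):
        mapping = dict(zip(pattern_vertices, image))
        # Every pattern edge must land on an existing graph edge.
        if all(frozenset((mapping[u], mapping[v])) in graph_edge_set
               for u, v in pattern_edges):
            return True
    return False


# Toy graph: triangle a-b-c with a pendant vertex d attached to c.
graph_vertices = ["a", "b", "c", "d"]
graph_edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]

triangle = ([1, 2, 3], [(1, 2), (2, 3), (1, 3)])
square = ([1, 2, 3, 4], [(1, 2), (2, 3), (3, 4), (4, 1)])

has_triangle = is_subgraph_isomorphic(*triangle, graph_vertices, graph_edges)
has_square = is_subgraph_isomorphic(*square, graph_vertices, graph_edges)
```

The number of mappings grows factorially with the pattern size, which is why the cost of this test dominates the design of the algorithms discussed later.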

Basic Principles

Graph Invariants

Quantities that characterize the topological structure of a graph:
– number of vertices
– degree of each vertex (the number of edges connected to the vertex)
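
The invariants listed above can be computed directly, as in the sketch below. It assumes undirected graphs given as vertex and edge lists; the function name and the dictionary keys are illustrative choices. Isomorphic graphs always share these values, so a mismatch rules out isomorphism cheaply, while a match proves nothing on its own.

```python
def graph_invariants(vertices, edges):
    """Compute a few simple invariants of an undirected graph."""
    degree = {v: 0 for v in vertices}
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return {
        "num_vertices": len(vertices),
        "num_edges": len(edges),
        # Sorting makes the degree list independent of vertex naming.
        "degree_sequence": sorted(degree.values(), reverse=True),
    }


# The same graph under two different vertex namings.
inv_1 = graph_invariants(["a", "b", "c", "d"],
                         [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")])
inv_2 = graph_invariants([1, 2, 3, 4],
                         [(4, 3), (3, 2), (4, 2), (2, 1)])

# A path on four vertices has different invariants, so it cannot be isomorphic.
inv_path = graph_invariants(["w", "x", "y", "z"],
                            [("w", "x"), ("x", "y"), ("y", "z")])
```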

Solution Approaches

Categorization

– Completeness
  • complete search
  • heuristic search
– Subgraph isomorphism matching problem
  • direct
  • indirect (solves the subgraph similarity problem)

Solution Approaches

– Greedy search
– Inductive logic programming (ILP)
– Inductive database
– Complete level-wise search
– Support Vector Machine (SVM)

Greedy search

The conventional solution.

Categorized into Depth-First Search (DFS) and Breadth-First Search (BFS).

Beam search limits the number of branches kept as the search proceeds.

The disadvantage: as the search proceeds, it prunes the branches that do not fit within the maximum branch number limit, so some patterns can be missed.
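
The pruning behaviour of beam search can be illustrated with a generic sketch. The search tree, the scores, and the function signature below are hypothetical and only serve to show how a narrow beam discards a branch that later turns out to be the best one.

```python
def beam_search(start, expand, score, beam_width, max_depth):
    """Keep only the `beam_width` best states per level; return the best state seen."""
    beam = [start]
    best = start
    for _ in range(max_depth):
        candidates = [child for state in beam for child in expand(state)]
        if not candidates:
            break
        candidates.sort(key=score, reverse=True)
        beam = candidates[:beam_width]  # everything else is pruned for good
        if score(beam[0]) > score(best):
            best = beam[0]
    return best


# Hypothetical toy search tree: node -> children, with a score per node.
children = {"root": ["A", "B"], "A": ["A1"], "B": ["B1"]}
scores = {"root": 0, "A": 5, "B": 4, "A1": 6, "B1": 9}


def expand(node):
    return children.get(node, [])


# Width 1 keeps only A at the first level and never sees B1.
narrow = beam_search("root", expand, scores.get, beam_width=1, max_depth=2)

# Width 2 keeps both branches and reaches the higher-scoring B1.
wide = beam_search("root", expand, scores.get, beam_width=2, max_depth=2)
```

A wider beam lowers the chance of missing a pattern at the price of more states to evaluate per level.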

Inductive logic programming (ILP)

Induction?

A combination of 'abduction' (guessing), which selects some hypotheses, and 'justification', which checks that those hypotheses justify the observed facts.

Inductive logic programming (ILP)

positive examples + negative examples => hypothesis + background knowledge

Background knowledge is used to
– control the search process (prune some search paths)
– introduce predetermined subgraph patterns

ILP can fall into any of the four categories (complete or heuristic search, direct or indirect matching).

Inductive database

Subgraphs and relations among subgraphs are pre-generated and stored in an inductive database.

Advantage: fast operation, as the basic patterns are pre-generated.
Disadvantage: large amount of computation and memory utilization.

Complete level-wise search

It is complete and direct.

Here the data are not sets of items, but rather graphs that combine a vertex set V(G) and an edge set E(G), which include topological information.

An extended version of the Apriori algorithm is used.
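
As a hedged sketch of how the Apriori idea carries over to graphs, the code below covers only the first level of a level-wise search: it counts, for each labeled edge, the number of graphs that contain it and keeps those meeting a minimum support. Each graph is simplified to a list of edges written as pairs of vertex labels; the dataset and the function name are illustrative assumptions.

```python
from collections import Counter


def frequent_edges(graph_dataset, min_support):
    """Return labeled edges that occur in at least `min_support` graphs."""
    support = Counter()
    for edges in graph_dataset:
        # Count each distinct edge once per graph, ignoring direction.
        distinct = {tuple(sorted(edge)) for edge in edges}
        support.update(distinct)
    return {edge: count for edge, count in support.items() if count >= min_support}


# Toy dataset: each graph is a list of edges given as (label, label) pairs.
dataset = [
    [("C", "O"), ("C", "N"), ("C", "C")],
    [("O", "C"), ("C", "C")],
    [("C", "N"), ("N", "O")],
]

level_1 = frequent_edges(dataset, min_support=2)
```

By the Apriori property, a larger subgraph can only be frequent if all of its edges are, so later levels need to combine only the edges that survive this step.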

Support Vector Machine (SVM)

Used for classification and regression analysis.

A non-probabilistic binary linear classifier.

SVM is a heuristic search and an indirect method in terms of the subgraph isomorphism problem.

Categorization

– Mathematical Graph Theory Based Approaches
– Greedy Search Based Approaches
– Inductive Logic Programming Approach
– Inductive Database Based Approaches
– Kernel Function Based Approaches

Greedy Search Based Approaches

Use heuristics to evaluate the solution.

Two major works:
– SUBDUE
– GBI

Graph Based Induction (GBI)

Has two methods: one for chunking and the other for extracting patterns.

Can arrive at locally minimal solutions, using pairwise chunking at each step of the opportunistic beam search.

Able to reconstruct the original graph as and when needed.

An advantage of GBI is that it can handle both directed and undirected labelled graphs, even ones with closed paths, including closed edges.

Uses an empirical graph size definition, which limits continuous compression of the graph so that the graph never becomes a single vertex.

Extracts substructures and constructs a classifier.

SUBDUE

A graph-based relational learning system.

Compresses the graphs based on the Minimum Description Length (MDL) principle.

Does not face high computational complexity (uses a computationally constrained beam search).

May miss some optimum subgraphs.

Reports a smaller number of highly interesting patterns, rather than generating a large number of patterns from which the interesting ones need to be identified.

Runtime is much larger than that of gSpan and FSG, and non-linear in the dataset size (because of its implementation of the graph isomorphism problem).

Mathematical Approaches

Apriori-based methods
– AGM
– FSG

Pattern growth methods
– gSpan

Apriori-based Approach

AGM

– Used to mine “frequent induced subgraphs”

– Works with both directed and undirected graphs

– Importantly, this algorithm is not limited to connected graphs. It also supports unconnected ones.

AGM

Breadth-first search. Creates new candidates for level k+1 by joining two graphs at level k.

AGM generates new graphs by adding a new node:

And then proceeds as per Apriori...

FSG

– FSG works better on graph data sets with more edge and vertex labels.
– It is an optimized version of AGM with added techniques for efficiency.
– FSG increases the efficiency of candidate generation for frequent subgraphs by introducing the Transaction ID (TID) method.
– Efficient candidate subgraph generation algorithms.
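
One way to read the TID idea is sketched below. It assumes each frequent subgraph carries the set of IDs of the graphs (transactions) that contain it. A candidate joined from two such subgraphs can only occur in graphs containing both parents, so intersecting the two TID sets bounds its support before any subgraph isomorphism test is run. The function name, the string keys, and the TID values are hypothetical.

```python
def candidate_transactions(tid_a, tid_b, min_support):
    """Return the graphs worth checking for a candidate built from two frequent parents.

    Returns None when the candidate can be pruned without any isomorphism test.
    """
    shared = tid_a & tid_b
    if len(shared) < min_support:
        return None
    return shared


# Hypothetical TID sets of three frequent single-edge subgraphs.
tids = {"C-C": {0, 1, 2, 5}, "C-O": {1, 2, 3}, "C-N": {0, 4}}

# Both parents occur together in graphs 1 and 2: only those need a full check.
keep = candidate_transactions(tids["C-C"], tids["C-O"], min_support=2)

# The parents never co-occur, so the candidate is pruned immediately.
prune = candidate_transactions(tids["C-O"], tids["C-N"], min_support=2)
```

The set intersection is cheap compared with subgraph isomorphism, so even a rough bound like this can remove many expensive tests.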

FSG

– FSG is Apriori-based and therefore uses a level-wise algorithm.

– Faces two challenges:
  • candidate generation: generating subgraph candidates of the next size is more complicated and costly
  • pruning false positives: the subgraph isomorphism test is an NP-complete problem

gSpan

– Uses Depth-First Search (DFS).
– Can be used to find frequent subgraphs one by one, from small to large ones.

– Advantages
  • No candidate generation and no false-positive test
  • Better saving of space through DFS

Pattern growth method

[Figure: a graph dataset of three graphs (A), (B), (C) and its frequent patterns (1), (2), with minimum support 2]

Another three approaches to mining graph based data:

– Inductive Logic Programming approach
– Inductive database approach
– Kernel function based approach

ILP approach

ILP systems construct a predictive model for a given data set by searching a large space of candidate hypotheses.

WARMR – proposed in 1998. A combination of Apriori-like level-wise search and the ILP method, but it has a high computational complexity.

FARMER – proposed in 2001. Runs two orders of magnitude faster than WARMR.

Inductive DB approach

Databases that are capable of handling patterns within data. Quite different from typical databases.

Uses an interactive querying process to mine the data in these databases.

MolFea is an effort related to this area. It mines linear fragments in chemical compounds and has better computational efficiency.

It also performs a complete search of the paths in graph data.

Kernel Function based approach

The “kernel” function basically defines the similarity between two graphs.

The paper covers two efforts based on this approach, which classify graphs into binary classes using an SVM (Support Vector Machine).
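
To make the notion of a graph kernel concrete, the sketch below uses a deliberately simple vertex-label histogram kernel: the inner product of the label-count vectors of two graphs. This is an illustrative stand-in and not one of the two kernels covered in the paper. The molecule lists are hypothetical and record heavy-atom labels only.

```python
from collections import Counter


def label_histogram_kernel(labels_a, labels_b):
    """Inner product of the vertex-label count vectors of two graphs."""
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Labels missing from one graph contribute zero to the sum.
    return sum(counts_a[label] * counts_b[label] for label in counts_a)


# Graphs reduced to their vertex labels (heavy atoms only, for illustration).
ethanol = ["C", "C", "O"]
methanol = ["C", "O"]
ammonia = ["N"]

k_similar = label_histogram_kernel(ethanol, methanol)
k_different = label_histogram_kernel(ethanol, ammonia)
```

Because the value is an ordinary inner product of feature vectors, it is a valid kernel and could be passed to any SVM implementation that accepts a precomputed kernel matrix. It ignores edges entirely, which is the main thing more elaborate graph kernels improve on.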
