Graph Mining Applications to Machine Learning Problems

Max Planck Institute for Biological Cybernetics

Koji Tsuda

Graphs…

Graph Structures in Biology

DNA sequence: A C G C

Texts in literature: "Amitriptyline inhibits adenosine uptake"

Compounds

U U U U

Substructure Representation

0/1 vector of pattern indicators
Huge dimensionality!
Need graph mining for selecting features
Better than paths (marginalized graph kernels)

patterns

Overview

Quick Review on Graph Mining
EM-based clustering algorithm: mixture model with L1 feature selection
Graph Boosting: supervised regression for QSAR analysis; linear programming meets graph mining

Quick Review of Graph Mining

Graph Mining

Analysis of graph databases
Find all patterns satisfying predetermined conditions
Frequent substructure mining
Combinatorial, exhaustive
Recently developed: AGM (Inokuchi et al., 2000), gSpan (Yan and Han, 2002), Gaston (2004)

Graph Mining

Frequent Substructure Mining
Enumerate all patterns that occur in at least m graphs

$x_{ik}$: indicator of pattern k in graph i

Support(k): number of graphs in which pattern k occurs, $\mathrm{Support}(k) = \sum_i x_{ik}$
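The support definition above can be illustrated on a small indicator matrix. This is a minimal sketch: the matrix `X`, the threshold `m = 2`, and the function names are made up for illustration, and the patterns are assumed to be enumerated already (a real miner never materializes the full matrix).

```python
def support(X):
    """Support of each pattern: column sums of the 0/1 indicator matrix X[i][k]."""
    return [sum(column) for column in zip(*X)]


def frequent_patterns(X, m):
    """Indices of the patterns that occur in at least m graphs."""
    return [k for k, s in enumerate(support(X)) if s >= m]


# Rows are graphs, columns are patterns (hypothetical data).
X = [
    [1, 0, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
]

supports = support(X)
frequent = frequent_patterns(X, 2)
print(supports, frequent)
```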

gSpan (Yan and Han, 2002)

Efficient frequent substructure mining method
DFS code: efficient detection of isomorphic patterns

We extend gSpan for our work

Enumeration on Tree-shaped Search Space

Each node has a pattern
Generate nodes from the root: add an edge at each step

Tree Pruning

Anti-monotonicity: a pattern never occurs in more graphs than any of its subpatterns
If support(g) < m, stop exploring! The descendants of g are not generated

Support(g): number of graphs in which pattern g occurs
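The pruning rule can be sketched with a depth-first search. To keep the example short, item sets stand in for subgraph patterns and "add an item" stands in for "add an edge"; this is an assumption of the sketch, not gSpan itself, which additionally needs DFS codes to avoid generating isomorphic graphs twice.

```python
def mine_frequent(transactions, m):
    """Enumerate all item sets contained in at least m transactions."""
    items = sorted(set().union(*transactions))
    found = {}

    def extend(pattern, occurrences, start):
        for idx in range(start, len(items)):
            item = items[idx]
            # Occurrences of the child are a subset of the parent's occurrences.
            occ = [t for t in occurrences if item in transactions[t]]
            if len(occ) < m:
                continue  # anti-monotonicity: no extension of this child can be frequent
            child = pattern + (item,)
            found[child] = len(occ)
            extend(child, occ, idx + 1)

    extend((), list(range(len(transactions))), 0)
    return found


db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
result = mine_frequent(db, 2)
print(result)
```

With `m = 2`, the pattern `("a", "b", "c")` occurs only once, so it is pruned and nothing below it is explored.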

Discriminative Patterns: Weighted Substructure Mining

w_i > 0: positive class
w_i < 0: negative class
Weighted substructure mining: find patterns with a large frequency difference between the classes
Not anti-monotonic: use a bound
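A minimal sketch of weighted mining with a pruning bound follows. It assumes 0/1 indicators, a gain of |sum_i w_i x_ik|, and item sets as a stand-in for subgraphs; the threshold `tau` and all names are hypothetical. The bound used here is a common choice for this setting, not necessarily the one on the slide: since the occurrences of any extension are a subset of the current occurrences, its gain cannot exceed the larger of the positive weight mass and the negative weight mass of the current occurrences.

```python
def mine_weighted(transactions, weights, tau):
    """Find item sets whose gain |sum_i w_i * x_ik| is at least tau."""
    items = sorted(set().union(*transactions))
    found = {}

    def extend(pattern, occurrences, start):
        for idx in range(start, len(items)):
            item = items[idx]
            occ = [t for t in occurrences if item in transactions[t]]
            pos = sum(weights[t] for t in occ if weights[t] > 0)
            neg = -sum(weights[t] for t in occ if weights[t] < 0)
            if max(pos, neg) < tau:
                continue  # bound: no extension can reach tau
            child = pattern + (item,)
            gain = abs(pos - neg)
            if gain >= tau:
                found[child] = gain
            # The gain is not anti-monotonic, so keep exploring even if gain < tau.
            extend(child, occ, idx + 1)

    extend((), list(range(len(transactions))), 0)
    return found


db = [{"a", "b"}, {"a", "b", "c"}, {"a", "c"}, {"c"}]
w = [1.0, 1.0, -1.0, -1.0]  # two positive graphs, two negative graphs
result = mine_weighted(db, w, 2.0)
print(result)
```

In this toy data `("a",)` has gain 1 and is not reported, yet its extension `("a", "b")` has gain 2, which is why pruning on the gain itself would be incorrect.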

Multiclass version

Multiple weight vectors, one per class; each weight takes one value if the graph belongs to the class and another otherwise

Search for patterns overrepresented in a class

EM-based clustering of graphs

Tsuda, K. and T. Kudo: Clustering Graphs by Weighted Substructure Mining. ICML 2006, 953-960, 2006

EM-based graph clustering

Motivation
Learning a mixture model in the feature space of patterns
Basis for more complex probabilistic inference

L1 regularization & graph mining
E-step -> Mining -> M-step

Probabilistic Model

Binomial mixture

Each component is described by:
Mixing weight for the cluster
Feature vector of a graph (0 or 1 per pattern)
Parameter vector for the cluster

Function to minimize

L1-regularized log likelihood

Baseline constant: ML parameter estimate using a single binomial distribution

In the solution, most parameters are exactly equal to the baseline constants

E-step

Active pattern: a pattern whose parameter differs from its baseline constant in at least one cluster

The E-step is computed only with active patterns (computable!)
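Why the E-step needs only the active patterns can be checked numerically. In this minimal sketch, `mix`, `theta`, and `responsibilities` are assumed names, and each component is taken to be a product of independent 0/1 (Bernoulli) factors. A pattern whose parameter is the same in every cluster contributes the same factor to every cluster, so it cancels when the posterior is normalized.

```python
import math


def responsibilities(x, mix, theta, patterns):
    """Posterior cluster probabilities for one 0/1 vector x,
    using only the pattern indices listed in `patterns`."""
    log_post = []
    for c, pi_c in enumerate(mix):
        lp = math.log(pi_c)
        for k in patterns:
            p = theta[c][k]
            lp += math.log(p) if x[k] else math.log(1.0 - p)
        log_post.append(lp)
    top = max(log_post)  # subtract the maximum for numerical stability
    unnorm = [math.exp(lp - top) for lp in log_post]
    total = sum(unnorm)
    return [u / total for u in unnorm]


mix = [0.5, 0.5]
# Patterns 0 and 1 are active; patterns 2 and 3 sit at the same
# (baseline) value in both clusters. All numbers are hypothetical.
theta = [[0.9, 0.2, 0.3, 0.6],
         [0.1, 0.7, 0.3, 0.6]]
x = [1, 0, 1, 0]

full = responsibilities(x, mix, theta, range(4))
active_only = responsibilities(x, mix, theta, [0, 1])
print(full, active_only)
```

Both calls return the same posterior, so the intractably many inactive patterns never have to be enumerated.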

M-step

Putative cluster assignment by E-step

Each parameter is solved separately

Use graph mining to find active patterns
Then, solve it only for active patterns

Solution

Occurrence probability in a cluster

Overall occurrence probability

Important Observation

For active pattern k, the occurrence probability in a graph cluster is significantly different from the average

Mining for Active Patterns F

F is rewritten in the following form

Active patterns can be found by graph mining! (multiclass)

Experiments: RNA graphs

Stem as a node
Secondary structure by RNAfold
0/1 vertex label (self loop or not)

Clustering RNA graphs

Three Rfam families
Intron GP I (Int, 30 graphs)
SSU rRNA 5 (SSU, 50 graphs)
RNase bact a (RNase, 50 graphs)

Three bipartition problems
Results evaluated by ROC scores (area under the ROC curve)
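For reference, the ROC score of a bipartition can be computed as the fraction of (positive, negative) pairs that are ranked correctly. This is a minimal sketch; the scores (for example, posterior probabilities of one cluster) and labels below are made up, and `roc_score` is a hypothetical name.

```python
def roc_score(scores, labels):
    """Area under the ROC curve by pairwise comparison; ties count as 1/2."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


scores = [0.9, 0.8, 0.4, 0.3]
labels = [1, 0, 1, 0]
auc = roc_score(scores, labels)
print(auc)
```

A score of 1.0 means the two families are perfectly separated, and 0.5 corresponds to a random ordering.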

Examples of RNA Graphs

ROC Scores

No of Patterns & Time

Found Patterns

Summary (EM)

Probabilistic clustering based on substructure representation
Inference helped by graph mining
Many possible extensions: Naïve Bayes; graph PCA, LFD, CCA; semi-supervised learning

Applications in Biology?

Graph Boosting

Saigo, H., T. Kadowaki and K. Tsuda: A Linear Programming Approach for Molecular QSAR analysis. International Workshop on Mining and Learning with Graphs, 85-96, 2006

Graph Regression Problem

Known as the QSAR problem in chemical informatics: Quantitative Structure-Activity Relationship analysis

Given a graph, predict a real value
Typically, features (descriptors) are

QSAR with conventional descriptors

#atoms   #bonds   #rings   …   Activity
22       25                    3
20       21                    1.2
23       24                    0.77
11       11                    -3.52
21       22                    -4

Motivation of Graph Boosting

Descriptors are not always available
New features by obtaining informative patterns (i.e., subgraphs)
Greedy pattern discovery by Boosting + gSpan
Linear Programming (LP) Boosting for reducing the number of graph mining calls
Accurate prediction & interpretable results

Molecule as a labeled graph

QSAR with patterns

Pattern 1   Pattern 2   Pattern 3   ...   Activity
1           1           1                 3
-1          1           -1                1.2
-1          1           -1                0.77
-1          1           -1                -3.52
1           1           -1                -4

Sparse regression in a very high dimensional space

G: all possible patterns (intractably large)
|G|-dimensional feature vector x for a molecule
Linear regression: $f(x) = \sum_{j=1}^{|G|} \alpha_j x_j$

Use an L1 regularizer to make α sparse
Select a tractable number of patterns

Problem formulation

We introduce ε-insensitive loss and L1 regularizer

m: # of training graphs

d = |G|

ξ+, ξ- : slack variables

ε: parameter
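One standard way to write a linear program with an ε-insensitive loss and an L1 regularizer is sketched below. The trade-off constant $C$ and the target values $y_i$ are assumptions introduced for illustration, and the scaling of the terms may differ from the original formulation.

```latex
\begin{aligned}
\min_{\alpha,\,\xi^{+},\,\xi^{-}} \quad
  & \|\alpha\|_1 + C \sum_{i=1}^{m} \left( \xi_i^{+} + \xi_i^{-} \right) \\
\text{s.t.} \quad
  & \alpha^{\top} x_i - y_i \le \epsilon + \xi_i^{+}, \\
  & y_i - \alpha^{\top} x_i \le \epsilon + \xi_i^{-}, \\
  & \xi_i^{+} \ge 0, \quad \xi_i^{-} \ge 0, \qquad i = 1, \dots, m
\end{aligned}
```

Here $\alpha \in \mathbb{R}^{d}$, so the primal has one weight variable per pattern, which is what makes it intractable to solve directly.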

Dual LP

Primal: huge number of weight variables
Dual: huge number of constraints

LP1-Dual

Column Generation Algorithm for LP Boost (Demiriz et al., 2002)

Start from the dual with no constraints
Add the most violated constraint each time
Guaranteed to converge

[Figure: constraint matrix, with the used part marked]

Finding the most violated constraint

Constraint for a pattern j (shown again): a bound on $\left| \sum_{i=1}^{m} u_i x_{ij} \right|$

Finding the most violated one: $j^{*} = \arg\max_{j} \left| \sum_{i=1}^{m} u_i x_{ij} \right|$

Searched by weighted substructure mining
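The selection of the most violated constraint can be illustrated on an explicit matrix. This is only a sketch: it assumes the violation of pattern j is measured by |sum_i u_i x_ij|, the dual weights `u` are made up, and the matrix reuses the small +1/-1 pattern table shown earlier. In the actual method the maximization runs over all possible subgraphs by weighted substructure mining, not over stored columns.

```python
def most_violated(X, u):
    """Return (j, score) maximizing |sum_i u_i * X[i][j]| over pattern columns j."""
    scores = [abs(sum(ui * xij for ui, xij in zip(u, column)))
              for column in zip(*X)]
    j = max(range(len(scores)), key=scores.__getitem__)
    return j, scores[j]


# Rows are molecules, columns are patterns, entries are +1 / -1.
X = [
    [1, 1, 1],
    [-1, 1, -1],
    [-1, 1, -1],
    [-1, 1, -1],
    [1, 1, -1],
]
u = [0.5, -0.2, 0.1, -0.3, -0.1]  # hypothetical dual weights

j_star, score = most_violated(X, u)
print(j_star, score)
```

The second column is +1 for every molecule, so its score is just |sum_i u_i|, which is close to zero here; the third pattern is selected instead.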

Algorithm Overview

Iteration:
  Find a new pattern by graph mining with weight u
  If all constraints are satisfied, break
  Add a new constraint
  Update u by LP1-Dual

Return:
  Convert the dual solution to obtain the primal solution α

Speed-up by adding multiple patterns (multiple pricing)

So far, the most violated pattern is chosen: $j^{*} = \arg\max_{j} \left| \sum_{i=1}^{m} u_i x_{ij} \right|$

Mining and inclusion of the top k patterns at each iteration
Reduction of the number of mining calls
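Multiple pricing changes only the selection step: instead of a single maximizer, the k highest-scoring patterns are added before the LP is solved again. A minimal sketch with `heapq.nlargest`, again on an explicit matrix with made-up weights (the function name and data are hypothetical):

```python
import heapq


def top_k_patterns(X, u, k):
    """Indices of the k patterns with the largest |sum_i u_i * X[i][j]|."""
    scores = [abs(sum(ui * xij for ui, xij in zip(u, column)))
              for column in zip(*X)]
    return heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)


X = [
    [1, 1, 1],
    [-1, 1, -1],
    [-1, 1, -1],
    [-1, 1, -1],
    [1, 1, -1],
]
u = [0.5, -0.2, 0.1, -0.3, -0.1]  # hypothetical dual weights

chosen = top_k_patterns(X, u, 2)
print(chosen)
```

Each LP then grows by k constraints per round, trading a slightly larger LP for fewer calls to the miner.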

Speed-up by multiple pricing

Clearly negative data

#atoms   #bonds   #rings   …   Activity
22       25                    3
20       21                    1.2
23       24                    0.77
11       11                    -3.52
21       22                    -4
22       20                    -10000
23       19                    -10000

Inclusion of clearly negative data

LP2-Primal

l: # of clearly negative data

z: predetermined upper bound

ξ’ : slack variable
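One plausible reading of these definitions is that each clearly negative compound only needs a prediction below the upper bound z, with a slack for violations. The sketch below is written under that assumption; the constant $C$, the targets $y_i$, and the notation $\bar{x}_i$ for the clearly negative graphs are introduced here for illustration and are not taken from the slides.

```latex
\begin{aligned}
\min_{\alpha,\,\xi^{+},\,\xi^{-},\,\xi'} \quad
  & \|\alpha\|_1 + C \sum_{i=1}^{m} \left( \xi_i^{+} + \xi_i^{-} \right)
    + C \sum_{i=1}^{l} \xi'_i \\
\text{s.t.} \quad
  & \alpha^{\top} x_i - y_i \le \epsilon + \xi_i^{+}, \quad
    y_i - \alpha^{\top} x_i \le \epsilon + \xi_i^{-}, \qquad i = 1, \dots, m, \\
  & \alpha^{\top} \bar{x}_i \le z + \xi'_i, \qquad i = 1, \dots, l, \\
  & \xi_i^{+},\ \xi_i^{-},\ \xi'_i \ge 0
\end{aligned}
```

Under this reading, the large negative labels such as -10000 never enter the loss as regression targets; they only mark which compounds receive the one-sided constraint.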

Experiments

Data from the Endocrine Disruptors Knowledge Base
59 compounds labeled by a real number and 61 compounds labeled by a large negative number

Label (target) is the log-transformed relative proliferative potency (log(RPP)), normalized between -1 and 1

Comparison with:
Marginalized graph kernel + ridge regression
Marginalized graph kernel + kNN regression

Results with or without clearly negative data

Extracted patterns

More interpretable than the implicitly expressed features of the marginalized graph kernel

Summary (Graph Boosting)

Graph Boosting simultaneously generates patterns and learns their weights
Finite convergence by column generation
Potentially interpretable by chemists
Flexible constraints and speed-up by LP

Concluding Remarks

Using graph mining as a part of machine learning algorithms
Weights are essential
Please include weights when you implement your item-set/tree/graph mining algorithms
Make it available on the web! Then ML researchers can use it
