graph mining applications to machine learning problems
Post on 30-Dec-2015
1
Graph Mining Applications to Machine Learning Problems
Max Planck Institute for Biological Cybernetics
Koji Tsuda
2
Graphs…
3
Graph Structures in Biology
DNA Sequence
RNA
Texts in literature: Amitriptyline inhibits adenosine uptake
Compounds
[Figure: example graphs for a DNA sequence (A C G C), an RNA structure, a sentence, and a chemical compound]
4
Substructure Representation
0/1 vector of pattern indicators
Huge dimensionality!
Need graph mining for selecting features
Better than paths (marginalized graph kernels)
patterns
5
Overview
Quick Review of Graph Mining
EM-based Clustering Algorithm
  Mixture model with L1 feature selection
Graph Boosting
  Supervised regression for QSAR analysis
  Linear programming meets graph mining
6
Quick Review of Graph Mining
7
Graph Mining
Analysis of graph databases
  Find all patterns satisfying predetermined conditions
  Frequent substructure mining
Combinatorial, exhaustive
Recently developed
  AGM (Inokuchi et al., 2000), gSpan (Yan and Han, 2002), Gaston (2004)
8
Graph Mining
Frequent substructure mining
  Enumerate all patterns that occur in at least m graphs
x_ik: indicator of pattern k in graph i
Support(k): number of graphs in which pattern k occurs
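As a minimal sketch of the support definition, assume the pattern indicators are already available as a small 0/1 matrix (rows are graphs, columns are patterns). The toy matrix and the function names are hypothetical; a real miner such as gSpan never materializes this matrix for all patterns.

```python
# Hypothetical toy data: rows are graphs, columns are patterns (1 = pattern occurs).
X = [
    [1, 0, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 0],
]


def support(X, k):
    """Number of graphs whose indicator for pattern k is 1."""
    return sum(row[k] for row in X)


def frequent_patterns(X, m):
    """Indices of patterns that occur in at least m graphs."""
    n_patterns = len(X[0])
    return [k for k in range(n_patterns) if support(X, k) >= m]


supports = [support(X, k) for k in range(len(X[0]))]
frequent = frequent_patterns(X, 2)
```

With m = 2, only the first two patterns of the toy matrix count as frequent.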
9
gSpan (Yan and Han, 2002)
Efficient frequent substructure mining method
DFS code
  Efficient detection of isomorphic patterns
We extend gSpan for our work
10
Enumeration on Tree-shaped Search Space
Each node has a pattern
Generate nodes from the root: add an edge at each step
11
Tree Pruning
Anti-monotonicity: a supergraph of g never has larger support than g
If support(g) < m, stop exploring!
Not generated
Support(g): number of graphs in which pattern g occurs
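The pruning rule can be sketched on a simpler pattern class. The example below uses item sets as a stand-in for graph patterns (an assumption made to keep the code short): adding one item plays the role of adding one edge, and a subtree is abandoned as soon as its support falls below m. It is not gSpan, and all names are hypothetical.

```python
def mine_frequent(transactions, m):
    """Depth-first enumeration of item sets with support >= m.

    Item sets stand in for graph patterns: extending a set by one item
    plays the role of adding one edge to a pattern.
    """
    items = sorted(set().union(*transactions))
    found = {}

    def support(pattern):
        # Number of transactions that contain every item of the pattern.
        return sum(1 for t in transactions if pattern <= t)

    def grow(pattern, start):
        # Only items after position `start` are added, so each set is
        # generated exactly once (a fixed canonical order).
        for idx in range(start, len(items)):
            child = pattern | {items[idx]}
            s = support(child)
            if s < m:
                # Anti-monotonicity: no superset of child can reach
                # support m, so the whole subtree is skipped.
                continue
            found[child] = s
            grow(child, idx + 1)

    grow(frozenset(), 0)
    return found


transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b"}]
result = mine_frequent(transactions, 2)
```

In this toy run the set {b, c} has support 1, so neither it nor any of its supersets is generated.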
12
Discriminative patterns: Weighted Substructure Mining
w_i > 0: positive class
w_i < 0: negative class
Weighted substructure mining
  Patterns with large frequency difference
  Not anti-monotonic: use a bound
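A minimal sketch of why a bound restores pruning, assuming the score of a pattern is the absolute weighted sum of its 0/1 occurrences (the exact criterion used in the talk is not shown in this transcript, so this form is an assumption). Any superpattern occurs in a subset of the graphs where the current pattern occurs, so its score cannot exceed the larger of the positive-weight mass and the negative-weight mass on those graphs.

```python
def gain(weights, occurs):
    """Absolute weighted frequency of a pattern (assumed scoring function)."""
    return abs(sum(w for w, x in zip(weights, occurs) if x))


def gain_bound(weights, occurs):
    """Upper bound on the gain of any superpattern.

    A superpattern occurs only in a subset of the graphs where the current
    pattern occurs, so its weighted sum lies between minus the negative mass
    and plus the positive mass collected on those graphs.
    """
    pos = sum(w for w, x in zip(weights, occurs) if x and w > 0)
    neg = sum(-w for w, x in zip(weights, occurs) if x and w < 0)
    return max(pos, neg)


# Hypothetical weights: two positive-class graphs, two negative-class graphs.
weights = [0.5, 0.25, -0.5, -0.25]
occurs = [1, 1, 1, 0]        # current pattern
occurs_super = [1, 1, 0, 0]  # one possible superpattern

g = gain(weights, occurs)
b = gain_bound(weights, occurs)
g_super = gain(weights, occurs_super)
```

Here the superpattern scores higher than its parent, which shows the gain is not anti-monotonic, yet it still respects the bound. A search can therefore prune a subtree whenever the bound falls below the current threshold.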
13
Multiclass version
Multiple weight vectors
  (graph belongs to the class)
  (otherwise)
Search for patterns overrepresented in a class
14
EM-based clustering of graphs
Tsuda, K. and T. Kudo: Clustering Graphs by Weighted Substructure Mining. ICML 2006, 953-960, 2006
15
EM-based graph clustering
Motivation
  Learning a mixture model in the feature space of patterns
  Basis for more complex probabilistic inference
L1 regularization & graph mining
E-step -> Mining -> M-step
16
Probabilistic Model
Binomial mixture
Each component
Mixing weight for each cluster
Feature vector of a graph (0 or 1)
Parameter vector for each cluster
17
Function to minimize
L1-regularized negative log likelihood
Baseline constant
  ML parameter estimate using a single binomial distribution
In the solution, most parameters are exactly equal to the baseline constants
18
E-step
Active pattern
E-step computed only with active patterns (computable!)
19
M-step
Putative cluster assignment by E-step
Each parameter is solved separately
Use graph mining to find active patterns
Then, solve it only for active patterns
20
Solution
Occurrence probability in a cluster
Overall occurrence probability
21
Important Observation
For active pattern k, the occurrence probability in a graph cluster is significantly different from the average
22
Mining for Active Patterns F
F is rewritten in the following form
Active patterns can be found by graph mining! (multiclass)
23
Experiments: RNA graphs
Stem as a node
Secondary structure by RNAfold
0/1 vertex label (self loop or not)
24
Clustering RNA graphs
Three Rfam families
  Intron GP I (Int, 30 graphs)
  SSU rRNA 5 (SSU, 50 graphs)
  RNase bact a (RNase, 50 graphs)
Three bipartition problems
  Results evaluated by ROC scores (area under the ROC curve)
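For reference, the ROC score of a bipartition can be computed as the fraction of (positive, negative) pairs that are ranked correctly, with ties counted as one half. The sketch below assumes each graph has a real-valued score (for example a cluster posterior) and a 0/1 family label; the numbers are made up for illustration and are not results from the talk.

```python
def roc_score(scores, labels):
    """Area under the ROC curve via pairwise comparisons.

    Counts the (positive, negative) pairs in which the positive example
    has the higher score; ties contribute one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical scores for four graphs, two from each family.
scores = [0.9, 0.8, 0.3, 0.1]
labels = [1, 0, 1, 0]
auc = roc_score(scores, labels)
```

A score of 1.0 means the two families are separated perfectly, and 0.5 corresponds to a random ranking.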
25
Examples of RNA Graphs
26
ROC Scores
27
No of Patterns & Time
28
Found Patterns
29
Summary (EM)
Probabilistic clustering based on substructure representation
Inference helped by graph mining
Many possible extensions
  Naïve Bayes
  Graph PCA, LFD, CCA
  Semi-supervised learning
Applications in Biology?
30
Graph Boosting
Saigo, H., T. Kadowaki and K. Tsuda: A Linear Programming Approach for Molecular QSAR analysis. International Workshop on Mining and Learning with Graphs, 85-96, 2006
31
Graph Regression Problem
Known as the QSAR problem in chemical informatics
  Quantitative Structure-Activity Relationship analysis
Given a graph, predict a real value
  Typically, features (descriptors) are given
32
QSAR with conventional descriptors
#atoms  #bonds  #rings  …  Activity
22      25                 3
20      21                 1.2
23      24                 0.77
11      11                 -3.52
21      22                 -4
33
Motivation of Graph Boosting
Descriptors are not always available
New features by obtaining informative patterns (i.e., subgraphs)
Greedy pattern discovery by Boosting + gSpan
Linear Programming (LP) Boosting for reducing the number of graph mining calls
Accurate prediction & interpretable results
34
Molecule as a labeled graph
[Figure: a molecule drawn as a labeled graph with C and O atoms as vertex labels]
35
QSAR with patterns
Pattern 1  Pattern 2  Pattern 3  …  Activity
 1          1          1            3
-1          1         -1            1.2
-1          1         -1            0.77
-1          1         -1            -3.52
 1          1         -1            -4
[Figure: the patterns are subgraphs built from C, O and Cl atoms; the activity f(?) of a new molecule is to be predicted]
36
Sparse regression in a very high dimensional space
G: all possible patterns (intractably large)
|G|-dimensional feature vector x for a molecule
Linear regression
  f(x) = \sum_{j=1}^{d} \alpha_j x_j
Use L1 regularizer to have sparse α
  Select a tractable number of patterns
37
Problem formulation
We introduce ε-insensitive loss and L1 regularizer
m: # of training graphs
d = |G|
ξ+, ξ- : slack variables
ε: parameter
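The formula itself did not survive in this transcript. One standard way to write an ε-insensitive regression with an L1 regularizer as a linear program, using the symbols listed above, is sketched below. The activity label y_i and the trade-off constant C are assumptions introduced here for illustration, and the exact form on the original slide may differ.

```latex
\begin{aligned}
\min_{\alpha,\,\xi^{+},\,\xi^{-}} \quad
  & \|\alpha\|_1 + C \sum_{i=1}^{m} \left( \xi_i^{+} + \xi_i^{-} \right) \\
\text{s.t.} \quad
  & \sum_{j=1}^{d} \alpha_j x_{ij} - y_i \le \varepsilon + \xi_i^{+}, \\
  & y_i - \sum_{j=1}^{d} \alpha_j x_{ij} \le \varepsilon + \xi_i^{-}, \\
  & \xi_i^{+} \ge 0, \quad \xi_i^{-} \ge 0, \qquad i = 1, \dots, m .
\end{aligned}
```

Errors smaller than ε cost nothing, larger errors are absorbed by the slack variables, and the L1 term drives most of the d = |G| weights to zero.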
38
Dual LP
Primal: huge number of weight variables
Dual: huge number of constraints
LP1-Dual
39
Column Generation Algorithm for LP Boost (Demiriz et al., 2002)
Start from the dual with no constraints
Add the most violated constraint each time
Guaranteed to converge
[Figure: constraint matrix, with the used part highlighted]
40
Finding the most violated constraint
Constraint for a pattern (shown again)
  -1 \le \sum_{i=1}^{m} u_i x_{ij} \le 1
Finding the most violated one
  j^* = \arg\max_j \left| \sum_{i=1}^{m} u_i x_{ij} \right|
Searched by weighted substructure mining
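To make the selection rule concrete, the sketch below scans an explicit ±1 feature matrix and returns the pattern whose weighted sum over the training graphs has the largest absolute value. The matrix, the dual weights u and the function name are hypothetical. In the method described here the columns are never enumerated like this; the search is done by weighted substructure mining over the pattern tree.

```python
def most_violated(u, X):
    """Return (j, total) for the pattern with the largest |sum_i u_i * x_ij|.

    X is a list of rows (one per training graph) with entries in {-1, +1};
    u holds one dual weight per training graph.
    """
    n_patterns = len(X[0])
    totals = [
        sum(u_i * row[j] for u_i, row in zip(u, X))
        for j in range(n_patterns)
    ]
    j_star = max(range(n_patterns), key=lambda j: abs(totals[j]))
    return j_star, totals[j_star]


# Hypothetical data: 3 training graphs, 3 candidate patterns.
X = [
    [1, 1, -1],
    [-1, 1, -1],
    [1, -1, 1],
]
u = [0.5, -0.5, 0.25]

j_star, total = most_violated(u, X)
violated = abs(total) > 1.0
```

If the largest absolute sum does not exceed 1, every constraint is satisfied and column generation can stop.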
41
Algorithm Overview
Iteration
  Find a new pattern by graph mining with weight u
  If all constraints are satisfied, break
  Add a new constraint
  Update u by LP1-Dual
Return
  Convert dual solution to obtain primal solution α
42
Speed-up by adding multiple patterns (multiple pricing)
So far, only the most violated pattern is chosen
  j^* = \arg\max_j \left| \sum_{i=1}^{m} u_i x_{ij} \right|
Mining and inclusion of the top k patterns at each iteration
  Reduction of the number of mining calls
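Multiple pricing can be sketched in the same explicit-matrix setting: instead of one pattern, return up to k patterns whose weighted sums violate the dual constraint, ordered by the size of the violation. The data and the function name are hypothetical, and a real implementation would collect these patterns during a single mining pass rather than by scanning columns.

```python
def top_k_violated(u, X, k):
    """Indices of up to k patterns with |sum_i u_i * x_ij| > 1, largest first."""
    n_patterns = len(X[0])
    totals = [
        sum(u_i * row[j] for u_i, row in zip(u, X))
        for j in range(n_patterns)
    ]
    violated = [j for j in range(n_patterns) if abs(totals[j]) > 1.0]
    violated.sort(key=lambda j: abs(totals[j]), reverse=True)
    return violated[:k]


# Hypothetical data: 3 training graphs, 4 candidate patterns.
X = [
    [1, 1, -1, 1],
    [-1, 1, 1, 1],
    [1, -1, 1, 1],
]
u = [1.0, -0.5, 0.25]

top_two = top_k_violated(u, X, 2)
```

Adding several violated constraints per iteration trades a slightly larger LP for fewer calls to the expensive mining step.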
43
Speed-up by multiple pricing
44
Clearly negative data
#atoms  #bonds  #rings  …  Activity
22      25                 3
20      21                 1.2
23      24                 0.77
11      11                 -3.52
21      22                 -4
22      20                 -10000
23      19                 -10000
45
Inclusion of clearly negative data
LP2-Primal
l: # of clearly negative data
z: predetermined upper bound
ξ’ : slack variable
46
Experiments
Data from the Endocrine Disruptors Knowledge Base
  59 compounds labeled by a real number and 61 compounds labeled by a large negative number
  Label (target) is a log-transformed relative proliferative potency (log(RPP)) normalized between –1 and 1
Comparison with
  Marginalized Graph Kernel + ridge regression
  Marginalized Graph Kernel + kNN regression
47
Results with or without clearly negative data
LP2
LP1
48
Extracted patterns
More interpretable than the implicitly expressed features of the Marginalized Graph Kernel
49
Summary (Graph Boosting)
Graph Boosting simultaneously generates patterns and learns their weights
Finite convergence by column generation
Potentially interpretable by chemists
Flexible constraints and speed-up by LP
50
Concluding Remarks
Using graph mining as a part of machine learning algorithms
  Weights are essential
  Please include weights when you implement your item-set/tree/graph mining algorithms
Make it available on the web!
  Then ML researchers can use it