Post on 19-Dec-2015
![Page 1: 1 Graph Mining Applications to Machine Learning Problems Max Planck Institute for Biological Cybernetics Koji Tsuda](https://reader036.vdocuments.us/reader036/viewer/2022062320/56649d2b5503460f94a00dd7/html5/thumbnails/1.jpg)
1
Graph Mining Applications to Machine Learning Problems
Max Planck Institute for Biological Cybernetics
Koji Tsuda
![Page 2: 1 Graph Mining Applications to Machine Learning Problems Max Planck Institute for Biological Cybernetics Koji Tsuda](https://reader036.vdocuments.us/reader036/viewer/2022062320/56649d2b5503460f94a00dd7/html5/thumbnails/2.jpg)
2
Graphs…
![Page 3: 1 Graph Mining Applications to Machine Learning Problems Max Planck Institute for Biological Cybernetics Koji Tsuda](https://reader036.vdocuments.us/reader036/viewer/2022062320/56649d2b5503460f94a00dd7/html5/thumbnails/3.jpg)
3
Graph Structures in Biology
DNA sequence: A C G C
RNA (figure: bases C-G, C-G, U-A and U U U U)
Compounds (figure: molecular graph of C, O and H atoms)
Texts in literature: "Amitriptyline inhibits adenosine uptake"
![Page 4: 1 Graph Mining Applications to Machine Learning Problems Max Planck Institute for Biological Cybernetics Koji Tsuda](https://reader036.vdocuments.us/reader036/viewer/2022062320/56649d2b5503460f94a00dd7/html5/thumbnails/4.jpg)
4
Substructure Representation
0/1 vector of pattern indicators
Huge dimensionality!
Need graph mining for selecting features
Better than paths (marginalized graph kernels)
[Figure: patterns]
![Page 5: 1 Graph Mining Applications to Machine Learning Problems Max Planck Institute for Biological Cybernetics Koji Tsuda](https://reader036.vdocuments.us/reader036/viewer/2022062320/56649d2b5503460f94a00dd7/html5/thumbnails/5.jpg)
5
Overview
Quick Review on Graph Mining
EM-based clustering algorithm: mixture model with L1 feature selection
Graph Boosting: supervised regression for QSAR analysis; linear programming meets graph mining
![Page 6: 1 Graph Mining Applications to Machine Learning Problems Max Planck Institute for Biological Cybernetics Koji Tsuda](https://reader036.vdocuments.us/reader036/viewer/2022062320/56649d2b5503460f94a00dd7/html5/thumbnails/6.jpg)
6
Quick Review of Graph Mining
![Page 7: 1 Graph Mining Applications to Machine Learning Problems Max Planck Institute for Biological Cybernetics Koji Tsuda](https://reader036.vdocuments.us/reader036/viewer/2022062320/56649d2b5503460f94a00dd7/html5/thumbnails/7.jpg)
7
Graph Mining
Analysis of graph databases
Find all patterns satisfying predetermined conditions
Frequent substructure mining
Combinatorial, exhaustive
Recently developed: AGM (Inokuchi et al., 2000), gSpan (Yan and Han, 2002), Gaston (2004)
![Page 8: 1 Graph Mining Applications to Machine Learning Problems Max Planck Institute for Biological Cybernetics Koji Tsuda](https://reader036.vdocuments.us/reader036/viewer/2022062320/56649d2b5503460f94a00dd7/html5/thumbnails/8.jpg)
8
Graph Mining
Frequent Substructure Mining
Enumerate all patterns that occur in at least m graphs
$x_{ik}$: indicator of pattern k in graph i
Support(k): number of graphs in which pattern k occurs, $\mathrm{Support}(k) = \sum_i x_{ik}$
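As a small illustration of the definition above, support can be read off a 0/1 indicator matrix. The matrix `x` and the helper names `support` and `frequent_patterns` are made up for this sketch; in real graph mining the matrix is never materialized, because the number of patterns is far too large.

```python
def support(x, k):
    """Number of graphs (rows of x) that contain pattern k."""
    return sum(row[k] for row in x)


def frequent_patterns(x, m):
    """Indices of patterns occurring in at least m graphs."""
    n_patterns = len(x[0])
    return [k for k in range(n_patterns) if support(x, k) >= m]


# Toy indicator matrix: 4 graphs (rows) x 3 patterns (columns).
x = [
    [1, 0, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
]
frequent = frequent_patterns(x, 2)
```

With m = 2, patterns 0 and 2 are frequent while pattern 1, which occurs in a single graph, is not.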
![Page 9: 1 Graph Mining Applications to Machine Learning Problems Max Planck Institute for Biological Cybernetics Koji Tsuda](https://reader036.vdocuments.us/reader036/viewer/2022062320/56649d2b5503460f94a00dd7/html5/thumbnails/9.jpg)
9
gSpan (Yan and Han, 2002)
Efficient frequent substructure mining method
DFS code
Efficient detection of isomorphic patterns
We extend gSpan for our work
![Page 10: 1 Graph Mining Applications to Machine Learning Problems Max Planck Institute for Biological Cybernetics Koji Tsuda](https://reader036.vdocuments.us/reader036/viewer/2022062320/56649d2b5503460f94a00dd7/html5/thumbnails/10.jpg)
10
Enumeration on Tree-shaped Search Space
Each node has a pattern
Generate nodes from the root: add an edge at each step
![Page 11: 1 Graph Mining Applications to Machine Learning Problems Max Planck Institute for Biological Cybernetics Koji Tsuda](https://reader036.vdocuments.us/reader036/viewer/2022062320/56649d2b5503460f94a00dd7/html5/thumbnails/11.jpg)
11
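Tree Pruning
Anti-monotonicity: a pattern never occurs in more graphs than any of its subpatterns
If support(g) < m, stop exploring!
[Figure: the descendants of such a node are not generated]
Support(g): number of graphs in which pattern g occurs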
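The pruning rule can be sketched on a simplified search space. The code below is not gSpan: it uses item sets instead of subgraphs (a pattern "occurs" in a transaction when it is a subset), and a fixed item order stands in for the DFS code that avoids generating the same pattern twice. All names are hypothetical.

```python
def support(pattern, database):
    """Number of transactions containing every item of the pattern."""
    return sum(1 for t in database if pattern <= t)


def mine_frequent(database, m):
    """Depth-first enumeration of all patterns with support >= m."""
    items = sorted(set().union(*database))
    found = {}

    def grow(pattern, start):
        for idx in range(start, len(items)):
            child = pattern | {items[idx]}
            s = support(child, database)
            if s < m:
                # Anti-monotonicity: no extension of child can reach m,
                # so the whole subtree below child is skipped.
                continue
            found[frozenset(child)] = s
            grow(child, idx + 1)

    grow(frozenset(), 0)
    return found


database = [frozenset("abc"), frozenset("ab"), frozenset("ac"), frozenset("b")]
patterns = mine_frequent(database, 2)
```

Here `{b, c}` has support 1, so it is discarded and `{a, b, c}` is never reached through it.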
![Page 12: 1 Graph Mining Applications to Machine Learning Problems Max Planck Institute for Biological Cybernetics Koji Tsuda](https://reader036.vdocuments.us/reader036/viewer/2022062320/56649d2b5503460f94a00dd7/html5/thumbnails/12.jpg)
12
Discriminative patterns: Weighted Substructure Mining
w_i > 0: positive class
w_i < 0: negative class
Weighted substructure mining: patterns with large frequency difference
Not anti-monotonic: use a bound
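The exact gain function and bound used in the talk are not recoverable from the transcript. The sketch below assumes the gain of a pattern is the absolute weighted count |Σ w_i x_i| and uses one standard bound of this kind: since an extended pattern can only match a subset of the graphs, its gain cannot exceed the larger of the positive and the negative weight mass currently matched. Item sets again stand in for subgraphs, and all names are hypothetical.

```python
def gain_and_bound(pattern, database, weights):
    """Return (gain of the pattern, upper bound on the gain of any extension)."""
    pos = sum(w for t, w in zip(database, weights) if pattern <= t and w > 0)
    neg = sum(-w for t, w in zip(database, weights) if pattern <= t and w < 0)
    return abs(pos - neg), max(pos, neg)


def mine_weighted(database, weights, tau):
    """All patterns whose gain is at least tau, with bound-based pruning."""
    items = sorted(set().union(*database))
    found = {}

    def grow(pattern, start):
        for idx in range(start, len(items)):
            child = pattern | {items[idx]}
            gain, bound = gain_and_bound(child, database, weights)
            if gain >= tau:
                found[frozenset(child)] = gain
            if bound >= tau:
                # Some extension may still reach tau, so keep exploring.
                grow(child, idx + 1)

    grow(frozenset(), 0)
    return found


database = [frozenset("ab"), frozenset("abc"), frozenset("ac"), frozenset("c")]
weights = [1.0, 1.0, -1.0, -1.0]  # first two graphs positive, last two negative
result = mine_weighted(database, weights, 2.0)
```

In this toy run `{a}` has gain 1 but its extension `{a, b}` has gain 2, which shows why the gain itself cannot be used for pruning and a bound is needed.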
![Page 13: 1 Graph Mining Applications to Machine Learning Problems Max Planck Institute for Biological Cybernetics Koji Tsuda](https://reader036.vdocuments.us/reader036/viewer/2022062320/56649d2b5503460f94a00dd7/html5/thumbnails/13.jpg)
13
Multiclass version
Multiple weight vectors, one per class: each weight takes one value if the graph belongs to the class and another value otherwise
Search for patterns overrepresented in a class
![Page 14: 1 Graph Mining Applications to Machine Learning Problems Max Planck Institute for Biological Cybernetics Koji Tsuda](https://reader036.vdocuments.us/reader036/viewer/2022062320/56649d2b5503460f94a00dd7/html5/thumbnails/14.jpg)
14
EM-based clustering of graphs
Tsuda, K. and T. Kudo: Clustering Graphs by Weighted Substructure Mining. ICML 2006, 953-960, 2006
![Page 15: 1 Graph Mining Applications to Machine Learning Problems Max Planck Institute for Biological Cybernetics Koji Tsuda](https://reader036.vdocuments.us/reader036/viewer/2022062320/56649d2b5503460f94a00dd7/html5/thumbnails/15.jpg)
15
EM-based graph clustering
Motivation
Learning a mixture model in the feature space of patterns
Basis for more complex probabilistic inference
L1 regularization & graph mining
E-step -> Mining -> M-step
![Page 16: 1 Graph Mining Applications to Machine Learning Problems Max Planck Institute for Biological Cybernetics Koji Tsuda](https://reader036.vdocuments.us/reader036/viewer/2022062320/56649d2b5503460f94a00dd7/html5/thumbnails/16.jpg)
16
Probabilistic Model
Binomial mixture, with one component per cluster
Notation: mixing weight for each cluster; feature vector of a graph (entries 0 or 1); parameter vector for each cluster
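The formulas on this slide did not survive transcription. As a rough guide only, a standard way to write a mixture over independent 0/1 features is shown below. The symbols $\pi_\ell$ (mixing weight), $\theta_{\ell k}$ (occurrence probability of pattern k in cluster $\ell$), c (number of clusters) and d (number of patterns) are assumptions of this sketch, and the slide may use a different parameterization.

```latex
p(\mathbf{x}) = \sum_{\ell=1}^{c} \pi_\ell \, p_\ell(\mathbf{x} \mid \boldsymbol{\theta}_\ell),
\qquad
p_\ell(\mathbf{x} \mid \boldsymbol{\theta}_\ell)
  = \prod_{k=1}^{d} \theta_{\ell k}^{\,x_k} \, (1 - \theta_{\ell k})^{1 - x_k}
```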
![Page 17: 1 Graph Mining Applications to Machine Learning Problems Max Planck Institute for Biological Cybernetics Koji Tsuda](https://reader036.vdocuments.us/reader036/viewer/2022062320/56649d2b5503460f94a00dd7/html5/thumbnails/17.jpg)
17
Function to minimize
L1-Regularized log likelihood
Baseline constant: ML parameter estimate using a single binomial distribution
In the solution, most parameters are exactly equal to the baseline constants
![Page 18: 1 Graph Mining Applications to Machine Learning Problems Max Planck Institute for Biological Cybernetics Koji Tsuda](https://reader036.vdocuments.us/reader036/viewer/2022062320/56649d2b5503460f94a00dd7/html5/thumbnails/18.jpg)
18
E-step
Active pattern
E-step computed only with active patterns (computable!)
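One plausible reading of why the E-step needs only active patterns: a pattern whose parameter equals the same baseline in every cluster contributes the same factor to every cluster, and that factor cancels when the posterior is normalized. The sketch below checks this numerically for a mixture of independent 0/1 features. The function name `e_step` and the probability parameterization are assumptions, not taken from the slides.

```python
import math


def e_step(x, mixing, theta):
    """Posterior cluster probabilities for one 0/1 feature vector x.

    mixing[l] is the weight of cluster l; theta[l][k] is the probability
    that pattern k occurs in a graph drawn from cluster l.
    """
    log_post = []
    for pi_l, theta_l in zip(mixing, theta):
        lp = math.log(pi_l)
        for x_k, t in zip(x, theta_l):
            lp += math.log(t if x_k else 1.0 - t)
        log_post.append(lp)
    top = max(log_post)  # subtract the maximum for numerical stability
    unnorm = [math.exp(lp - top) for lp in log_post]
    total = sum(unnorm)
    return [u / total for u in unnorm]


mixing = [0.5, 0.5]
# Pattern 0 is active (differs between clusters); pattern 1 is not.
theta = [[0.9, 0.3], [0.2, 0.3]]
full = e_step([1, 0], mixing, theta)
active_only = e_step([1], mixing, [[0.9], [0.2]])
```

Both calls give the same posterior, so dropping the inactive pattern loses nothing.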
![Page 19: 1 Graph Mining Applications to Machine Learning Problems Max Planck Institute for Biological Cybernetics Koji Tsuda](https://reader036.vdocuments.us/reader036/viewer/2022062320/56649d2b5503460f94a00dd7/html5/thumbnails/19.jpg)
19
M-step
Putative cluster assignment by E-step
Each parameter is solved separately
Use graph mining to find active patterns
Then, solve it only for active patterns
![Page 20: 1 Graph Mining Applications to Machine Learning Problems Max Planck Institute for Biological Cybernetics Koji Tsuda](https://reader036.vdocuments.us/reader036/viewer/2022062320/56649d2b5503460f94a00dd7/html5/thumbnails/20.jpg)
20
Solution
Occurrence probability in a cluster
Overall occurrence probability
![Page 21: 1 Graph Mining Applications to Machine Learning Problems Max Planck Institute for Biological Cybernetics Koji Tsuda](https://reader036.vdocuments.us/reader036/viewer/2022062320/56649d2b5503460f94a00dd7/html5/thumbnails/21.jpg)
21
Important Observation
For active pattern k, the occurrence probability in a graph cluster is significantly different from the average
![Page 22: 1 Graph Mining Applications to Machine Learning Problems Max Planck Institute for Biological Cybernetics Koji Tsuda](https://reader036.vdocuments.us/reader036/viewer/2022062320/56649d2b5503460f94a00dd7/html5/thumbnails/22.jpg)
22
Mining for Active Patterns F
F is rewritten in the following form
Active patterns can be found by graph mining! (multiclass)
![Page 23: 1 Graph Mining Applications to Machine Learning Problems Max Planck Institute for Biological Cybernetics Koji Tsuda](https://reader036.vdocuments.us/reader036/viewer/2022062320/56649d2b5503460f94a00dd7/html5/thumbnails/23.jpg)
23
Experiments: RNA graphs
Stem as a node
Secondary structure by RNAfold
0/1 vertex label (self loop or not)
![Page 24: 1 Graph Mining Applications to Machine Learning Problems Max Planck Institute for Biological Cybernetics Koji Tsuda](https://reader036.vdocuments.us/reader036/viewer/2022062320/56649d2b5503460f94a00dd7/html5/thumbnails/24.jpg)
24
Clustering RNA graphs
Three Rfam families: Intron GP I (Int, 30 graphs); SSU rRNA 5 (SSU, 50 graphs); RNase bact a (RNase, 50 graphs)
Three bipartition problems
Results evaluated by ROC scores (area under the ROC curve)
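For reference, the area under the ROC curve equals the fraction of (positive, negative) pairs that the score ranks correctly. The function below is a minimal sketch of that definition with hypothetical names and toy data. It is quadratic in the number of examples, which is fine for data sets of this size.

```python
def roc_score(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count 1/2)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data: a cluster posterior for five graphs and their true family (1 or 0).
scores = [0.9, 0.8, 0.4, 0.35, 0.1]
labels = [1, 1, 0, 1, 0]
auc = roc_score(scores, labels)
```

Cluster indices are arbitrary, so an evaluation of a bipartition may also need to consider 1 - AUC; the transcript does not say how this was handled.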
![Page 25: 1 Graph Mining Applications to Machine Learning Problems Max Planck Institute for Biological Cybernetics Koji Tsuda](https://reader036.vdocuments.us/reader036/viewer/2022062320/56649d2b5503460f94a00dd7/html5/thumbnails/25.jpg)
25
Examples of RNA Graphs
![Page 26: 1 Graph Mining Applications to Machine Learning Problems Max Planck Institute for Biological Cybernetics Koji Tsuda](https://reader036.vdocuments.us/reader036/viewer/2022062320/56649d2b5503460f94a00dd7/html5/thumbnails/26.jpg)
26
ROC Scores
![Page 27: 1 Graph Mining Applications to Machine Learning Problems Max Planck Institute for Biological Cybernetics Koji Tsuda](https://reader036.vdocuments.us/reader036/viewer/2022062320/56649d2b5503460f94a00dd7/html5/thumbnails/27.jpg)
27
No of Patterns & Time
![Page 28: 1 Graph Mining Applications to Machine Learning Problems Max Planck Institute for Biological Cybernetics Koji Tsuda](https://reader036.vdocuments.us/reader036/viewer/2022062320/56649d2b5503460f94a00dd7/html5/thumbnails/28.jpg)
28
Found Patterns
![Page 29: 1 Graph Mining Applications to Machine Learning Problems Max Planck Institute for Biological Cybernetics Koji Tsuda](https://reader036.vdocuments.us/reader036/viewer/2022062320/56649d2b5503460f94a00dd7/html5/thumbnails/29.jpg)
29
Summary (EM)
Probabilistic clustering based on substructure representation
Inference helped by graph mining
Many possible extensions: Naïve Bayes; graph PCA, LFD, CCA; semi-supervised learning
Applications in biology?
![Page 30: 1 Graph Mining Applications to Machine Learning Problems Max Planck Institute for Biological Cybernetics Koji Tsuda](https://reader036.vdocuments.us/reader036/viewer/2022062320/56649d2b5503460f94a00dd7/html5/thumbnails/30.jpg)
30
Graph Boosting
Saigo, H., T. Kadowaki and K. Tsuda: A Linear Programming Approach for Molecular QSAR analysis. International Workshop on Mining and Learning with Graphs, 85-96, 2006
![Page 31: 1 Graph Mining Applications to Machine Learning Problems Max Planck Institute for Biological Cybernetics Koji Tsuda](https://reader036.vdocuments.us/reader036/viewer/2022062320/56649d2b5503460f94a00dd7/html5/thumbnails/31.jpg)
31
Graph Regression Problem
Known as the QSAR problem in chemical informatics (Quantitative Structure-Activity Relationship analysis)
Given a graph, predict a real value
Typically, features (descriptors) are given
![Page 32: 1 Graph Mining Applications to Machine Learning Problems Max Planck Institute for Biological Cybernetics Koji Tsuda](https://reader036.vdocuments.us/reader036/viewer/2022062320/56649d2b5503460f94a00dd7/html5/thumbnails/32.jpg)
32
QSAR with conventional descriptors
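
| #atoms | #bonds | #rings | … | Activity |
|---|---|---|---|---|
| 22 | 25 | | … | 3 |
| 20 | 21 | | … | 1.2 |
| 23 | 24 | | … | 0.77 |
| 11 | 11 | | … | -3.52 |
| 21 | 22 | | … | -4 |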
![Page 33: 1 Graph Mining Applications to Machine Learning Problems Max Planck Institute for Biological Cybernetics Koji Tsuda](https://reader036.vdocuments.us/reader036/viewer/2022062320/56649d2b5503460f94a00dd7/html5/thumbnails/33.jpg)
33
Motivation of Graph Boosting
Descriptors are not always available
New features by obtaining informative patterns (i.e., subgraphs)
Greedy pattern discovery by Boosting + gSpan
Linear Programming (LP) Boosting for reducing the number of graph mining calls
Accurate prediction & interpretable results
![Page 34: 1 Graph Mining Applications to Machine Learning Problems Max Planck Institute for Biological Cybernetics Koji Tsuda](https://reader036.vdocuments.us/reader036/viewer/2022062320/56649d2b5503460f94a00dd7/html5/thumbnails/34.jpg)
34
Molecule as a labeled graph
[Figure: a molecule drawn as a graph whose vertices are labeled with atom types (C, O)]
![Page 35: 1 Graph Mining Applications to Machine Learning Problems Max Planck Institute for Biological Cybernetics Koji Tsuda](https://reader036.vdocuments.us/reader036/viewer/2022062320/56649d2b5503460f94a00dd7/html5/thumbnails/35.jpg)
35
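QSAR with patterns

| Pattern 1 | Pattern 2 | Pattern 3 | … | Activity |
|---|---|---|---|---|
| 1 | 1 | 1 | … | 3 |
| -1 | 1 | -1 | … | 1.2 |
| -1 | 1 | -1 | … | 0.77 |
| -1 | 1 | -1 | … | -3.52 |
| 1 | 1 | -1 | … | -4 |

[Figure: molecule graphs built from C, O and Cl atoms, the patterns 1, 2, 3, ..., and a new molecule whose activity f(·) is unknown]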
![Page 36: 1 Graph Mining Applications to Machine Learning Problems Max Planck Institute for Biological Cybernetics Koji Tsuda](https://reader036.vdocuments.us/reader036/viewer/2022062320/56649d2b5503460f94a00dd7/html5/thumbnails/36.jpg)
36
Sparse regression in a very high dimensional space
G: all possible patterns (intractably large)
|G|-dimensional feature vector x for a molecule
Linear regression: $f(\mathbf{x}) = \sum_{j=1}^{d} \alpha_j x_j$
Use L1 regularizer to have sparse α
Select a tractable number of patterns
![Page 37: 1 Graph Mining Applications to Machine Learning Problems Max Planck Institute for Biological Cybernetics Koji Tsuda](https://reader036.vdocuments.us/reader036/viewer/2022062320/56649d2b5503460f94a00dd7/html5/thumbnails/37.jpg)
37
Problem formulation
We introduce ε-insensitive loss and L1 regularizer
m: # of training graphs
d = |G|
ξ+, ξ- : slack variables
ε: parameter
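The optimization problem itself is missing from the transcript. A formulation consistent with the symbols listed here, and with the dual constraint shown on a later slide, would be the following. The regularization constant C, the targets $y_i$ and the absence of a bias term are assumptions of this sketch.

```latex
\min_{\boldsymbol{\alpha},\, \boldsymbol{\xi}^{+},\, \boldsymbol{\xi}^{-}}
  \; \|\boldsymbol{\alpha}\|_1 + C \sum_{i=1}^{m} \left( \xi_i^{+} + \xi_i^{-} \right)
\quad \text{s.t.} \quad
\begin{aligned}
  \boldsymbol{\alpha}^{\top} \mathbf{x}_i - y_i &\le \varepsilon + \xi_i^{+}, \\
  y_i - \boldsymbol{\alpha}^{\top} \mathbf{x}_i &\le \varepsilon + \xi_i^{-}, \\
  \xi_i^{+},\ \xi_i^{-} &\ge 0, \qquad i = 1, \dots, m .
\end{aligned}
```

Under this assumed form, each weight $\alpha_j$ in the primal gives rise to one constraint on the dual variables u, which is why the dual has as many constraints as there are patterns.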
![Page 38: 1 Graph Mining Applications to Machine Learning Problems Max Planck Institute for Biological Cybernetics Koji Tsuda](https://reader036.vdocuments.us/reader036/viewer/2022062320/56649d2b5503460f94a00dd7/html5/thumbnails/38.jpg)
38
Dual LP
Primal: huge number of weight variables
Dual: huge number of constraints
LP1-Dual
![Page 39: 1 Graph Mining Applications to Machine Learning Problems Max Planck Institute for Biological Cybernetics Koji Tsuda](https://reader036.vdocuments.us/reader036/viewer/2022062320/56649d2b5503460f94a00dd7/html5/thumbnails/39.jpg)
39
Column Generation Algorithm for LP Boost (Demiriz et al., 2002)
Start from the dual with no constraints
Add the most violated constraint each time
Guaranteed to converge
[Figure: constraint matrix, with the used part highlighted]
![Page 40: 1 Graph Mining Applications to Machine Learning Problems Max Planck Institute for Biological Cybernetics Koji Tsuda](https://reader036.vdocuments.us/reader036/viewer/2022062320/56649d2b5503460f94a00dd7/html5/thumbnails/40.jpg)
40
Finding the most violated constraint
Constraint for a pattern (shown again): $-1 \le \sum_{i=1}^{m} u_i x_{ij} \le 1$
Finding the most violated one: $j^* = \arg\max_j \left| \sum_{i=1}^{m} u_i x_{ij} \right|$
Searched by weighted substructure mining
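To make the search concrete, here is a brute-force version over an explicit feature matrix. This is only an illustration with hypothetical names: in the talk the maximization runs over all possible subgraph patterns and is carried out by weighted substructure mining, not by scanning columns.

```python
def most_violated(x, u):
    """Return (j, score) maximizing |sum_i u[i] * x[i][j]| over columns j."""
    n_patterns = len(x[0])
    best_j, best_score = None, -1.0
    for j in range(n_patterns):
        score = abs(sum(u_i * row[j] for u_i, row in zip(u, x)))
        if score > best_score:
            best_j, best_score = j, score
    return best_j, best_score


# 4 training graphs x 3 patterns, entries +1 (present) or -1 (absent).
x = [
    [1, 1, -1],
    [-1, 1, -1],
    [-1, 1, 1],
    [1, 1, -1],
]
u = [0.5, -0.25, -0.5, 0.25]
j_star, score = most_violated(x, u)
violated = score > 1.0  # the constraint requires the score to stay within 1
```

If the best score is at most 1, every constraint is satisfied and column generation can stop.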
![Page 41: 1 Graph Mining Applications to Machine Learning Problems Max Planck Institute for Biological Cybernetics Koji Tsuda](https://reader036.vdocuments.us/reader036/viewer/2022062320/56649d2b5503460f94a00dd7/html5/thumbnails/41.jpg)
41
Algorithm Overview
Iteration:
Find a new pattern by graph mining with weight u
If all constraints are satisfied, break
Add a new constraint
Update u by LP1-Dual
Return:
Convert dual solution to obtain primal solution α
![Page 42: 1 Graph Mining Applications to Machine Learning Problems Max Planck Institute for Biological Cybernetics Koji Tsuda](https://reader036.vdocuments.us/reader036/viewer/2022062320/56649d2b5503460f94a00dd7/html5/thumbnails/42.jpg)
42
Speed-up by adding multiple patterns (multiple pricing)
So far, the most violated pattern is chosen: $j^* = \arg\max_j \left| \sum_{i=1}^{m} u_i x_{ij} \right|$
Mining and inclusion of top k patterns at each iteration
Reduction of the number of mining calls
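Multiple pricing can be sketched the same way: instead of one column, return up to k violated columns per call. As before, this scans an explicit matrix for illustration only, and the function name and data are hypothetical.

```python
import heapq


def top_k_violated(x, u, k):
    """Up to k columns with |sum_i u[i] * x[i][j]| > 1, most violated first."""
    n_patterns = len(x[0])
    scored = []
    for j in range(n_patterns):
        score = abs(sum(u_i * row[j] for u_i, row in zip(u, x)))
        if score > 1.0:
            scored.append((score, j))
    return [j for score, j in heapq.nlargest(k, scored)]


# 4 training graphs x 4 patterns, entries +1 (present) or -1 (absent).
x = [
    [1, 1, -1, 1],
    [-1, 1, -1, -1],
    [-1, 1, 1, 1],
    [1, 1, -1, 1],
]
u = [0.5, -0.25, -0.5, 0.75]
chosen = top_k_violated(x, u, 2)
```

Adding several constraints per mining call trades a slightly larger LP for fewer passes over the graph database.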
![Page 43: 1 Graph Mining Applications to Machine Learning Problems Max Planck Institute for Biological Cybernetics Koji Tsuda](https://reader036.vdocuments.us/reader036/viewer/2022062320/56649d2b5503460f94a00dd7/html5/thumbnails/43.jpg)
43
Speed-up by multiple pricing
![Page 44: 1 Graph Mining Applications to Machine Learning Problems Max Planck Institute for Biological Cybernetics Koji Tsuda](https://reader036.vdocuments.us/reader036/viewer/2022062320/56649d2b5503460f94a00dd7/html5/thumbnails/44.jpg)
44
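Clearly negative data

| #atoms | #bonds | #rings | … | Activity |
|---|---|---|---|---|
| 22 | 25 | | … | 3 |
| 20 | 21 | | … | 1.2 |
| 23 | 24 | | … | 0.77 |
| 11 | 11 | | … | -3.52 |
| 21 | 22 | | … | -4 |
| 22 | 20 | | … | -10000 |
| 23 | 19 | | … | -10000 |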
![Page 45: 1 Graph Mining Applications to Machine Learning Problems Max Planck Institute for Biological Cybernetics Koji Tsuda](https://reader036.vdocuments.us/reader036/viewer/2022062320/56649d2b5503460f94a00dd7/html5/thumbnails/45.jpg)
45
Inclusion of clearly negative data
LP2-Primal
l: # of clearly negative data
z: predetermined upper bound
ξ’ : slack variable
![Page 46: 1 Graph Mining Applications to Machine Learning Problems Max Planck Institute for Biological Cybernetics Koji Tsuda](https://reader036.vdocuments.us/reader036/viewer/2022062320/56649d2b5503460f94a00dd7/html5/thumbnails/46.jpg)
46
Experiments
Data from the Endocrine Disruptors Knowledge Base
59 compounds labeled by a real number and 61 compounds labeled by a large negative number
Label (target) is the log-transformed relative proliferative potency (log(RPP)), normalized between -1 and 1
Comparison with: marginalized graph kernel + ridge regression; marginalized graph kernel + kNN regression
![Page 47: 1 Graph Mining Applications to Machine Learning Problems Max Planck Institute for Biological Cybernetics Koji Tsuda](https://reader036.vdocuments.us/reader036/viewer/2022062320/56649d2b5503460f94a00dd7/html5/thumbnails/47.jpg)
47
Results with or without clearly negative data
LP2
LP1
![Page 48: 1 Graph Mining Applications to Machine Learning Problems Max Planck Institute for Biological Cybernetics Koji Tsuda](https://reader036.vdocuments.us/reader036/viewer/2022062320/56649d2b5503460f94a00dd7/html5/thumbnails/48.jpg)
48
Extracted patterns
Interpretable compared with implicitly expressed features by Marginalized Graph Kernel
![Page 49: 1 Graph Mining Applications to Machine Learning Problems Max Planck Institute for Biological Cybernetics Koji Tsuda](https://reader036.vdocuments.us/reader036/viewer/2022062320/56649d2b5503460f94a00dd7/html5/thumbnails/49.jpg)
49
Summary (Graph Boosting)
Graph Boosting simultaneously generates patterns and learns their weights
Finite convergence by column generation
Potentially interpretable by chemists
Flexible constraints and speed-up by LP
![Page 50: 1 Graph Mining Applications to Machine Learning Problems Max Planck Institute for Biological Cybernetics Koji Tsuda](https://reader036.vdocuments.us/reader036/viewer/2022062320/56649d2b5503460f94a00dd7/html5/thumbnails/50.jpg)
50
Concluding Remarks
Using graph mining as a part of machine learning algorithms
Weights are essential
Please include weights when you implement your item-set/tree/graph mining algorithms
Make it available on the web! Then ML researchers can use it