New Models for Relational Classification
Ricardo Silva (Statslab)
Joint work with Wei Chu and Zoubin Ghahramani
The talk
Classification with non-iid data
A source of non-iidness: relational information
A new family of models, and what is new about it
Applications to classification of text documents
The prediction problem
[Diagram, standard setup: features X generate label Y, with a plate over the N training pairs; a new input Xnew has an unknown label Ynew to be predicted]
Prediction with non-iid data
[Diagram: training pairs (X1, Y1), (X2, Y2) and the test pair (Xnew, Ynew), with the labels Y1, Y2 and Ynew directly dependent on each other]
Where does the non-iid information come from?
Relations: links between data points
Webpage A links to Webpage B
Movie A and Movie B are often rented together
Relations as data
“Linked webpages are likely to present similar content”
“Movies that are rented together often have correlated personal ratings”
The vanilla relational domain: time series
Relations: “Yi precedes Yi+k”, k > 0
Dependencies: Markov structure G
[Diagram: chain Y1 → Y2 → Y3 → …]
A model for integrating link data
How to model the dependencies among the class labels?
Movies that are often rented together might have all sorts of common, unmeasured factors
These hidden common causes affect the ratings
Example
[Diagram: MovieFeatures(M1) → Rating(M1) and MovieFeatures(M2) → Rating(M2), with hidden common causes connecting the two ratings: Same genre? Both released in the same year? Same director? Target the same age groups?]
Integrating link data
Of course, many of these common causes will be measured
Many will not
Idea:
Postulate a hidden common cause structure, based on the relations
Define a model that is Markov with respect to this structure
Design an adequate inference algorithm
Example: Political Books database
A network of books about recent US politics sold by the online bookseller Amazon.com (Valdis Krebs, http://www.orgnet.com/)
Relations: frequent co-purchasing of books by the same buyers
Political inclination factors act as the hidden common causes
Political Books relations
[Figure: the co-purchasing network of the Political Books data]
Political Books database
Features: I collected the Amazon.com front page for each of the books
Bag-of-words, tf-idf features, normalized to unity
Task: binary classification, “liberal” or “not-liberal” books
43 liberal books out of 105
Contribution
We will show how to:
start from a classical multiple linear regression model
build a relational variation of it, with a more complex set of independence constraints
generalize it using Gaussian processes
Seemingly unrelated regression (Zellner, 1962)
Y = (Y1, Y2), X = (X1, X2)
Suppose you regress Y1 ~ X1, X2 and X2 turns out to be useless (X2 vanishes)
Analogously for Y2 ~ X1, X2 (X1 vanishes)
Now suppose you regress Y1 ~ X1, X2, Y2
Suddenly every variable is a relevant predictor
[Diagram: the regression of Y1 on X1 and X2, with the useless X2 edge crossed out, contrasted with the structure once Y2 is included as a predictor]
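To make this concrete, here is a minimal simulation sketch of the SUR phenomenon; the coefficients, the 0.8 error correlation, and the sample size are illustrative assumptions, not values from the talk.

```python
# Minimal SUR simulation: X2 looks useless for Y1 until Y2 enters the
# regression, because Y2 carries information about Y1's correlated error.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

x1 = rng.normal(size=n)
x2 = rng.normal(size=n)

# Correlated errors: the shared, unmeasured factor behind Y1 and Y2.
e = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.8], [0.8, 1.0]], size=n)

y1 = 2.0 * x1 + e[:, 0]    # Y1 depends on X1 only
y2 = -1.5 * x2 + e[:, 1]   # Y2 depends on X2 only

# Y1 ~ X1, X2: the X2 coefficient is ~0 (X2 "vanishes").
print(np.linalg.lstsq(np.column_stack([x1, x2]), y1, rcond=None)[0])

# Y1 ~ X1, X2, Y2: now every variable is a relevant predictor
# (roughly [2.0, 1.2, 0.8] here, since E[e1 | e2] = 0.8 * e2).
print(np.linalg.lstsq(np.column_stack([x1, x2, y2]), y1, rcond=None)[0])
```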
Graphically, with latents
[Diagram: X variables Capital(GE) and Capital(Westinghouse) point into Y variables Stock price(GE) and Stock price(Westinghouse); both stock prices are also children of latent variables Industry factor 1, Industry factor 2, …, Industry factor k?]
The Directed Mixed Graph (DMG)
[Diagram: Capital(GE) → Stock price(GE) and Capital(Westinghouse) → Stock price(Westinghouse), with a bi-directed edge Stock price(GE) ↔ Stock price(Westinghouse) replacing the marginalized industry factors]
Richardson (2003), Richardson and Spirtes (2002)
A new family of relational models
Inspired by SUR
Structure: directed mixed graphs
Edges postulated from the given relations
[Diagram: X1 → Y1, …, X5 → Y5, with bi-directed edges among the Yi of related points]
Model for binary classification
Nonparametric probit regression
Zero-mean Gaussian process prior over f(·)
P(yi = 1 | xi) = P(y*(xi) > 0)
y*(xi) = f(xi) + εi,  εi ~ N(0, 1)
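As a sketch of this likelihood in code (the RBF kernel and the toy inputs below are illustrative assumptions, not the talk's choices): since εi ~ N(0, 1), P(yi = 1 | xi, f) = Φ(f(xi)).

```python
# Probit GP classification sketch: f ~ GP(0, K), y*(x) = f(x) + eps,
# eps ~ N(0, 1), so P(y = 1 | x, f) = Phi(f(x)).
import numpy as np
from scipy.stats import norm

def rbf_kernel(X, lengthscale=1.0):
    # Squared-exponential covariance between the rows of X (a toy choice).
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 2))                  # 5 points, 2 features
K = rbf_kernel(X) + 1e-8 * np.eye(5)         # jitter for numerical stability

f = rng.multivariate_normal(np.zeros(5), K)  # one draw from the GP prior
print(norm.cdf(f))                           # P(yi = 1 | xi, f) for each point
```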
Relational dependency model
Make {εi} a dependent multivariate Gaussian
For convenience, decouple it into two error terms:
ε = ε* + ε̄
Dependency model: the decomposition
ε = ε* + ε̄, where ε* and ε̄ are independent from each other
ε*: marginally independent entries; its covariance Σ* is diagonal
ε̄: entries dependent according to the relations; its covariance Σ̄ is not diagonal, with 0s only on unrelated pairs
Σ = Σ* + Σ̄
Dependency model: the decomposition
If K is the original kernel matrix for f(·), define g(xi) = f(xi) + ε̄i, so that
y*(xi) = f(xi) + εi = f(xi) + ε̄i + ε*i = g(xi) + ε*i
The covariance matrix of g(·) is then simply K + Σ̄
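A small numeric sketch of this decomposition; the toy kernel values, the relation graph, and the 0.5/0.3 weights are illustrative assumptions.

```python
# Error decomposition sketch: Sigma = Sigma* + Sigma_bar, with Sigma_bar
# carrying zeroes on unrelated pairs, and Cov[g] = K + Sigma_bar.
import numpy as np

K = np.array([[1.0, 0.3, 0.1],
              [0.3, 1.0, 0.2],
              [0.1, 0.2, 1.0]])            # toy kernel matrix for f(.)

R = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)     # relations: 1-2 and 2-3 only

Sigma_star = 0.5 * np.eye(3)               # diagonal, independent part
Sigma_bar = 0.5 * np.eye(3) + 0.3 * R      # zeroes exactly on unrelated pairs

Sigma = Sigma_star + Sigma_bar             # total error covariance
cov_g = K + Sigma_bar                      # covariance of g(.) = f(.) + eps_bar

assert np.all(np.linalg.eigvalsh(Sigma_bar) > 0)   # valid covariance
print(cov_g)
```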
Approximation
The posterior for f(·), g(·) is a truncated Gaussian, hard to integrate
Approximate the posterior with a Gaussian via Expectation Propagation (Minka, 2001)
The reason for ε* becomes apparent in the EP approximation
Approximation
The likelihood does not factorize over f(·), but it does factorize over g(·):
p(g | X, y) ∝ p(g | X) ∏i p(yi | g(xi))
Approximate each factor p(yi | g(xi)) with a Gaussian
If ε* were 0, yi would be a deterministic function of g(xi)
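One useful consequence of the Gaussian approximation, sketched here with hypothetical numbers: once the approximate posterior over g at a test point is N(μ, s²), the probit factor integrates analytically (this is the standard probit-GP predictive identity, not a formula stated in the talk).

```python
# Predictive probability under the Gaussian approximation: with
# g(x) ~ N(mu, s2) and y = 1 iff g(x) + eps* > 0, eps* ~ N(0, sigma_star2),
# P(y = 1 | x) = Phi( mu / sqrt(s2 + sigma_star2) ).
import math

def predictive_prob(mu, s2, sigma_star2):
    z = mu / math.sqrt(s2 + sigma_star2)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # standard normal CDF

print(predictive_prob(mu=0.7, s2=0.25, sigma_star2=0.5))  # ~0.79
```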
Generalizations
This can be generalized to any number of relations
[Diagram: Y1, …, Y5 connected by bi-directed edges from three different relation types]
ε = ε* + ε̄1 + ε̄2 + ε̄3  (so Σ = Σ* + Σ̄1 + Σ̄2 + Σ̄3)
But how to parameterize Σ̄?
Non-trivial. Desiderata:
Positive definite
Zeroes in the right places
Few parameters, but a broad family
Easy to compute
But how to parameterize ?
“Poking zeroes” on a positive definite matrix doesn’t work:

      Y1   Y2   Y3            Y1   Y2   Y3
  Y1  1.0  0.8  0.8       Y1  1.0  0.8  0
  Y2  0.8  1.0  0.8       Y2  0.8  1.0  0.8
  Y3  0.8  0.8  1.0       Y3  0    0.8  1.0
   positive definite       not positive definite
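A quick numerical check of the example above (toy values only):

```python
# Zeroing a single off-diagonal entry of a positive definite matrix can
# break positive definiteness, as the eigenvalues show.
import numpy as np

A = np.array([[1.0, 0.8, 0.8],
              [0.8, 1.0, 0.8],
              [0.8, 0.8, 1.0]])
B = A.copy()
B[0, 2] = B[2, 0] = 0.0        # "poke" a zero into the (Y1, Y3) entry

print(np.linalg.eigvalsh(A))   # [0.2, 0.2, 2.6]: positive definite
print(np.linalg.eigvalsh(B))   # [-0.13, 1.0, 2.13]: not positive definite
```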
Approach #1
Assume we can find all cliques of the bi-directed subgraph of relations
Create a “factor analysis” model where, for each clique Ci, there is a latent variable Li
The members of each clique are the only children of Li
The set of latents {L} is a set of N(0, 1) variables, and all coefficients in the model are equal to 1
Approach #1
Y1 = L1 + ε1
Y2 = L1 + L2 + ε2
[Diagram: a bi-directed graph over Y1, …, Y4 and the corresponding factor model, with each clique latent L1, L2 pointing into its clique’s members]
Approach #1
In practice, we set the variance of each εi to a small constant (10^-4)
The correlation between any two Ys is then proportional to the number of cliques they belong to together, and inversely proportional to the number of cliques each belongs to individually
Approach #1
Let U be the correlation matrix obtained from the proposed procedure
To define the error covariance, use a single hyperparameter ν ∈ [0, 1]:
Σ* + Σ̄ = (I − νUdiag) + νU, i.e. (1 − ν)I + νU, since U has unit diagonal
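A sketch of the whole Approach #1 construction; the toy relation graph and the value of ν are illustrative assumptions, and networkx's maximal-clique finder stands in for the clique step.

```python
# Approach #1 sketch: one N(0,1) latent per clique of the relation graph,
# unit coefficients, tiny independent variance; normalize to a correlation
# matrix U, then mix: Sigma = (1 - nu) I + nu U.
import numpy as np
import networkx as nx

G = nx.Graph([(0, 1), (1, 2), (0, 2), (2, 3)])  # toy bi-directed relations
n = G.number_of_nodes()

C = 1e-4 * np.eye(n)                            # small independent variance
for clique in nx.find_cliques(G):               # maximal cliques of G
    for i in clique:
        for j in clique:
            C[i, j] += 1.0                      # shared unit-variance latent

d = np.sqrt(np.diag(C))
U = C / np.outer(d, d)                          # correlation matrix

nu = 0.7
Sigma = (1 - nu) * np.eye(n) + nu * U           # Sigma* + Sigma_bar
print(np.linalg.eigvalsh(Sigma))                # positive definite by construction
```

Unrelated pairs never share a clique, so their entries in U, and hence in Σ̄, stay exactly zero.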
Approach #1
Notice: if everybody is connected, the model is exchangeable and simple
[Diagram: a single clique over Y1, …, Y4 with one latent L1; U becomes the all-ones matrix]
Approach #1
Finding all cliques is “impossible” in general; what to do?
Triangulate the graph and then extract the cliques
Can be done in polynomial time
This is a relaxation of the problem, since constraints (zeroes) are thrown away
It can have bad side effects: the “blow-up” effect
Political Books dataset
Political Books dataset: the “blow-up” effect
Approach #2
Don’t look for cliques: create a latent variable for each pair of related variables
Very fast to compute, and the zeroes are respected
[Diagram: the bi-directed graph over Y1, …, Y4 and the corresponding factor model, with one latent Lij for each related pair (Yi, Yj)]
Approach #2
The correlations, however, are given by
Corr(Yi, Yj) = 1 / sqrt(#neigh(i) · #neigh(j))  for related pairs
This penalizes nodes with many neighbors, even if Yi and Yj have many neighbors in common
We call this the “pulverization” effect
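And a matching sketch of Approach #2 (same toy graph and hypothetical values as before), which shows the pulverization effect directly:

```python
# Approach #2 sketch: one N(0,1) latent per related pair, coefficients
# scaled so every Yi has unit variance; the resulting correlation of a
# related pair is 1 / sqrt(#neigh(i) * #neigh(j)).
import numpy as np

edges = [(0, 1), (1, 2), (0, 2), (2, 3)]   # toy bi-directed relations
n = 4
deg = np.zeros(n)
for i, j in edges:
    deg[i] += 1
    deg[j] += 1

U = np.eye(n)
for i, j in edges:
    U[i, j] = U[j, i] = 1.0 / np.sqrt(deg[i] * deg[j])

print(U)
# Node 2 has 3 neighbors, so Corr(Y2, Y3) = 1/sqrt(3*1) ~ 0.58: already
# diluted, even though Y3 has no other neighbor ("pulverization").
print(np.linalg.eigvalsh(U))               # positive definite, zeroes respected
```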
Political Books dataset
Political Books dataset: the “pulverization” effect
WebKB dataset: links of pages at the University of Washington
[Figures: the link graph as processed by Approach #1 and by Approach #2]
Comparison: undirected models
Generative stories
Conditional random fields (Lafferty, McCallum & Pereira, 2001)
Chu et al., 2006; Richardson and Spirtes, 2002
[Diagram: an undirected model with edges among Y1, Y2, Y3 and features X1, X2, X3 attached to the respective labels]
Wei Chu’s model
[Diagram: latent variables Y1*, Y2*, Y3* generating labels Y1, Y2, Y3 from features X1, X2, X3, with relation indicators R12 = 1 and R23 = 1 attached to the related pairs]
The dependency family is equivalent to a pairwise Markov random field
[Diagram: pairwise MRF over Y1, Y2, Y3]
Properties of undirected models
MRFs propagate information among “test” points
[Diagram: a network over Y1, …, Y12 in which information flows between test points]
Properties of DMG models
DMGs propagate information among “training” points
[Diagram: the same network over Y1, …, Y12, with information flowing through training points]
Properties of DMG models
In a DMG, each “test” point will have a whole “training component” in its Markov blanket
[Diagram: the same network over Y1, …, Y12, highlighting the training component inside a test point’s Markov blanket]
Properties of DMG models
It seems acceptable that a typical relational domain will not have an “extrapolation” pattern like the one in typical “structured output” problems, e.g., NLP domains
Ultimately, the choice of model concerns the question: “hidden common causes” or “relational indicators”?
Experiment #1
A subset of the CORA database: 4,285 machine learning papers, 7 classes
Links: citations between papers
“Hidden common cause” interpretation: the particular ML subtopic being treated
Experiment: 7 binary classification problems, one class vs. the others
Criterion: AUC
Experiment #1
Comparisons:
Regular GP
Regular GP + citation adjacency matrix
Wei Chu’s Relational GP (RGP)
Our method, the miXed graph GP (XGP)
A fairly easy task, so we analyze the low-sample regime
Uses 1% of the data (roughly 10 data points for training)
Not that useful for XGP, but more useful for RGP
Experiment #1
Wei Chu’s method gets up to 0.99 in several of those…
Experiment #2
Political Books database: 105 data points, 100 runs using 50% for training
Comparison with standard Gaussian processes, linear kernels
Results:
0.92 for the regular GP
0.98 for XGP (using the pairwise kernel generator)
Hyperparameters optimized by grid search
Difference: 0.06 with std 0.02
Wei Chu’s method does the same…
Experiment #3
WebKB: collections of webpages from 4 different universities
Task: “outlier classification”
Identify which pages are not student, course, project or faculty pages
10% for training data (still not that hard)
However, an order of magnitude more data than in Cora
Experiment #3
As far as I know, XGP easily gets the best results on this task
Future work
Tons of possibilities for parameterizing the output covariance matrix
Incorporating relation attributes too
Heteroscedastic relational noise
Mixtures of relations
New approximation algorithms
Clustering problems
On-line learning
Thank You