New Models for Relational Classification
Ricardo Silva (Statslab)
Joint work with Wei Chu and Zoubin Ghahramani
The talk
Classification with non-iid data
A source of non-iidness: relational information
A new family of models, and what is new about it
Applications to classification of text documents
The prediction problem
[Diagram, standard setup: features X generate label Y, with a plate over the N training pairs; a new input Xnew has an unknown label Ynew to be predicted]
Prediction with non-iid data
[Diagram: training pairs (X1, Y1), (X2, Y2) and the test pair (Xnew, Ynew), with the labels Y1, Y2 and Ynew directly dependent on each other]
Where does the non-iid information come from?
Relations: links between data points
Webpage A links to Webpage B
Movie A and Movie B are often rented together
Relations as data
“Linked webpages are likely to present similar content”
“Movies that are rented together often have correlated personal ratings”
The vanilla relational domain: time series
Relations: “Yi precedes Yi+k”, k > 0
Dependencies: Markov structure G
[Diagram: chain Y1 → Y2 → Y3 → …]
A model for integrating link data
How to model the dependencies among the class labels?
Movies that are often rented together might have all sorts of common, unmeasured factors
These hidden common causes affect the ratings
Example
[Diagram: MovieFeatures(M1) → Rating(M1) and MovieFeatures(M2) → Rating(M2), with hidden common causes connecting the two ratings: Same genre? Both released in the same year? Same director? Target the same age groups?]
Integrating link data
Of course, many of these common causes will be measured
Many will not
Idea:
Postulate a hidden common cause structure, based on the relations
Define a model that is Markov with respect to this structure
Design an adequate inference algorithm
Example: Political Books database
A network of books about recent US politics sold by the online bookseller Amazon.com (Valdis Krebs, http://www.orgnet.com/)
Relations: frequent co-purchasing of books by the same buyers
Political inclination factors act as the hidden common causes
Political Books relations
[Figure: the co-purchasing network of the Political Books data]
Political Books database
Features: I collected the Amazon.com front page for each of the books
Bag-of-words, tf-idf features, normalized to unity
Task: binary classification, “liberal” or “not-liberal” books
43 liberal books out of 105
Contribution
We will show how to:
start from a classical multiple linear regression model
build a relational variation of it, with a more complex set of independence constraints
generalize it using Gaussian processes
Seemingly unrelated regression (Zellner, 1962)
Y = (Y1, Y2), X = (X1, X2)
Suppose you regress Y1 ~ X1, X2 and X2 turns out to be useless (X2 vanishes)
Analogously for Y2 ~ X1, X2 (X1 vanishes)
Now suppose you regress Y1 ~ X1, X2, Y2
Suddenly every variable is a relevant predictor
[Diagram: the regression of Y1 on X1 and X2, with the useless X2 edge crossed out, contrasted with the structure once Y2 is included as a predictor]
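To make this concrete, here is a minimal simulation sketch of the SUR phenomenon; the coefficients, the 0.8 error correlation, and the sample size are illustrative assumptions, not values from the talk.

```python
# Minimal SUR simulation: X2 looks useless for Y1 until Y2 enters the
# regression, because Y2 carries information about Y1's correlated error.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

x1 = rng.normal(size=n)
x2 = rng.normal(size=n)

# Correlated errors: the shared, unmeasured factor behind Y1 and Y2.
e = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.8], [0.8, 1.0]], size=n)

y1 = 2.0 * x1 + e[:, 0]    # Y1 depends on X1 only
y2 = -1.5 * x2 + e[:, 1]   # Y2 depends on X2 only

# Y1 ~ X1, X2: the X2 coefficient is ~0 (X2 "vanishes").
print(np.linalg.lstsq(np.column_stack([x1, x2]), y1, rcond=None)[0])

# Y1 ~ X1, X2, Y2: now every variable is a relevant predictor
# (roughly [2.0, 1.2, 0.8] here, since E[e1 | e2] = 0.8 * e2).
print(np.linalg.lstsq(np.column_stack([x1, x2, y2]), y1, rcond=None)[0])
```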
Graphically, with latents
[Diagram: X variables Capital(GE) and Capital(Westinghouse) point into Y variables Stock price(GE) and Stock price(Westinghouse); both stock prices are also children of latent variables Industry factor 1, Industry factor 2, …, Industry factor k?]
The Directed Mixed Graph (DMG)
[Diagram: Capital(GE) → Stock price(GE) and Capital(Westinghouse) → Stock price(Westinghouse), with a bi-directed edge Stock price(GE) ↔ Stock price(Westinghouse) replacing the marginalized industry factors]
Richardson (2003), Richardson and Spirtes (2002)
A new family of relational models
Inspired by SUR
Structure: directed mixed graphs
Edges postulated from the given relations
[Diagram: X1 → Y1, …, X5 → Y5, with bi-directed edges among the Yi of related points]
Model for binary classification
Nonparametric probit regression
Zero-mean Gaussian process prior over f(·)
P(yi = 1 | xi) = P(y*(xi) > 0)
y*(xi) = f(xi) + εi,  εi ~ N(0, 1)
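As a sketch of this likelihood in code (the RBF kernel and the toy inputs below are illustrative assumptions, not the talk's choices): since εi ~ N(0, 1), P(yi = 1 | xi, f) = Φ(f(xi)).

```python
# Probit GP classification sketch: f ~ GP(0, K), y*(x) = f(x) + eps,
# eps ~ N(0, 1), so P(y = 1 | x, f) = Phi(f(x)).
import numpy as np
from scipy.stats import norm

def rbf_kernel(X, lengthscale=1.0):
    # Squared-exponential covariance between the rows of X (a toy choice).
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 2))                  # 5 points, 2 features
K = rbf_kernel(X) + 1e-8 * np.eye(5)         # jitter for numerical stability

f = rng.multivariate_normal(np.zeros(5), K)  # one draw from the GP prior
print(norm.cdf(f))                           # P(yi = 1 | xi, f) for each point
```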
Relational dependency model
Make {εi} a dependent multivariate Gaussian
For convenience, decouple it into two error terms:
ε = ε* + ε̄
Dependency model: the decomposition
ε = ε* + ε̄, where ε* and ε̄ are independent from each other
ε*: marginally independent entries; its covariance Σ* is diagonal
ε̄: entries dependent according to the relations; its covariance Σ̄ is not diagonal, with 0s only on unrelated pairs
Σ = Σ* + Σ̄
Dependency model: the decomposition
If K is the original kernel matrix for f(·), define g(xi) = f(xi) + ε̄i, so that
y*(xi) = f(xi) + εi = f(xi) + ε̄i + ε*i = g(xi) + ε*i
The covariance matrix of g(·) is then simply K + Σ̄
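A small numeric sketch of this decomposition; the toy kernel values, the relation graph, and the 0.5/0.3 weights are illustrative assumptions.

```python
# Error decomposition sketch: Sigma = Sigma* + Sigma_bar, with Sigma_bar
# carrying zeroes on unrelated pairs, and Cov[g] = K + Sigma_bar.
import numpy as np

K = np.array([[1.0, 0.3, 0.1],
              [0.3, 1.0, 0.2],
              [0.1, 0.2, 1.0]])            # toy kernel matrix for f(.)

R = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)     # relations: 1-2 and 2-3 only

Sigma_star = 0.5 * np.eye(3)               # diagonal, independent part
Sigma_bar = 0.5 * np.eye(3) + 0.3 * R      # zeroes exactly on unrelated pairs

Sigma = Sigma_star + Sigma_bar             # total error covariance
cov_g = K + Sigma_bar                      # covariance of g(.) = f(.) + eps_bar

assert np.all(np.linalg.eigvalsh(Sigma_bar) > 0)   # valid covariance
print(cov_g)
```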
Approximation
The posterior for f(·), g(·) is a truncated Gaussian, hard to integrate
Approximate the posterior with a Gaussian via Expectation Propagation (Minka, 2001)
The reason for ε* becomes apparent in the EP approximation
Approximation
The likelihood does not factorize over f(·), but it does factorize over g(·):
p(g | X, y) ∝ p(g | X) ∏i p(yi | g(xi))
Approximate each factor p(yi | g(xi)) with a Gaussian
If ε* were 0, yi would be a deterministic function of g(xi)
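One useful consequence of the Gaussian approximation, sketched here with hypothetical numbers: once the approximate posterior over g at a test point is N(μ, s²), the probit factor integrates analytically (this is the standard probit-GP predictive identity, not a formula stated in the talk).

```python
# Predictive probability under the Gaussian approximation: with
# g(x) ~ N(mu, s2) and y = 1 iff g(x) + eps* > 0, eps* ~ N(0, sigma_star2),
# P(y = 1 | x) = Phi( mu / sqrt(s2 + sigma_star2) ).
import math

def predictive_prob(mu, s2, sigma_star2):
    z = mu / math.sqrt(s2 + sigma_star2)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # standard normal CDF

print(predictive_prob(mu=0.7, s2=0.25, sigma_star2=0.5))  # ~0.79
```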
Generalizations
This can be generalized to any number of relations
[Diagram: Y1, …, Y5 connected by bi-directed edges from three different relation types]
ε = ε* + ε̄1 + ε̄2 + ε̄3  (so Σ = Σ* + Σ̄1 + Σ̄2 + Σ̄3)
But how to parameterize Σ̄?
Non-trivial. Desiderata:
Positive definite
Zeroes in the right places
Few parameters, but a broad family
Easy to compute
But how to parameterize ?
“Poking zeroes” on a positive definite matrix doesn’t work:

      Y1   Y2   Y3            Y1   Y2   Y3
  Y1  1.0  0.8  0.8       Y1  1.0  0.8  0
  Y2  0.8  1.0  0.8       Y2  0.8  1.0  0.8
  Y3  0.8  0.8  1.0       Y3  0    0.8  1.0
   positive definite       not positive definite
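A quick numerical check of the example above (toy values only):

```python
# Zeroing a single off-diagonal entry of a positive definite matrix can
# break positive definiteness, as the eigenvalues show.
import numpy as np

A = np.array([[1.0, 0.8, 0.8],
              [0.8, 1.0, 0.8],
              [0.8, 0.8, 1.0]])
B = A.copy()
B[0, 2] = B[2, 0] = 0.0        # "poke" a zero into the (Y1, Y3) entry

print(np.linalg.eigvalsh(A))   # [0.2, 0.2, 2.6]: positive definite
print(np.linalg.eigvalsh(B))   # [-0.13, 1.0, 2.13]: not positive definite
```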
Approach #1
Assume we can find all cliques of the bi-directed subgraph of relations
Create a “factor analysis” model where, for each clique Ci, there is a latent variable Li
The members of each clique are the only children of Li
The set of latents {L} is a set of N(0, 1) variables, and all coefficients in the model are equal to 1
Approach #1
Y1 = L1 + ε1
Y2 = L1 + L2 + ε2
[Diagram: a bi-directed graph over Y1, …, Y4 and the corresponding factor model, with each clique latent L1, L2 pointing into its clique’s members]
Approach #1
In practice, we set the variance of each εi to a small constant (10^-4)
The correlation between any two Ys is then proportional to the number of cliques they belong to together, and inversely proportional to the number of cliques each belongs to individually
Approach #1
Let U be the correlation matrix obtained from the proposed procedure
To define the error covariance, use a single hyperparameter ν ∈ [0, 1]:
Σ* + Σ̄ = (I − νUdiag) + νU, i.e. (1 − ν)I + νU, since U has unit diagonal
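A sketch of the whole Approach #1 construction; the toy relation graph and the value of ν are illustrative assumptions, and networkx's maximal-clique finder stands in for the clique step.

```python
# Approach #1 sketch: one N(0,1) latent per clique of the relation graph,
# unit coefficients, tiny independent variance; normalize to a correlation
# matrix U, then mix: Sigma = (1 - nu) I + nu U.
import numpy as np
import networkx as nx

G = nx.Graph([(0, 1), (1, 2), (0, 2), (2, 3)])  # toy bi-directed relations
n = G.number_of_nodes()

C = 1e-4 * np.eye(n)                            # small independent variance
for clique in nx.find_cliques(G):               # maximal cliques of G
    for i in clique:
        for j in clique:
            C[i, j] += 1.0                      # shared unit-variance latent

d = np.sqrt(np.diag(C))
U = C / np.outer(d, d)                          # correlation matrix

nu = 0.7
Sigma = (1 - nu) * np.eye(n) + nu * U           # Sigma* + Sigma_bar
print(np.linalg.eigvalsh(Sigma))                # positive definite by construction
```

Unrelated pairs never share a clique, so their entries in U, and hence in Σ̄, stay exactly zero.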
Approach #1
Notice: if everybody is connected, the model is exchangeable and simple
[Diagram: a single clique over Y1, …, Y4 with one latent L1; U becomes the all-ones matrix]
Approach #1
Finding all cliques is “impossible” in general; what to do?
Triangulate the graph and then extract the cliques
Can be done in polynomial time
This is a relaxation of the problem, since constraints (zeroes) are thrown away
It can have bad side effects: the “blow-up” effect
Political Books dataset
Political Books dataset: the “blow-up” effect
Approach #2
Don’t look for cliques: create a latent variable for each pair of related variables
Very fast to compute, and the zeroes are respected
[Diagram: the bi-directed graph over Y1, …, Y4 and the corresponding factor model, with one latent Lij for each related pair (Yi, Yj)]
Approach #2
The correlations, however, are given by
Corr(Yi, Yj) = 1 / sqrt(#neigh(i) · #neigh(j))  for related pairs
This penalizes nodes with many neighbors, even if Yi and Yj have many neighbors in common
We call this the “pulverization” effect
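And a matching sketch of Approach #2 (same toy graph and hypothetical values as before), which shows the pulverization effect directly:

```python
# Approach #2 sketch: one N(0,1) latent per related pair, coefficients
# scaled so every Yi has unit variance; the resulting correlation of a
# related pair is 1 / sqrt(#neigh(i) * #neigh(j)).
import numpy as np

edges = [(0, 1), (1, 2), (0, 2), (2, 3)]   # toy bi-directed relations
n = 4
deg = np.zeros(n)
for i, j in edges:
    deg[i] += 1
    deg[j] += 1

U = np.eye(n)
for i, j in edges:
    U[i, j] = U[j, i] = 1.0 / np.sqrt(deg[i] * deg[j])

print(U)
# Node 2 has 3 neighbors, so Corr(Y2, Y3) = 1/sqrt(3*1) ~ 0.58: already
# diluted, even though Y3 has no other neighbor ("pulverization").
print(np.linalg.eigvalsh(U))               # positive definite, zeroes respected
```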
Political Books dataset
Political Books dataset: the “pulverization” effect
WebKB dataset: links of pages at the University of Washington
[Figures: the link graph as processed by Approach #1 and by Approach #2]
Comparison: undirected models
Generative stories
Conditional random fields (Lafferty, McCallum & Pereira, 2001)
Chu et al., 2006; Richardson and Spirtes, 2002
[Diagram: an undirected model with edges among Y1, Y2, Y3 and features X1, X2, X3 attached to the respective labels]
Wei Chu’s model
[Diagram: latent variables Y1*, Y2*, Y3* generating labels Y1, Y2, Y3 from features X1, X2, X3, with relation indicators R12 = 1 and R23 = 1 attached to the related pairs]
The dependency family is equivalent to a pairwise Markov random field
[Diagram: pairwise MRF over Y1, Y2, Y3]
Properties of undirected models
MRFs propagate information among “test” points
[Diagram: a network over Y1, …, Y12 in which information flows between test points]
Properties of DMG models
DMGs propagate information among “training” points
[Diagram: the same network over Y1, …, Y12, with information flowing through training points]
Properties of DMG models
In a DMG, each “test” point will have a whole “training component” in its Markov blanket
[Diagram: the same network over Y1, …, Y12, highlighting the training component inside a test point’s Markov blanket]
Properties of DMG models
It seems acceptable that a typical relational domain will not have an “extrapolation” pattern like the one in typical “structured output” problems, e.g., NLP domains
Ultimately, the choice of model concerns the question: “hidden common causes” or “relational indicators”?
Experiment #1
A subset of the CORA database: 4,285 machine learning papers, 7 classes
Links: citations between papers
“Hidden common cause” interpretation: the particular ML subtopic being treated
Experiment: 7 binary classification problems, one class vs. the others
Criterion: AUC
Experiment #1
Comparisons:
Regular GP
Regular GP + citation adjacency matrix
Wei Chu’s Relational GP (RGP)
Our method, the miXed graph GP (XGP)
A fairly easy task, so we analyze the low-sample regime
Uses 1% of the data (roughly 10 data points for training)
Not that useful for XGP, but more useful for RGP
Experiment #1
Wei Chu’s method gets up to 0.99 in several of those…
Experiment #2
Political Books database: 105 data points, 100 runs using 50% for training
Comparison with standard Gaussian processes, linear kernels
Results:
0.92 for the regular GP
0.98 for XGP (using the pairwise kernel generator)
Hyperparameters optimized by grid search
Difference: 0.06 with std 0.02
Wei Chu’s method does the same…
Experiment #3
WebKB: collections of webpages from 4 different universities
Task: “outlier classification”
Identify which pages are not student, course, project or faculty pages
10% for training data (still not that hard)
However, an order of magnitude more data than in Cora
Experiment #3
As far as I know, XGP easily gets the best results on this task
Future work
Tons of possibilities for parameterizing the output covariance matrix
Incorporating relation attributes too
Heteroscedastic relational noise
Mixtures of relations
New approximation algorithms
Clustering problems
On-line learning
Thank You