Graph-based Consensus Maximization among Multiple Supervised and Unsupervised Models
Jing Gao1, Feng Liang2, Wei Fan3, Yizhou Sun1, Jiawei Han1
1 CS, UIUC; 2 STAT, UIUC
3 IBM TJ Watson
NIPS’2009
2/46
Outline
• An overview of ensemble methods
  – Introduction
  – Supervised ensemble techniques
  – Unsupervised ensemble techniques
• Consensus among supervised and unsupervised models
  – Problem and motivation
  – Methodology
  – Interpretations
  – Experiments
3/46
Ensemble
[Diagram: Data feeds into model 1, model 2, ..., model k, which are combined into an ensemble model.]
Applications: classification, clustering, collaborative filtering, anomaly detection, ...
Combine multiple models into one!
4/46
Stories of Success
• Million-dollar prize
  – Improve the baseline movie recommendation approach of Netflix by 10% in accuracy
  – The top submissions all combine several teams and algorithms as an ensemble
• Data mining competitions
  – Classification problems
  – Winning teams employ an ensemble of classifiers
5/46
Why Ensemble Works? (1)
• Intuition
  – Combining diverse, independent opinions in human decision-making acts as a protective mechanism (e.g. a stock portfolio)
• Uncorrelated error reduction
  – Suppose we have 5 completely independent classifiers for majority voting
  – If each has 70% accuracy, the majority vote is correct with probability
    10(.7^3)(.3^2) + 5(.7^4)(.3) + (.7^5) = 83.7%
    (the terms count the ways 3, 4, or 5 of the 5 classifiers can be correct)
  – With 101 such classifiers, majority vote accuracy reaches 99.9%
from T. Holloway, Introduction to Ensemble Learning, 2007.
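To make the arithmetic concrete, here is a small Python check of these numbers (our own sketch, not from the talk); it sums the binomial probability that more than half of n independent classifiers are correct:

```python
from math import comb

def majority_vote_accuracy(n_classifiers, p):
    """Probability that a majority of n independent classifiers,
    each correct with probability p, votes for the right answer."""
    need = n_classifiers // 2 + 1  # smallest majority
    return sum(comb(n_classifiers, k) * p**k * (1 - p)**(n_classifiers - k)
               for k in range(need, n_classifiers + 1))

print(majority_vote_accuracy(5, 0.7))    # ~0.837
print(majority_vote_accuracy(101, 0.7))  # ~0.999
```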
6/46
Why Ensemble Works? (2)
[Figure: models 1-6 each approximate part of some unknown distribution; together, the ensemble gives the global picture!]
from W. Fan, Random Decision Tree.
7/46
Why Ensemble Works? (3)
• Overcome limitations of a single hypothesis
  – The target function may not be implementable by any individual classifier, but may be approximated by model averaging
[Figure: decision boundaries of a single decision tree vs. model averaging.]
from I. Davidson et al., When Efficient Model Averaging Out-Performs Boosting and Bagging, ECML 2006.
8/46
Research Focus
• Base models
  – Improve diversity!
• Combination scheme
  – Consensus (unsupervised)
  – Learn to combine (supervised)
• Tasks
  – Classification (supervised ensemble)
  – Clustering (unsupervised ensemble)
9/46
Outline
• An overview of ensemble methods
  – Introduction
  – Supervised ensemble techniques
  – Unsupervised ensemble techniques
• Consensus among supervised and unsupervised models
  – Problem and motivation
  – Methodology
  – Interpretations
  – Experiments
10/46
Bagging
• Bootstrap
  – Sampling with replacement
  – Each bootstrap sample contains around 63.2% of the original records
• Ensemble
  – Train a classifier on each bootstrap sample
  – Use majority voting to determine the class label of the ensemble classifier (see the sketch below)
• Discussions
  – Incorporates diversity through the bootstrap samples
  – Unstable base classifiers, such as decision trees, benefit most
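As a concrete illustration of the two steps above, here is a minimal from-scratch sketch in Python (our own illustrative code; it assumes integer class labels and uses scikit-learn decision trees as the unstable base classifier):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_predict(X_train, y_train, X_test, n_estimators=25, seed=0):
    """Train one tree per bootstrap sample, then majority-vote."""
    rng = np.random.default_rng(seed)
    n = len(X_train)
    votes = []
    for _ in range(n_estimators):
        idx = rng.integers(0, n, size=n)   # sampling with replacement
        tree = DecisionTreeClassifier().fit(X_train[idx], y_train[idx])
        votes.append(tree.predict(X_test))
    votes = np.stack(votes)                # (n_estimators, n_test)
    # majority vote per test record (labels assumed to be 0, 1, 2, ...)
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```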
11/46
Boosting
• Principles
  – Boost a set of weak learners into a strong learner
  – Make records that are currently misclassified more important
• AdaBoost
  – Initially, set uniform weights on all the records
  – At each round:
    • Create a bootstrap sample based on the weights
    • Train a classifier on the sample and apply it to the original training set
    • Records that are wrongly classified have their weights increased
    • Records that are classified correctly have their weights decreased
    • If the error rate is higher than 50%, start the round over
  – The final prediction is a weighted vote of all the classifiers, with each weight reflecting that classifier's accuracy on the weighted training set (a sketch of this loop follows)
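The reweighting loop can be sketched as follows (illustrative code, not the talk's; it assumes binary labels in {-1, +1} and uses the standard AdaBoost weight alpha = 0.5 ln((1 - err) / err)):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, n_rounds=20, seed=0):
    """Sketch of AdaBoost with resampling; y must be in {-1, +1}."""
    rng = np.random.default_rng(seed)
    n = len(X)
    w = np.full(n, 1.0 / n)                     # uniform initial weights
    learners, alphas = [], []
    for _ in range(n_rounds):
        idx = rng.choice(n, size=n, p=w)        # bootstrap sample by weight
        stump = DecisionTreeClassifier(max_depth=1).fit(X[idx], y[idx])
        pred = stump.predict(X)                 # apply to original training set
        err = w[pred != y].sum()
        if err >= 0.5:                          # worse than chance: start over
            continue
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))
        w *= np.exp(-alpha * y * pred)          # up-weight mistakes, down-weight hits
        w /= w.sum()
        learners.append(stump); alphas.append(alpha)
    def predict(X_new):
        scores = sum(a * m.predict(X_new) for a, m in zip(alphas, learners))
        return np.sign(scores)                  # weighted vote
    return predict
```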
12/46
[Figure: classifications (colors) and record weights (sizes) after 1, 3, and 20 iterations of AdaBoost.]
from J. Elder, From Trees to Forests and Rule Sets: A Unified Overview of Ensemble Methods, 2007.
13/46
Random Forest
• Algorithm
  – For each tree:
    • Choose a training set by sampling N times with replacement from the original training set
    • At each node, randomly choose m < M features and calculate the best split
    • Trees are fully grown and not pruned
  – Use majority voting among all the trees
• Discussions
  – Bagging + random features: improves diversity (a minimal sketch follows)
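Off the shelf, the same recipe looks like this (a minimal scikit-learn sketch on synthetic data; the parameter values are arbitrary choices of ours):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
# max_features plays the role of m < M: each split considers only a random
# subset of features; bootstrap=True resamples the training set per tree.
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                            bootstrap=True, random_state=0).fit(X, y)
print(rf.score(X, y))   # predictions are a majority vote over the 100 trees
```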
14/46
Random Decision Tree
[Figure: a random decision tree over features B1: {0,1}, B2: {0,1}, B3: continuous. Each internal node tests a randomly chosen feature (B1 == 0?, B2 == 0?, B3 < 0.3?, B3 < 0.6?); thresholds on the continuous feature B3 are picked at random (0.3, 0.6).]
from W. Fan, Random Decision Tree.
15/46
Outline
• An overview of ensemble methods
  – Introduction
  – Supervised ensemble techniques
  – Unsupervised ensemble techniques
• Consensus among supervised and unsupervised models
  – Problem and motivation
  – Methodology
  – Interpretations
  – Experiments
16/46
Clustering Ensemble
• Goal
  – Combine "weak" clusterings into a better one
from A. Topchy et al., Clustering Ensembles: Models of Consensus and Weak Partitions, PAMI 2005.
17/46
Methods
• Base models
  – Bootstrap samples, different subsets of features
  – Different clustering algorithms
  – Random number of clusters
• Combination
  – Find the correspondence between the labels in the partitions and fuse the clusters with the same labels
  – Treat each output as a categorical variable and cluster in the new feature space
18/46
Meta Clustering (1)
• Cluster-based
  – Regard each cluster from a base model as a record
  – Define similarity as the percentage of shared common examples
  – Conduct meta-clustering and assign each record to its associated meta-cluster
• Instance-based
  – Compute the similarity between two records as the percentage of models that put them into the same cluster (a sketch follows the figure)
[Figure: objects v1-v6 and the clusters c1-c10 produced by the base models.]
from A. Gionis et al., Clustering Aggregation, TKDD 2007.
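The instance-based similarity can be computed directly; a small sketch (our own code) that returns, for every pair of records, the fraction of base models placing them in the same cluster:

```python
import numpy as np

def coassociation(labelings):
    """labelings: (k, n) array, one row of cluster ids per base model.
    Returns the (n, n) fraction of models putting each pair together."""
    labelings = np.asarray(labelings)
    k, n = labelings.shape
    sim = np.zeros((n, n))
    for row in labelings:
        sim += (row[:, None] == row[None, :])   # 1 where co-clustered
    return sim / k

# e.g. two clusterings of five records:
print(coassociation([[0, 0, 1, 1, 2],
                     [0, 1, 1, 1, 0]]))
```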
19/46
Meta Clustering (2)
• Probability-based
  – Assume the outputs come from a mixture of models
  – Use the EM algorithm to learn the model
• Spectral clustering
  – Formulate the problem as a bipartite graph between objects and clusters
  – Use spectral clustering to partition the graph (a sketch follows)
[Figure: bipartite graph between objects v1-v6 and clusters c1-c10.]
from A. Gionis et al., Clustering Aggregation, TKDD 2007.
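One way to realize the bipartite formulation with off-the-shelf tools (a sketch under our own choices; the cited work uses its own spectral algorithm, and here we substitute scikit-learn's SpectralCoclustering on the object-cluster incidence matrix):

```python
import numpy as np
from sklearn.cluster import SpectralCoclustering

def bipartite_consensus(labelings, n_clusters):
    """Build the 0/1 object-cluster incidence matrix and co-cluster it."""
    labelings = np.asarray(labelings)               # (k, n) cluster ids
    blocks = [np.eye(row.max() + 1)[row] for row in labelings]  # one-hot
    B = np.hstack(blocks)                           # (n, total #clusters)
    model = SpectralCoclustering(n_clusters=n_clusters, random_state=0).fit(B)
    return model.row_labels_                        # consensus object labels

print(bipartite_consensus([[0, 0, 1, 1, 2],
                           [0, 1, 1, 1, 0]], n_clusters=2))
```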
20/46
Outline
• An overview of ensemble methods
  – Introduction
  – Supervised ensemble techniques
  – Unsupervised ensemble techniques
• Consensus among supervised and unsupervised models
  – Problem and motivation
  – Methodology
  – Interpretations
  – Experiments
21/46
Multiple Source Classification
• Image categorization: images, descriptions, notes, comments, albums, tags, ...
• Like? Dislike? (movie recommendation): movie genres, cast, director, plots, ...; users' viewing history, movie ratings, ...
• Research area: publication and co-authorship network, published papers, ...
Model Combination helps!
[Figure: conferences (SIGMOD, VLDB, EDBT, KDD, ICDM, SDM, ICML, AAAI) linked to researchers (Tom, Jim, Lucy, Mike, Jack, Tracy, Cindy, Bob, Mary, Alice) by several models, some supervised and some unsupervised.]
• Some areas share similar keywords
• People may publish in relevant but different areas
• There may be cross-discipline collaborations
23/46
Problem
24/46
Motivations
• Consensus maximization
  – Combine the outputs of multiple supervised and unsupervised models on a set of objects
  – The predicted labels should agree with the base models as much as possible
• Motivations
  – Unsupervised models provide useful constraints for classification tasks
  – Model diversity improves prediction accuracy and robustness
  – Combination at the output level is needed when raw data cannot be shared (privacy) or model formats are incompatible
25/46
Related Work (1)
• Single models
  – Supervised: SVM, logistic regression, ...
  – Unsupervised: K-means, spectral clustering, ...
  – Semi-supervised learning, transductive learning
• Supervised ensemble
  – Require raw data and labels: bagging, boosting, Bayesian model averaging
  – Require labels: mixture of experts, stacked generalization
  – Majority voting works at the output level and does not require labels
26/46
Related Work (2)
• Unsupervised ensemble
  – Find a consensus clustering from multiple partitionings without accessing the features
• Multi-view learning
  – A joint model is learnt from both labeled and unlabeled data from multiple sources
  – It can be regarded as a semi-supervised ensemble requiring access to the raw data
27/46
Related Work (3)
28/46
Outline
• An overview of ensemble methods
  – Introduction
  – Supervised ensemble techniques
  – Unsupervised ensemble techniques
• Consensus among supervised and unsupervised models
  – Problem and motivation
  – Methodology
  – Interpretations
  – Experiments
29/46
A Toy Example
[Figure: seven objects x1-x7. Two classifiers assign labels 1, 2, 3 and two clusterings partition the objects into three clusters each, so every object receives four (possibly conflicting) groupings.]
30/46
Groups-Objects
[Figure: the same toy example, with each model's output decomposed into groups g1-g12 (three groups per model); every object belongs to one group from each model.]
31/46
Bipartite Graph
[Figure: a bipartite graph between group nodes (outputs of models M1, M2, M3, M4, ...) and object nodes; classifier groups are tagged with initial labels [1 0 0], [0 1 0], [0 0 1].]
• Each object i has a conditional probability vector $u_i = [u_{i1}, \ldots, u_{ic}]$ over the c classes; each group j has $q_j = [q_{j1}, \ldots, q_{jc}]$
• Adjacency: $a_{ij} = 1$ if object i is connected to group j, and $a_{ij} = 0$ otherwise
• Initial probability: $y_j = [1\ 0 \ldots 0]$ if group $g_j$ is predicted to be class 1, ..., $y_j = [0 \ldots 0\ 1]$ if class c
32/46
Objective
[Figure: the same bipartite graph; classifier groups carry initial labels [1 0 0], [0 1 0], [0 0 1].]
• Minimize disagreement:
$$\min_{Q,U} \sum_{i=1}^{n} \sum_{j=1}^{v} a_{ij}\,\|u_i - q_j\|^2 \;+\; \alpha \sum_{j=1}^{s} \|q_j - y_j\|^2$$
• First term: an object and a group should have similar conditional probability vectors if the object is connected to the group
• Second term: a group should not deviate much from its initial probability
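In matrix form the disagreement is easy to evaluate; a short numpy sketch (our own, with A the n x v adjacency matrix, U and Q the object and group probability matrices, Y the initial labels for the first s groups, and alpha the trade-off weight):

```python
import numpy as np

def bgcm_objective(A, U, Q, Y, s, alpha):
    """sum_ij a_ij ||u_i - q_j||^2 + alpha * sum_{j<=s} ||q_j - y_j||^2."""
    dist = ((U[:, None, :] - Q[None, :, :]) ** 2).sum(axis=2)  # (n, v) pairwise
    return (A * dist).sum() + alpha * ((Q[:s] - Y[:s]) ** 2).sum()
```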
33/46
Methodology
[Figure: the same bipartite graph.]
• Iterate until convergence:
  – Update the probability of a group:
$$q_j = \frac{\sum_{i=1}^{n} a_{ij} u_i + \alpha y_j}{\sum_{i=1}^{n} a_{ij} + \alpha} \ \text{(classifier groups)}, \qquad q_j = \frac{\sum_{i=1}^{n} a_{ij} u_i}{\sum_{i=1}^{n} a_{ij}} \ \text{(clustering groups)}$$
  – Update the probability of an object:
$$u_i = \frac{\sum_{j=1}^{v} a_{ij} q_j}{\sum_{j=1}^{v} a_{ij}}$$
(A numpy sketch of this loop follows.)
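A minimal numpy sketch of the alternating loop (our own illustrative code; the first s groups are assumed to come from classifiers and carry one-hot rows of Y, while clustering groups carry zero rows):

```python
import numpy as np

def bgcm(A, Y, s, alpha=2.0, n_iter=50):
    """A: (n, v) 0/1 object-group adjacency. Y: (v, c) initial group
    probabilities (one-hot for the first s classifier groups, zero rows
    for clustering groups). Returns the object probabilities U."""
    Q = Y.astype(float).copy()
    k = np.zeros(A.shape[1]); k[:s] = 1.0   # indicator of classifier groups
    for _ in range(n_iter):
        # objects: weighted average of the groups they belong to
        U = (A @ Q) / A.sum(axis=1, keepdims=True)
        # groups: average of member objects, pulled toward y_j if from a classifier
        Q = (A.T @ U + alpha * k[:, None] * Y) / (
             A.sum(axis=0)[:, None] + alpha * k[:, None])
    return U
```

The consensus prediction for object i is then the class z maximizing u_iz.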
34/46
Outline
• An overview of ensemble methods
  – Introduction
  – Supervised ensemble techniques
  – Unsupervised ensemble techniques
• Consensus among supervised and unsupervised models
  – Problem and motivation
  – Methodology
  – Interpretations
  – Experiments
35/46
Constrained Embedding
• The consensus objective
$$\min_{Q,U} \sum_{i=1}^{n} \sum_{j=1}^{v} a_{ij}\,\|u_i - q_j\|^2 + \alpha \sum_{j=1}^{s} \|q_j - y_j\|^2$$
can be read as embedding both objects and groups into the c-dimensional probability space:
$$\min_{Q,U} \sum_{j=1}^{v} \sum_{z=1}^{c} \Big|\, q_{jz} - \frac{\sum_{i=1}^{n} a_{ij} u_{iz}}{\sum_{i=1}^{n} a_{ij}} \,\Big| \qquad \text{s.t. } q_{jz} = 1 \text{ if } g_j\text{'s label is } z$$
where the constraints apply to the groups produced by classification models.
36/46
Ranking on Consensus Structure
[Figure: the same bipartite graph.]
• Stacking the group vectors per class, the iteration can be written as a personalized ranking over the consensus structure:
$$q_{\cdot z} = (D_v + \alpha I)^{-1} A^{T} D_n^{-1} A \, q_{\cdot z} \;+\; \alpha (D_v + \alpha I)^{-1} y_{\cdot z}$$
• $A$: the adjacency matrix; $y_{\cdot z}$: the query; $\alpha (D_v + \alpha I)^{-1}$: personalized damping factors ($D_n$ and $D_v$ are the degree matrices of the object and group nodes; the label term applies to classifier groups)
37/46
Incorporating Labeled Information
[Figure: the same bipartite graph.]
• Objective, with an extra penalty tying the first l objects to their observed label vectors $f_i$:
$$\min_{Q,U} \sum_{i=1}^{n} \sum_{j=1}^{v} a_{ij}\,\|u_i - q_j\|^2 + \alpha \sum_{j=1}^{s} \|q_j - y_j\|^2 + \beta \sum_{i=1}^{l} \|u_i - f_i\|^2$$
• Update the probability of a group (as before):
$$q_j = \frac{\sum_{i=1}^{n} a_{ij} u_i + \alpha y_j}{\sum_{i=1}^{n} a_{ij} + \alpha}, \qquad q_j = \frac{\sum_{i=1}^{n} a_{ij} u_i}{\sum_{i=1}^{n} a_{ij}}$$
• Update the probability of an object:
$$u_i = \frac{\sum_{j=1}^{v} a_{ij} q_j + \beta f_i}{\sum_{j=1}^{v} a_{ij} + \beta} \ \text{(labeled)}, \qquad u_i = \frac{\sum_{j=1}^{v} a_{ij} q_j}{\sum_{j=1}^{v} a_{ij}} \ \text{(unlabeled)}$$
(A sketch of the modified object update follows.)
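Only the object update changes; a sketch of the modified step (our own code, with F holding one-hot observed labels for the first l objects and beta the penalty weight):

```python
import numpy as np

def update_objects_labeled(A, Q, F, l, beta):
    """Object update with observed labels: for labeled i (i < l),
    u_i = (sum_j a_ij q_j + beta * f_i) / (sum_j a_ij + beta)."""
    m = np.zeros(A.shape[0]); m[:l] = 1.0      # indicator of labeled objects
    return (A @ Q + beta * m[:, None] * F) / (
            A.sum(axis=1, keepdims=True) + beta * m[:, None])
```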
38/46
Outline
• An overview of ensemble methods
  – Introduction
  – Supervised ensemble techniques
  – Unsupervised ensemble techniques
• Consensus among supervised and unsupervised models
  – Problem and motivation
  – Methodology
  – Interpretations
  – Experiments
39/46
Experiments-Data Sets
• 20 Newsgroups
  – newsgroup message categorization
  – only text information available
• Cora
  – research paper area categorization
  – paper abstracts and citation information available
• DBLP
  – researchers' area prediction
  – publication and co-authorship network, and publication content
  – conferences' areas are known
40/46
Experiments-Baseline Methods (1)
• Single models
  – 20 Newsgroups: logistic regression, SVM, K-means, min-cut
  – Cora: abstracts, citations (with or without a labeled set)
  – DBLP: publication titles, links (with or without labels from conferences)
• Proposed method
  – BGCM
  – BGCM-L: semi-supervised version combining four models
  – 2-L: combining two models; 3-L: combining three models
41/46
Experiments-Baseline Methods (2)
• Ensemble approaches
  – clustering ensemble on all of the four models: MCLA, HBGF
[Table: positioning of methods]
• Unsupervised learning: K-means, Spectral Clustering, ... (single models); Clustering Ensemble (ensemble at output level)
• Supervised learning: SVM, Logistic Regression, ... (single models); Bagging, Boosting, Bayesian model averaging, ... (ensemble at raw data level); Majority Voting, Mixture of Experts, Stacked Generalization (ensemble at output level)
• Semi-supervised learning: Semi-supervised Learning, Collective Inference (single models); Multi-view Learning (ensemble at raw data level); Consensus Maximization (ensemble at output level)
42/46
Accuracy (1)
43/46
Accuracy (2)
44/46
45/46
Conclusions
• Ensemble
  – Combining independent, diversified models improves accuracy
  – With the information explosion and the many learning packages available, base models are easy to obtain
• Consensus maximization
  – Combines the complementary predictive powers of multiple supervised and unsupervised models
  – Propagates label information between group and object nodes iteratively over a bipartite graph
  – Two interpretations: constrained embedding and ranking on consensus structure
• Applications
  – Multiple-source learning, ranking, truth finding, ...
46/46
Thanks!
• Any questions?
http://www.ews.uiuc.edu/~jinggao3/nips09bgcm.htm