combining labeled and unlabeled data for text categorization with a large number of categories
![Page 1: Combining labeled and unlabeled data for text categorization with a large number of categories](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813294550346895d992a91/html5/thumbnails/1.jpg)
Combining labeled and unlabeled data for text categorization with a large number of categories
Rayid Ghani
KDD Lab Project
![Page 2: Combining labeled and unlabeled data for text categorization with a large number of categories](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813294550346895d992a91/html5/thumbnails/2.jpg)
Supervised Learning with Labeled Data
Labeled data is required in large quantities and can be very expensive to collect.
![Page 3: Combining labeled and unlabeled data for text categorization with a large number of categories](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813294550346895d992a91/html5/thumbnails/3.jpg)
Why use Unlabeled data?
Very cheap in the case of text: web pages, newsgroups, email messages
May not be as useful as labeled data, but is available in enormous quantities
![Page 4: Combining labeled and unlabeled data for text categorization with a large number of categories](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813294550346895d992a91/html5/thumbnails/4.jpg)
Goal
Make learning more efficient and easy by reducing the amount of labeled data required for text classification with a large number of categories
![Page 5: Combining labeled and unlabeled data for text categorization with a large number of categories](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813294550346895d992a91/html5/thumbnails/5.jpg)
• ECOC: very accurate and efficient for text categorization with a large number of classes
• Co-Training: useful for combining labeled and unlabeled data with a small number of classes
![Page 6: Combining labeled and unlabeled data for text categorization with a large number of categories](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813294550346895d992a91/html5/thumbnails/6.jpg)
Related research with unlabeled data
Using EM in a generative model (Nigam et al. 1999)
Transductive SVMs (Joachims 1999)
Co-Training-type algorithms (Blum & Mitchell 1998; Collins & Singer 1999; Nigam & Ghani 2000)
![Page 7: Combining labeled and unlabeled data for text categorization with a large number of categories](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813294550346895d992a91/html5/thumbnails/7.jpg)
What is ECOC?
Solve multiclass problems by decomposing them into multiple binary problems (Dietterich & Bakiri 1995)
Use a learner to learn the binary problems
![Page 8: Combining labeled and unlabeled data for text categorization with a large number of categories](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813294550346895d992a91/html5/thumbnails/8.jpg)
Training ECOC

Each class is assigned a binary codeword; each column defines one binary function f1–f5 to learn:

|   | f1 | f2 | f3 | f4 | f5 |
|---|----|----|----|----|----|
| A | 0  | 0  | 1  | 1  | 0  |
| B | 1  | 0  | 1  | 0  | 0  |
| C | 0  | 1  | 1  | 1  | 0  |
| D | 0  | 1  | 0  | 0  | 1  |

Testing ECOC

Evaluate f1–f5 on a test example X to get a bit vector (here 1 1 1 1 0) and assign the class with the nearest codeword.
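Decoding by nearest codeword can be sketched in a few lines; the codewords below are taken from the matrix on this slide, and the nearest-codeword rule uses Hamming distance as in Dietterich & Bakiri (1995):

```python
# ECOC decoding sketch: each class has a binary codeword; a test
# example's predicted bit vector is matched to the nearest codeword
# by Hamming distance. Codewords follow the matrix on this slide.
codewords = {
    "A": [0, 0, 1, 1, 0],
    "B": [1, 0, 1, 0, 0],
    "C": [0, 1, 1, 1, 0],
    "D": [0, 1, 0, 0, 1],
}

def hamming(u, v):
    """Number of bit positions where u and v differ."""
    return sum(a != b for a, b in zip(u, v))

def decode(bits):
    """Return the class whose codeword is closest to the predicted bits."""
    return min(codewords, key=lambda c: hamming(codewords[c], bits))

# The test example X with predicted bits f1..f5 = 1 1 1 1 0
print(decode([1, 1, 1, 1, 0]))  # -> C (Hamming distance 1)
```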
![Page 9: Combining labeled and unlabeled data for text categorization with a large number of categories](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813294550346895d992a91/html5/thumbnails/9.jpg)
The Co-training algorithm
Loop (while unlabeled documents remain):
- Build classifiers A and B (Naïve Bayes)
- Classify unlabeled documents with A & B
- Add the most confident A predictions and the most confident B predictions to the labeled training examples

[Blum & Mitchell 1998]
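The loop above can be sketched as runnable code. Each document is assumed to be a pair of word lists, one per feature view; `TinyNB` is an illustrative minimal multinomial naïve Bayes written for this sketch, not the implementation used in the paper:

```python
import math
from collections import Counter

# Minimal co-training sketch (after Blum & Mitchell 1998).
# A document is a pair (view_a_words, view_b_words).

class TinyNB:
    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.prior = {c: labels.count(c) / len(labels) for c in self.classes}
        self.counts = {c: Counter() for c in self.classes}
        for words, y in zip(docs, labels):
            self.counts[y].update(words)
        self.vocab = {w for c in self.classes for w in self.counts[c]}
        return self

    def predict(self, words):
        """Return (best_class, confidence) for one document."""
        def logp(c):
            total = sum(self.counts[c].values()) + len(self.vocab)
            return math.log(self.prior[c]) + sum(
                math.log((self.counts[c][w] + 1) / total) for w in words)
        scores = {c: logp(c) for c in self.classes}
        best = max(scores, key=scores.get)
        z = sum(math.exp(s - scores[best]) for s in scores.values())
        return best, 1.0 / z  # normalised posterior of the best class

def co_train(labeled, unlabeled, rounds=5, n_best=1):
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(rounds):
        if not unlabeled:
            break
        # Build one classifier per feature view from the labeled set
        clf = [TinyNB().fit([d[v] for d, _ in labeled],
                            [y for _, y in labeled]) for v in (0, 1)]
        # Each classifier moves its most confident predictions, with
        # their predicted labels, into the labeled set
        for v in (0, 1):
            ranked = sorted(unlabeled,
                            key=lambda d: clf[v].predict(d[v])[1],
                            reverse=True)
            for d in ranked[:n_best]:
                label, _ = clf[v].predict(d[v])
                labeled.append((d, label))
                unlabeled.remove(d)
    return labeled
```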
![Page 10: Combining labeled and unlabeled data for text categorization with a large number of categories](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813294550346895d992a91/html5/thumbnails/10.jpg)
The Co-training Algorithm
Learn from labeled data: Naïve Bayes on A, Naïve Bayes on B
Estimate labels (with each classifier)
Select most confident
Add to labeled data
[Blum & Mitchell, 1998]
![Page 11: Combining labeled and unlabeled data for text categorization with a large number of categories](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813294550346895d992a91/html5/thumbnails/11.jpg)
One Intuition behind co-training
- A and B are redundant: A features are independent of B features
- Co-training is like learning with random classification noise: the most confident A predictions give B randomly drawn examples, mislabeled at A's (small) error rate
![Page 12: Combining labeled and unlabeled data for text categorization with a large number of categories](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813294550346895d992a91/html5/thumbnails/12.jpg)
ECOC + CoTraining = ECoTrain
ECOC decomposes multiclass problems into binary problems
Co-Training works great with binary problems
ECOC + Co-Train = Learn each binary problem in ECOC with Co-Training
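The decomposition step can be sketched directly; the 5-bit codeword matrix is reused from the training slide, while the co-training step for each binary problem is left out (the documents below are placeholder examples):

```python
# ECoTrain sketch: each ECOC bit binarizes the multiclass labels into
# one two-class problem, which would then be learned with co-training.
codewords = {
    "A": [0, 0, 1, 1, 0],
    "B": [1, 0, 1, 0, 0],
    "C": [0, 1, 1, 1, 0],
    "D": [0, 1, 0, 0, 1],
}

def binary_problems(labeled):
    """Yield one binarized copy of the training set per ECOC bit."""
    n_bits = len(next(iter(codewords.values())))
    for bit in range(n_bits):
        yield [(doc, codewords[y][bit]) for doc, y in labeled]

labeled = [("sports document", "A"), ("science document", "C")]
problems = list(binary_problems(labeled))
# problems[1] relabels: A -> 0, C -> 1 (the second bit of each codeword)
```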
![Page 13: Combining labeled and unlabeled data for text categorization with a large number of categories](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813294550346895d992a91/html5/thumbnails/13.jpg)
SPORTS  SCIENCE  ARTS  HEALTH  POLITICS  LAW
![Page 14: Combining labeled and unlabeled data for text categorization with a large number of categories](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813294550346895d992a91/html5/thumbnails/14.jpg)
What happens with sparse data?
Percent decrease in error with training size and length of code:

[Chart: % decrease in error (roughly 30–70%) vs. training size (0–100) for 15-bit, 31-bit, and 63-bit codes]
![Page 15: Combining labeled and unlabeled data for text categorization with a large number of categories](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813294550346895d992a91/html5/thumbnails/15.jpg)
Datasets
- Hoovers-255: collection of 4285 corporate websites; each company is classified into one of 255 categories; baseline 2%
- Jobs-65 (from WhizBang): job postings with two feature sets (title, description); 65 categories; baseline 11%
![Page 16: Combining labeled and unlabeled data for text categorization with a large number of categories](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813294550346895d992a91/html5/thumbnails/16.jpg)
[Chart: percentage of examples per class, ranging from 0 to 12%]
![Page 17: Combining labeled and unlabeled data for text categorization with a large number of categories](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813294550346895d992a91/html5/thumbnails/17.jpg)
Results
| Dataset | Naïve Bayes (no unlabeled data), 10% labeled | Naïve Bayes, 100% labeled | ECOC (no unlabeled data), 10% labeled | ECOC, 100% labeled | EM, 10% labeled | Co-Training, 10% labeled | ECOC + Co-Training, 10% labeled |
|---|---|---|---|---|---|---|---|
| Jobs-65 | 50.1 | 68.2 | 59.3 | 71.2 | 58.2 | 54.1 | 64.5 |
| Hoovers-255 | 15.2 | 32.0 | 24.8 | 36.5 | 9.1 | 10.2 | 27.6 |
![Page 18: Combining labeled and unlabeled data for text categorization with a large number of categories](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813294550346895d992a91/html5/thumbnails/18.jpg)
Results
[Chart: precision vs. recall (both 0–100) for ECOC + CoTrain, Naïve Bayes, and EM]
![Page 19: Combining labeled and unlabeled data for text categorization with a large number of categories](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813294550346895d992a91/html5/thumbnails/19.jpg)
What Next?
- Use an improved version of co-training (gradient descent): less prone to random fluctuations; uses all unlabeled data at every iteration
- Use Co-EM (Nigam & Ghani 2000), a hybrid of EM and Co-Training
![Page 20: Combining labeled and unlabeled data for text categorization with a large number of categories](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813294550346895d992a91/html5/thumbnails/20.jpg)
Summary
![Page 21: Combining labeled and unlabeled data for text categorization with a large number of categories](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813294550346895d992a91/html5/thumbnails/21.jpg)
The Co-training setting
Two views of each example: the hyperlink text pointing to a page, and the page's own text:

- "…My advisor…" → Tom Mitchell: "Fredkin Professor of AI…"
- "…Professor Blum…" → Avrim Blum: "My research interests are…"
- "…My grandson…" → Johnny: "I like horsies!"

Classifier A | Classifier B
![Page 22: Combining labeled and unlabeled data for text categorization with a large number of categories](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813294550346895d992a91/html5/thumbnails/22.jpg)
Learning from Labeled and Unlabeled Data:
Using Feature Splits
- Co-training [Blum & Mitchell 98]
- Meta-bootstrapping [Riloff & Jones 99]
- coBoost [Collins & Singer 99]
- Unsupervised WSD [Yarowsky 95]
Consider this the co-training setting
![Page 23: Combining labeled and unlabeled data for text categorization with a large number of categories](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813294550346895d992a91/html5/thumbnails/23.jpg)
Learning from Labeled and Unlabeled Data:
Extending supervised learning
MaxEnt Discrimination [Jaakkola et al. 99]
Expectation Maximization [Nigam et al. 98]
Transductive SVMs [Joachims 99]
![Page 24: Combining labeled and unlabeled data for text categorization with a large number of categories](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813294550346895d992a91/html5/thumbnails/24.jpg)
Using Unlabeled Data with EM

- Estimate labels of the unlabeled documents (with naïve Bayes)
- Use all documents to build a new naïve Bayes classifier
![Page 25: Combining labeled and unlabeled data for text categorization with a large number of categories](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813294550346895d992a91/html5/thumbnails/25.jpg)
Co-training vs. EM
| | Co-training | EM |
|---|---|---|
| Feature split | Uses it | Ignores it |
| Labeling | Incremental | Iterative |
| Labels | Hard | Probabilistic |
Which differences matter?
![Page 26: Combining labeled and unlabeled data for text categorization with a large number of categories](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813294550346895d992a91/html5/thumbnails/26.jpg)
Hybrids of Co-training and EM
| Labeling | Uses feature split: Yes | No |
|---|---|---|
| Incremental | co-training | self-training |
| Iterative | co-EM | EM |

[Diagrams: each hybrid alternates labeling and learning with naïve Bayes on view A, view B, or A & B together; incremental variants add only the best predictions, iterative variants label everything]
![Page 27: Combining labeled and unlabeled data for text categorization with a large number of categories](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813294550346895d992a91/html5/thumbnails/27.jpg)
Learning from Unlabeled Data using Feature Splits

- coBoost [Collins & Singer 99]
- Meta-bootstrapping [Riloff & Jones 99]
- Unsupervised WSD [Yarowsky 95]
- Co-training [Blum & Mitchell 98]
![Page 28: Combining labeled and unlabeled data for text categorization with a large number of categories](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813294550346895d992a91/html5/thumbnails/28.jpg)
Intuition behind Co-training
- A and B are redundant: A features are independent of B features
- Co-training is like learning with random classification noise: the most confident A predictions give B randomly drawn examples, mislabeled at A's (small) error rate
![Page 29: Combining labeled and unlabeled data for text categorization with a large number of categories](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813294550346895d992a91/html5/thumbnails/29.jpg)
Using Unlabeled Data with EM
- Initially learn from labeled data only
- Estimate labels of the unlabeled documents
- Use all documents to rebuild the naïve Bayes classifier

[Nigam, McCallum, Thrun & Mitchell, 1998]
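The EM loop on this slide can be sketched with a naïve Bayes whose M-step accumulates fractional counts from the estimated (probabilistic) labels. This is an illustrative simplification under Laplace smoothing, not the code from the paper:

```python
import math
from collections import defaultdict

# Semi-supervised EM sketch (after Nigam, McCallum, Thrun & Mitchell 1998):
# initialise naive Bayes from labeled docs, then alternate estimating
# probabilistic labels for unlabeled docs (E-step) and rebuilding the
# classifier from all docs with fractional counts (M-step).

def train(docs_with_weights, classes, vocab):
    """M-step: naive Bayes parameters from (words, {class: weight}) pairs."""
    prior = {c: 1.0 for c in classes}                       # Laplace-smoothed
    counts = {c: defaultdict(lambda: 1.0) for c in classes}
    for words, weights in docs_with_weights:
        for c, w in weights.items():
            prior[c] += w
            for word in words:
                counts[c][word] += w
    total_prior = sum(prior.values())
    model = {}
    for c in classes:
        total = sum(counts[c][w] for w in vocab)
        model[c] = (prior[c] / total_prior,
                    {w: counts[c][w] / total for w in vocab})
    return model

def posterior(words, model):
    """E-step for one document: P(class | words)."""
    log_p = {c: math.log(p) + sum(math.log(cond[w]) for w in words if w in cond)
             for c, (p, cond) in model.items()}
    m = max(log_p.values())
    z = sum(math.exp(v - m) for v in log_p.values())
    return {c: math.exp(v - m) / z for c, v in log_p.items()}

def em(labeled, unlabeled, classes, iters=5):
    vocab = ({w for words, _ in labeled for w in words}
             | {w for words in unlabeled for w in words})
    hard = [(words, {y: 1.0}) for words, y in labeled]
    model = train(hard, classes, vocab)             # learn from labeled only
    for _ in range(iters):
        soft = [(words, posterior(words, model)) for words in unlabeled]
        model = train(hard + soft, classes, vocab)  # rebuild from all docs
    return model
```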
![Page 30: Combining labeled and unlabeled data for text categorization with a large number of categories](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813294550346895d992a91/html5/thumbnails/30.jpg)
Co-EM
- Initialize with labeled data
- Naïve Bayes on A: estimate labels, build naïve Bayes with all data
- Naïve Bayes on B: estimate labels, build naïve Bayes with all data

| | Uses feature split: No | Yes |
|---|---|---|
| Label all | EM | co-EM |
| Label few | | co-training |