combining labeled and unlabeled data for text categorization with a large number of categories
![Page 1: Combining labeled and unlabeled data for text categorization with a large number of categories](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813294550346895d992a91/html5/thumbnails/1.jpg)
Combining labeled and unlabeled data for text categorization with a large number of categories
Rayid Ghani
KDD Lab Project
![Page 2: Combining labeled and unlabeled data for text categorization with a large number of categories](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813294550346895d992a91/html5/thumbnails/2.jpg)
Supervised Learning with Labeled Data
Labeled data is required in large quantities and can be very expensive to collect.
![Page 3: Combining labeled and unlabeled data for text categorization with a large number of categories](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813294550346895d992a91/html5/thumbnails/3.jpg)
Why use Unlabeled data?
Very cheap in the case of text: web pages, newsgroups, email messages
May not be as useful as labeled data, but is available in enormous quantities
![Page 4: Combining labeled and unlabeled data for text categorization with a large number of categories](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813294550346895d992a91/html5/thumbnails/4.jpg)
Goal
Make learning more efficient and easy by reducing the amount of labeled data required for text classification with a large number of categories
![Page 5: Combining labeled and unlabeled data for text categorization with a large number of categories](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813294550346895d992a91/html5/thumbnails/5.jpg)
• ECOC: very accurate and efficient for text categorization with a large number of classes
• Co-Training: useful for combining labeled and unlabeled data with a small number of classes
![Page 6: Combining labeled and unlabeled data for text categorization with a large number of categories](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813294550346895d992a91/html5/thumbnails/6.jpg)
Related research with unlabeled data
Using EM in a generative model (Nigam et al. 1999)
Transductive SVMs (Joachims 1999)
Co-Training-type algorithms (Blum & Mitchell 1998; Collins & Singer 1999; Nigam & Ghani 2000)
![Page 7: Combining labeled and unlabeled data for text categorization with a large number of categories](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813294550346895d992a91/html5/thumbnails/7.jpg)
What is ECOC?
Solve multiclass problems by decomposing them into multiple binary problems (Dietterich & Bakiri 1995)
Use a learner to learn the binary problems
![Page 8: Combining labeled and unlabeled data for text categorization with a large number of categories](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813294550346895d992a91/html5/thumbnails/8.jpg)
Training ECOC

Each class is assigned a binary codeword; each column defines one binary function f1–f5 to learn:

|   | f1 | f2 | f3 | f4 | f5 |
|---|----|----|----|----|----|
| A | 0  | 0  | 1  | 1  | 0  |
| B | 1  | 0  | 1  | 0  | 0  |
| C | 0  | 1  | 1  | 1  | 0  |
| D | 0  | 1  | 0  | 0  | 1  |

Testing ECOC

Evaluate f1–f5 on a test example X to get a bit vector (here 1 1 1 1 0) and assign the class with the nearest codeword.
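Decoding by nearest codeword can be sketched in a few lines; the codewords below are taken from the matrix on this slide, and the nearest-codeword rule uses Hamming distance as in Dietterich & Bakiri (1995):

```python
# ECOC decoding sketch: each class has a binary codeword; a test
# example's predicted bit vector is matched to the nearest codeword
# by Hamming distance. Codewords follow the matrix on this slide.
codewords = {
    "A": [0, 0, 1, 1, 0],
    "B": [1, 0, 1, 0, 0],
    "C": [0, 1, 1, 1, 0],
    "D": [0, 1, 0, 0, 1],
}

def hamming(u, v):
    """Number of bit positions where u and v differ."""
    return sum(a != b for a, b in zip(u, v))

def decode(bits):
    """Return the class whose codeword is closest to the predicted bits."""
    return min(codewords, key=lambda c: hamming(codewords[c], bits))

# The test example X with predicted bits f1..f5 = 1 1 1 1 0
print(decode([1, 1, 1, 1, 0]))  # -> C (Hamming distance 1)
```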
![Page 9: Combining labeled and unlabeled data for text categorization with a large number of categories](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813294550346895d992a91/html5/thumbnails/9.jpg)
The Co-training algorithm
Loop (while unlabeled documents remain):
- Build classifiers A and B (Naïve Bayes)
- Classify unlabeled documents with A & B
- Add the most confident A predictions and the most confident B predictions to the labeled training examples

[Blum & Mitchell 1998]
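The loop above can be sketched as runnable code. Each document is assumed to be a pair of word lists, one per feature view; `TinyNB` is an illustrative minimal multinomial naïve Bayes written for this sketch, not the implementation used in the paper:

```python
import math
from collections import Counter

# Minimal co-training sketch (after Blum & Mitchell 1998).
# A document is a pair (view_a_words, view_b_words).

class TinyNB:
    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.prior = {c: labels.count(c) / len(labels) for c in self.classes}
        self.counts = {c: Counter() for c in self.classes}
        for words, y in zip(docs, labels):
            self.counts[y].update(words)
        self.vocab = {w for c in self.classes for w in self.counts[c]}
        return self

    def predict(self, words):
        """Return (best_class, confidence) for one document."""
        def logp(c):
            total = sum(self.counts[c].values()) + len(self.vocab)
            return math.log(self.prior[c]) + sum(
                math.log((self.counts[c][w] + 1) / total) for w in words)
        scores = {c: logp(c) for c in self.classes}
        best = max(scores, key=scores.get)
        z = sum(math.exp(s - scores[best]) for s in scores.values())
        return best, 1.0 / z  # normalised posterior of the best class

def co_train(labeled, unlabeled, rounds=5, n_best=1):
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(rounds):
        if not unlabeled:
            break
        # Build one classifier per feature view from the labeled set
        clf = [TinyNB().fit([d[v] for d, _ in labeled],
                            [y for _, y in labeled]) for v in (0, 1)]
        # Each classifier moves its most confident predictions, with
        # their predicted labels, into the labeled set
        for v in (0, 1):
            ranked = sorted(unlabeled,
                            key=lambda d: clf[v].predict(d[v])[1],
                            reverse=True)
            for d in ranked[:n_best]:
                label, _ = clf[v].predict(d[v])
                labeled.append((d, label))
                unlabeled.remove(d)
    return labeled
```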
![Page 10: Combining labeled and unlabeled data for text categorization with a large number of categories](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813294550346895d992a91/html5/thumbnails/10.jpg)
The Co-training Algorithm
Learn from labeled data: Naïve Bayes on A, Naïve Bayes on B
Estimate labels (with each classifier)
Select most confident
Add to labeled data
[Blum & Mitchell, 1998]
![Page 11: Combining labeled and unlabeled data for text categorization with a large number of categories](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813294550346895d992a91/html5/thumbnails/11.jpg)
One Intuition behind co-training
- A and B are redundant: A features are independent of B features
- Co-training is like learning with random classification noise: the most confident A predictions give B randomly drawn examples, mislabeled at A's (small) error rate
![Page 12: Combining labeled and unlabeled data for text categorization with a large number of categories](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813294550346895d992a91/html5/thumbnails/12.jpg)
ECOC + CoTraining = ECoTrain
ECOC decomposes multiclass problems into binary problems
Co-Training works great with binary problems
ECOC + Co-Train = Learn each binary problem in ECOC with Co-Training
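The decomposition step can be sketched directly; the 5-bit codeword matrix is reused from the training slide, while the co-training step for each binary problem is left out (the documents below are placeholder examples):

```python
# ECoTrain sketch: each ECOC bit binarizes the multiclass labels into
# one two-class problem, which would then be learned with co-training.
codewords = {
    "A": [0, 0, 1, 1, 0],
    "B": [1, 0, 1, 0, 0],
    "C": [0, 1, 1, 1, 0],
    "D": [0, 1, 0, 0, 1],
}

def binary_problems(labeled):
    """Yield one binarized copy of the training set per ECOC bit."""
    n_bits = len(next(iter(codewords.values())))
    for bit in range(n_bits):
        yield [(doc, codewords[y][bit]) for doc, y in labeled]

labeled = [("sports document", "A"), ("science document", "C")]
problems = list(binary_problems(labeled))
# problems[1] relabels: A -> 0, C -> 1 (the second bit of each codeword)
```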
![Page 13: Combining labeled and unlabeled data for text categorization with a large number of categories](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813294550346895d992a91/html5/thumbnails/13.jpg)
SPORTS  SCIENCE  ARTS  HEALTH  POLITICS  LAW
![Page 14: Combining labeled and unlabeled data for text categorization with a large number of categories](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813294550346895d992a91/html5/thumbnails/14.jpg)
What happens with sparse data?
Percent decrease in error with training size and length of code:

[Chart: % decrease in error (roughly 30–70%) vs. training size (0–100) for 15-bit, 31-bit, and 63-bit codes]
![Page 15: Combining labeled and unlabeled data for text categorization with a large number of categories](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813294550346895d992a91/html5/thumbnails/15.jpg)
Datasets
- Hoovers-255: collection of 4285 corporate websites; each company is classified into one of 255 categories; baseline 2%
- Jobs-65 (from WhizBang): job postings with two feature sets (title, description); 65 categories; baseline 11%
![Page 16: Combining labeled and unlabeled data for text categorization with a large number of categories](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813294550346895d992a91/html5/thumbnails/16.jpg)
[Chart: percentage of examples per class, ranging from 0 to 12%]
![Page 17: Combining labeled and unlabeled data for text categorization with a large number of categories](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813294550346895d992a91/html5/thumbnails/17.jpg)
Results
| Dataset | Naïve Bayes (no unlabeled data), 10% labeled | Naïve Bayes, 100% labeled | ECOC (no unlabeled data), 10% labeled | ECOC, 100% labeled | EM, 10% labeled | Co-Training, 10% labeled | ECOC + Co-Training, 10% labeled |
|---|---|---|---|---|---|---|---|
| Jobs-65 | 50.1 | 68.2 | 59.3 | 71.2 | 58.2 | 54.1 | 64.5 |
| Hoovers-255 | 15.2 | 32.0 | 24.8 | 36.5 | 9.1 | 10.2 | 27.6 |
![Page 18: Combining labeled and unlabeled data for text categorization with a large number of categories](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813294550346895d992a91/html5/thumbnails/18.jpg)
Results
[Chart: precision vs. recall (both 0–100) for ECOC + CoTrain, Naïve Bayes, and EM]
![Page 19: Combining labeled and unlabeled data for text categorization with a large number of categories](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813294550346895d992a91/html5/thumbnails/19.jpg)
What Next?
- Use an improved version of co-training (gradient descent): less prone to random fluctuations; uses all unlabeled data at every iteration
- Use Co-EM (Nigam & Ghani 2000), a hybrid of EM and Co-Training
![Page 20: Combining labeled and unlabeled data for text categorization with a large number of categories](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813294550346895d992a91/html5/thumbnails/20.jpg)
Summary
![Page 21: Combining labeled and unlabeled data for text categorization with a large number of categories](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813294550346895d992a91/html5/thumbnails/21.jpg)
The Co-training setting
Two views of each example: the hyperlink text pointing to a page, and the page's own text:

- "…My advisor…" → Tom Mitchell: "Fredkin Professor of AI…"
- "…Professor Blum…" → Avrim Blum: "My research interests are…"
- "…My grandson…" → Johnny: "I like horsies!"

Classifier A | Classifier B
![Page 22: Combining labeled and unlabeled data for text categorization with a large number of categories](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813294550346895d992a91/html5/thumbnails/22.jpg)
Learning from Labeled and Unlabeled Data:
Using Feature Splits
- Co-training [Blum & Mitchell 98]
- Meta-bootstrapping [Riloff & Jones 99]
- coBoost [Collins & Singer 99]
- Unsupervised WSD [Yarowsky 95]
Consider this the co-training setting
![Page 23: Combining labeled and unlabeled data for text categorization with a large number of categories](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813294550346895d992a91/html5/thumbnails/23.jpg)
Learning from Labeled and Unlabeled Data:
Extending supervised learning
MaxEnt Discrimination [Jaakkola et al. 99]
Expectation Maximization [Nigam et al. 98]
Transductive SVMs [Joachims 99]
![Page 24: Combining labeled and unlabeled data for text categorization with a large number of categories](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813294550346895d992a91/html5/thumbnails/24.jpg)
Using Unlabeled Data with EM

- Estimate labels of the unlabeled documents (with naïve Bayes)
- Use all documents to build a new naïve Bayes classifier
![Page 25: Combining labeled and unlabeled data for text categorization with a large number of categories](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813294550346895d992a91/html5/thumbnails/25.jpg)
Co-training vs. EM
| | Co-training | EM |
|---|---|---|
| Feature split | Uses it | Ignores it |
| Labeling | Incremental | Iterative |
| Labels | Hard | Probabilistic |
Which differences matter?
![Page 26: Combining labeled and unlabeled data for text categorization with a large number of categories](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813294550346895d992a91/html5/thumbnails/26.jpg)
Hybrids of Co-training and EM
| Labeling | Uses feature split: Yes | No |
|---|---|---|
| Incremental | co-training | self-training |
| Iterative | co-EM | EM |

[Diagrams: each hybrid alternates labeling and learning with naïve Bayes on view A, view B, or A & B together; incremental variants add only the best predictions, iterative variants label everything]
![Page 27: Combining labeled and unlabeled data for text categorization with a large number of categories](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813294550346895d992a91/html5/thumbnails/27.jpg)
Learning from Unlabeled Data using Feature Splits

- coBoost [Collins & Singer 99]
- Meta-bootstrapping [Riloff & Jones 99]
- Unsupervised WSD [Yarowsky 95]
- Co-training [Blum & Mitchell 98]
![Page 28: Combining labeled and unlabeled data for text categorization with a large number of categories](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813294550346895d992a91/html5/thumbnails/28.jpg)
Intuition behind Co-training
- A and B are redundant: A features are independent of B features
- Co-training is like learning with random classification noise: the most confident A predictions give B randomly drawn examples, mislabeled at A's (small) error rate
![Page 29: Combining labeled and unlabeled data for text categorization with a large number of categories](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813294550346895d992a91/html5/thumbnails/29.jpg)
Using Unlabeled Data with EM
- Initially learn from labeled data only
- Estimate labels of the unlabeled documents
- Use all documents to rebuild the naïve Bayes classifier

[Nigam, McCallum, Thrun & Mitchell, 1998]
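The EM loop on this slide can be sketched with a naïve Bayes whose M-step accumulates fractional counts from the estimated (probabilistic) labels. This is an illustrative simplification under Laplace smoothing, not the code from the paper:

```python
import math
from collections import defaultdict

# Semi-supervised EM sketch (after Nigam, McCallum, Thrun & Mitchell 1998):
# initialise naive Bayes from labeled docs, then alternate estimating
# probabilistic labels for unlabeled docs (E-step) and rebuilding the
# classifier from all docs with fractional counts (M-step).

def train(docs_with_weights, classes, vocab):
    """M-step: naive Bayes parameters from (words, {class: weight}) pairs."""
    prior = {c: 1.0 for c in classes}                       # Laplace-smoothed
    counts = {c: defaultdict(lambda: 1.0) for c in classes}
    for words, weights in docs_with_weights:
        for c, w in weights.items():
            prior[c] += w
            for word in words:
                counts[c][word] += w
    total_prior = sum(prior.values())
    model = {}
    for c in classes:
        total = sum(counts[c][w] for w in vocab)
        model[c] = (prior[c] / total_prior,
                    {w: counts[c][w] / total for w in vocab})
    return model

def posterior(words, model):
    """E-step for one document: P(class | words)."""
    log_p = {c: math.log(p) + sum(math.log(cond[w]) for w in words if w in cond)
             for c, (p, cond) in model.items()}
    m = max(log_p.values())
    z = sum(math.exp(v - m) for v in log_p.values())
    return {c: math.exp(v - m) / z for c, v in log_p.items()}

def em(labeled, unlabeled, classes, iters=5):
    vocab = ({w for words, _ in labeled for w in words}
             | {w for words in unlabeled for w in words})
    hard = [(words, {y: 1.0}) for words, y in labeled]
    model = train(hard, classes, vocab)             # learn from labeled only
    for _ in range(iters):
        soft = [(words, posterior(words, model)) for words in unlabeled]
        model = train(hard + soft, classes, vocab)  # rebuild from all docs
    return model
```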
![Page 30: Combining labeled and unlabeled data for text categorization with a large number of categories](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813294550346895d992a91/html5/thumbnails/30.jpg)
Co-EM
- Initialize with labeled data
- Naïve Bayes on A: estimate labels, build naïve Bayes with all data
- Naïve Bayes on B: estimate labels, build naïve Bayes with all data

| | Uses feature split: No | Yes |
|---|---|---|
| Label all | EM | co-EM |
| Label few | | co-training |