semi-supervised single-label text categorization using...
Semi-supervised Single-label Text Categorization using Centroid-based Classifiers
Ana Cardoso-Cachopo Arlindo Oliveira
Instituto Superior Técnico, Technical University of Lisbon / INESC-ID
SAC-IAR 2007, March 12th
(IST-TULisbon/INESC-ID) Ana Cardoso-Cachopo SAC-IAR 2007, March 12th 1 / 19
Outline
1 Problem Description
2 Characteristics of the Datasets
3 Why use Centroid-based Methods
4 Why use Unlabeled Data
5 Incorporate Unlabeled Data using EM
6 Incrementally Incorporate Unlabeled Data
7 Experimental Results
8 Conclusions and Future Work
Problem Description
Text Categorization
Single-label
Datasets
- Reuters 21578 (R8)
- 20 Newsgroups (20Ng)
- Web Knowledge Base (Web4)
- Cade (Cade12)
Characteristics of the Datasets
Dataset  Train Docs  Test Docs  Total Docs  Smallest Class  Largest Class
R8             5485       2189        7674              51           3923
20Ng          11293       7528       18821             628            999
Web4           2803       1396        4199             504           1641
Cade12        27322      13661       40983             625           8473

Numbers of documents for the datasets: number of training documents, number of test documents, total number of documents, number of documents in the smallest class, and number of documents in the largest class.
Why use Centroid-based Methods
Very fast
Good Accuracy
[Figure: bar chart on a 0.0 to 1.0 scale comparing the Centroid, SVM, k-NN, LSI, Vector, and Dumb classifiers on R8, 20Ng, Web4, and Cade12.]
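To make the "very fast" point concrete, here is a minimal sketch of a centroid-based classifier. It assumes the class centroid is the plain mean of the class's document vectors and that similarity is cosine similarity. The slides do not say which centroid formula or similarity measure was used, so both choices, along with the function names and the toy vectors, are illustrative assumptions.

```python
import numpy as np

def normalize_rows(X):
    """Scale each row to unit Euclidean length (all-zero rows stay zero)."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.where(norms == 0.0, 1.0, norms)

def fit_centroids(X, y, n_classes):
    """One centroid per class: the mean of that class's document vectors (assumed formula)."""
    centroids = np.zeros((n_classes, X.shape[1]))
    for c in range(n_classes):
        members = X[y == c]
        if len(members) > 0:
            centroids[c] = members.mean(axis=0)
    return centroids

def predict(X, centroids):
    """Assign each document to the class whose centroid is most cosine-similar."""
    similarities = normalize_rows(X) @ normalize_rows(centroids).T
    return similarities.argmax(axis=1)

# Toy term-frequency vectors over a hypothetical 3-term vocabulary.
X_train = np.array([[3.0, 0.0, 1.0], [2.0, 1.0, 0.0],
                    [0.0, 4.0, 1.0], [1.0, 3.0, 0.0]])
y_train = np.array([0, 0, 1, 1])

centroids = fit_centroids(X_train, y_train, n_classes=2)
predictions = predict(np.array([[5.0, 1.0, 0.0], [0.0, 2.0, 2.0]]), centroids)
print(predictions)  # prints [0 1]
```

Training is a single pass over the labeled documents, and classifying a document needs only one similarity computation per class, which is where the speed comes from.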
Why use Unlabeled Data
Small amounts of labeled data available
Large amounts of unlabeled data available
Hard or expensive to label new data
Incorporate Unlabeled Data using EM
Applies when the entire dataset is available from the start, as in a library.
Inputs: A set of labeled document vectors, L, and a set of unlabeled document vectors, U.
Initialization step: For each class $c_j$ appearing in L, determine the class's centroid $\vec{c_j}$, using one of the formulas for the centroids and considering only the labeled documents.
Estimation step: For each unlabeled document $d_j \in U$, classify it according to the available centroids.
Maximization step: For each class $c_j$, update its centroid $\vec{c_j}^{new}$, considering the labeled documents and the labels for the unlabeled documents obtained in the previous step.
Iterate: Until the centroids do not change in two consecutive iterations.
Outputs: For each class $c_j$, the centroid $\vec{c_j}$.
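The steps above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the centroid is assumed to be the mean vector, classification is assumed to use cosine similarity, and `max_iter` is a hypothetical safety cap that the slide does not mention. The function names and toy data are likewise invented for the example.

```python
import numpy as np

def unit_rows(X):
    """Scale each row to unit length (all-zero rows stay zero)."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.where(norms == 0.0, 1.0, norms)

def class_means(X, y, n_classes, fallback):
    """Mean vector per class; a class with no members keeps its fallback row."""
    centroids = fallback.copy()
    for c in range(n_classes):
        members = X[y == c]
        if len(members) > 0:
            centroids[c] = members.mean(axis=0)
    return centroids

def classify(X, centroids):
    """Index of the most cosine-similar centroid for each row of X."""
    return (unit_rows(X) @ unit_rows(centroids).T).argmax(axis=1)

def em_centroids(L, y_L, U, n_classes, max_iter=100):
    # Initialization step: centroids from the labeled documents only.
    centroids = class_means(L, y_L, n_classes, np.zeros((n_classes, L.shape[1])))
    X_all = np.vstack([L, U])
    for _ in range(max_iter):  # max_iter is an assumed cap, not part of the slide
        # Estimation step: label each unlabeled document with its closest centroid.
        y_U = classify(U, centroids)
        # Maximization step: recompute centroids from labeled and newly labeled documents.
        y_all = np.concatenate([y_L, y_U])
        new_centroids = class_means(X_all, y_all, n_classes, centroids)
        # Stop once the centroids are unchanged between consecutive iterations.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids

# Toy data: two labeled documents and four unlabeled ones in a 2-term space.
L = np.array([[1.0, 0.0], [0.0, 1.0]])
y_L = np.array([0, 1])
U = np.array([[0.9, 0.1], [0.8, 0.3], [0.2, 0.7], [0.1, 1.0]])
centroids = em_centroids(L, y_L, U, n_classes=2)
print(centroids)
```

Each unlabeled document receives a single hard label in the estimation step, matching the "classify it" wording of the slide; a soft-assignment variant would weight documents by class membership instead.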
Incrementally Incorporate Unlabeled Data
Applies when the dataset changes over time, as with a news feed or the web.
Inputs: A set of labeled document vectors, L, and a set of unlabeled document vectors, U.
Initialization step: For each class $c_j$ appearing in L, determine the class's centroid $\vec{c_j}$, using one of the formulas for the centroids and considering only the labeled documents.
Iterate: For each unlabeled document $d_j \in U$:
- Classify $d_j$ according to its similarity to each of the centroids.
- Update the centroids with the new document $d_j$ classified in the previous step.
Outputs: For each class $c_j$, the centroid $\vec{c_j}$.
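A minimal sketch of the incremental variant follows, again assuming mean-vector centroids and cosine similarity (the slide names neither). Keeping a running sum and count per class is an implementation choice made here so that each update touches only one class; the function names and toy data are hypothetical.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0.0 else 0.0

def incremental_centroids(L, y_L, U, n_classes):
    sums = np.zeros((n_classes, L.shape[1]))
    counts = np.zeros(n_classes, dtype=int)
    # Initialization step: accumulate the labeled documents per class.
    for x, c in zip(L, y_L):
        sums[c] += x
        counts[c] += 1
    # Iterate: handle the unlabeled documents one at a time, in arrival order.
    for d in U:
        centroids = sums / np.maximum(counts, 1)[:, None]
        # Classify d by its similarity to each current centroid.
        best = max(range(n_classes), key=lambda c: cosine(d, centroids[c]))
        # Update only the winning class with the newly classified document.
        sums[best] += d
        counts[best] += 1
    return sums / np.maximum(counts, 1)[:, None]

# Toy data: two labeled documents, then a stream of four unlabeled ones.
L = np.array([[1.0, 0.0], [0.0, 1.0]])
y_L = np.array([0, 1])
U = np.array([[0.9, 0.1], [0.8, 0.3], [0.2, 0.7], [0.1, 1.0]])
centroids = incremental_centroids(L, y_L, U, n_classes=2)
print(centroids)
```

Unlike the EM variant, each document is classified exactly once and never revisited, so the result can depend on the order in which documents arrive.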
Experimental Results - Synthetic Dataset
[Figure: Gaussian Distributions with µ1 = 1.0, σ1 = 1.0 and µ2 = −1.0, σ2 = 1.0, plotted for x from -5 to 5.]
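For reference, the overlap between two such Gaussians puts a ceiling on the accuracy any single-threshold classifier can reach. Assuming equal class priors and one-dimensional samples (assumptions made here, not stated on the slide), the best threshold is the midpoint 0 and the resulting accuracy is $\Phi(\mu / \sigma)$, where $\Phi$ is the standard normal CDF. The helper names below are hypothetical.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def midpoint_threshold_accuracy(mu, sigma):
    """Accuracy of thresholding at 0 for classes N(+mu, sigma) and N(-mu, sigma).

    Assumes equal class priors; under that assumption the midpoint threshold
    is the best possible decision rule for these two distributions.
    """
    return normal_cdf(mu / sigma)

accuracy_sigma_1 = midpoint_threshold_accuracy(1.0, 1.0)
print(round(accuracy_sigma_1, 4))  # prints 0.8413
```

Increasing `sigma` while keeping the means fixed widens the overlap and lowers this ceiling toward 0.5.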
Experimental Results - Synthetic Dataset
[Figure: Accuracy versus labeled documents per class (0 to 20) for Centroid, EM, and Inc, with µ1 = 1.0, σ1 = 1.0, µ2 = −1.0, σ2 = 1.0. A second panel shows the corresponding Gaussian distributions for x from -5 to 5.]
Experimental Results - Synthetic Dataset
[Figure: Accuracy versus labeled documents per class (0 to 20) for Centroid, EM, and Inc, with µ1 = 1.0, σ1 = 2.0, µ2 = −1.0, σ2 = 2.0. A second panel shows the corresponding Gaussian distributions for x from -5 to 5.]
Experimental Results - Synthetic Dataset
[Figure: accuracy (axis from 0.52 to 0.59) vs. labeled documents per class (0 to 20) for µ1 = 1.0, σ1 = 4.0, µ2 = −1.0, σ2 = 4.0. Series: Centroid, EM, Inc. Inset: the two Gaussian distributions, x from −5 to 5.]
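The drop in accuracy as σ grows has a simple upper bound. For two Gaussian classes that share σ and have equal priors (an assumption; the slides do not state the priors), the best decision threshold is the midpoint between the means, and no classifier can exceed an accuracy of Φ(|µ1 − µ2| / (2σ)), where Φ is the standard normal CDF. The sketch below evaluates that bound for the three settings shown; `accuracy_ceiling` is an illustrative name.

```python
from statistics import NormalDist

def accuracy_ceiling(mu1, mu2, sigma):
    """Bayes-optimal accuracy for two equal-prior Gaussians sharing sigma."""
    return NormalDist().cdf(abs(mu1 - mu2) / (2.0 * sigma))

# The three sigma values used on the synthetic-dataset slides.
ceilings = {sigma: accuracy_ceiling(1.0, -1.0, sigma) for sigma in (1.0, 2.0, 4.0)}
for sigma, value in ceilings.items():
    print(f"sigma={sigma}: ceiling={value:.3f}")
```

The bound is about 0.841 for σ = 1, 0.691 for σ = 2, and 0.599 for σ = 4, roughly in line with the upper ends of the accuracy axes on the three slides.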
![Page 45: Semi-supervised Single-label Text Categorization using ...web.ist.utl.pt/~acardoso/docs/2007-SAC-IAR-semisupervised-presentation.pdf · Semi-supervised Single-label Text Categorization](https://reader034.vdocuments.us/reader034/viewer/2022042022/5e7a40488251c460fb5f44f8/html5/thumbnails/45.jpg)
Experimental Results - Real World Datasets
[Figure: R8. Accuracy (axis from 0.55 to 0.95) vs. labeled documents per class (0 to 40). Series: Centroid, EM, Inc.]
![Page 46: Semi-supervised Single-label Text Categorization using ...web.ist.utl.pt/~acardoso/docs/2007-SAC-IAR-semisupervised-presentation.pdf · Semi-supervised Single-label Text Categorization](https://reader034.vdocuments.us/reader034/viewer/2022042022/5e7a40488251c460fb5f44f8/html5/thumbnails/46.jpg)
Experimental Results - Real World Datasets
[Figure: 20Ng. Accuracy (axis from 0.2 to 0.8) vs. labeled documents per class (0 to 70). Series: Centroid, EM, Inc.]
![Page 47: Semi-supervised Single-label Text Categorization using ...web.ist.utl.pt/~acardoso/docs/2007-SAC-IAR-semisupervised-presentation.pdf · Semi-supervised Single-label Text Categorization](https://reader034.vdocuments.us/reader034/viewer/2022042022/5e7a40488251c460fb5f44f8/html5/thumbnails/47.jpg)
Experimental Results - Real World Datasets
[Figure: Web4. Accuracy (axis from 0.2 to 0.8) vs. labeled documents per class (0 to 70). Series: Centroid, EM, Inc.]
![Page 48: Semi-supervised Single-label Text Categorization using ...web.ist.utl.pt/~acardoso/docs/2007-SAC-IAR-semisupervised-presentation.pdf · Semi-supervised Single-label Text Categorization](https://reader034.vdocuments.us/reader034/viewer/2022042022/5e7a40488251c460fb5f44f8/html5/thumbnails/48.jpg)
Experimental Results - Real World Datasets
[Figure: Cade12. Accuracy (axis from 0.1 to 0.4) vs. labeled documents per class (0 to 70). Series: Centroid, EM, Inc.]
![Page 49: Semi-supervised Single-label Text Categorization using ...web.ist.utl.pt/~acardoso/docs/2007-SAC-IAR-semisupervised-presentation.pdf · Semi-supervised Single-label Text Categorization](https://reader034.vdocuments.us/reader034/viewer/2022042022/5e7a40488251c460fb5f44f8/html5/thumbnails/49.jpg)
Experimental Results - Real World Datasets
[Figure: summary of the four real-world plots (R8, 20Ng, Web4, Cade12), each showing accuracy vs. labeled documents per class for Centroid, EM, and Inc.]
![Page 50: Semi-supervised Single-label Text Categorization using ...web.ist.utl.pt/~acardoso/docs/2007-SAC-IAR-semisupervised-presentation.pdf · Semi-supervised Single-label Text Categorization](https://reader034.vdocuments.us/reader034/viewer/2022042022/5e7a40488251c460fb5f44f8/html5/thumbnails/50.jpg)
Conclusions and Future Work
If the initial model of the data is sufficiently precise, using unlabeled data improves performance.
Using unlabeled data degrades performance if the initial model is not precise enough.
As future work, we plan to extend this approach to multi-label datasets.
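The first two conclusions can be illustrated on a toy case. The sketch below is not the authors' exact update rule; it assumes a simple self-training scheme in which each unlabeled point is assigned to its nearest centroid and then folded into that centroid as a running mean. The function name `incremental_update` and the sample values are hypothetical.

```python
def incremental_update(centroids, counts, unlabeled):
    """Assign each unlabeled point to its nearest centroid, then fold it
    into that centroid as a running mean. Returns new lists."""
    centroids = list(centroids)
    counts = list(counts)
    for x in unlabeled:
        k = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
        counts[k] += 1
        centroids[k] += (x - centroids[k]) / counts[k]
    return centroids, counts

# Suppose class 0 truly lies on the positive side and class 1 on the negative side.
unlabeled = [3.0, -3.0]

# Precise initial model: centroids start on the correct sides and move outward.
good, _ = incremental_update([1.0, -1.0], [1, 1], unlabeled)

# Imprecise initial model: centroids start on the wrong sides, and the
# unlabeled points reinforce the mistake instead of correcting it.
bad, _ = incremental_update([-1.0, 1.0], [1, 1], unlabeled)

print(good, bad)
```

In the first call the centroids end at 2.0 and −2.0, still on the correct sides. In the second they end at −2.0 and 2.0, so the initial error is amplified rather than repaired.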
![Page 53: Semi-supervised Single-label Text Categorization using ...web.ist.utl.pt/~acardoso/docs/2007-SAC-IAR-semisupervised-presentation.pdf · Semi-supervised Single-label Text Categorization](https://reader034.vdocuments.us/reader034/viewer/2022042022/5e7a40488251c460fb5f44f8/html5/thumbnails/53.jpg)
Thank You.
Any Questions?