Clustering tagged documents with labeled and unlabeled documents

Page 1: Clustering tagged documents with  labeled and unlabeled documents

Intelligent Database Systems Lab

Presenter : JIAN-REN CHEN

Authors : Chien-Liang Liu*, Wen-Hoar Hsaio, Chia-Hoang Lee, Chun-Hsien Chen

2013, IPM

Clustering tagged documents with labeled and unlabeled documents

Page 2: Clustering tagged documents with  labeled and unlabeled documents

Outlines

- Motivation

- Objectives

- Methodology

- Experiments

- Conclusions

- Comments

Page 3: Clustering tagged documents with  labeled and unlabeled documents

Motivation

Tags provide semantic information about resources, and they can help machines perform classification or clustering tasks more accurately.

Probabilistic latent semantic analysis (PLSA)

- aspect model

- statistical clustering model

Page 4: Clustering tagged documents with  labeled and unlabeled documents

Objectives

This study employs Constrained-PLSA to cluster tagged documents using a small number of seeds (labeled documents).

Constrained-PLSA is based on the statistical clustering model rather than the aspect model.

Page 5: Clustering tagged documents with  labeled and unlabeled documents

Methodology - PLSA

Terms (keywords) of the document collection

documents

E-step

M-step
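
The equations on this slide did not survive in the transcript. As a rough illustration only, the sketch below runs the textbook EM updates for the aspect-model form of PLSA on a small term-count matrix. The function name `plsa`, its parameters, and the asymmetric parameterisation P(w|z), P(z|d) are assumptions made for this example; they are not taken from the slides or from the paper.

```python
import random


def plsa(counts, n_topics, n_iter=50, seed=0):
    """Fit an aspect-model PLSA by EM (illustrative sketch, not the paper's code).

    counts[d][w] is the number of times term w occurs in document d.
    Returns (p_w_z, p_z_d); every row of each is a probability distribution.
    """
    rng = random.Random(seed)
    n_docs, n_terms = len(counts), len(counts[0])

    def normalize(row):
        total = sum(row)
        if total <= 0:
            return [1.0 / len(row)] * len(row)
        return [x / total for x in row]

    # Random initialisation of P(w|z) and P(z|d).
    p_w_z = [normalize([rng.random() + 1e-3 for _ in range(n_terms)])
             for _ in range(n_topics)]
    p_z_d = [normalize([rng.random() + 1e-3 for _ in range(n_topics)])
             for _ in range(n_docs)]

    for _ in range(n_iter):
        new_w_z = [[0.0] * n_terms for _ in range(n_topics)]
        new_z_d = [[0.0] * n_topics for _ in range(n_docs)]
        for d in range(n_docs):
            for w in range(n_terms):
                n_dw = counts[d][w]
                if n_dw == 0:
                    continue
                # E-step: P(z | d, w) is proportional to P(z|d) * P(w|z).
                post = normalize([p_z_d[d][z] * p_w_z[z][w]
                                  for z in range(n_topics)])
                # Accumulate expected counts for the M-step.
                for z in range(n_topics):
                    new_w_z[z][w] += n_dw * post[z]
                    new_z_d[d][z] += n_dw * post[z]
        # M-step: renormalise the expected counts into probabilities.
        p_w_z = [normalize(row) for row in new_w_z]
        p_z_d = [normalize(row) for row in new_z_d]
    return p_w_z, p_z_d


# Toy data (made up): two documents use terms 0-1, two use terms 2-3.
toy_counts = [[5, 4, 0, 0], [4, 6, 0, 0], [0, 0, 5, 5], [0, 0, 6, 3]]
p_w_z, p_z_d = plsa(toy_counts, n_topics=2)
```

In this form every document gets its own mixture P(z|d) over topics, which is what distinguishes the aspect model from the statistical clustering model, where each document is assigned to a single cluster.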

Page 6: Clustering tagged documents with  labeled and unlabeled documents

Methodology - Constrained-PLSA

E-step

M-step
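
The Constrained-PLSA update equations are also missing from the transcript. The sketch below is therefore not the authors' formulation. It only illustrates the general idea stated in the Objectives: a statistical clustering model (here assumed to be a mixture of multinomials in which each document belongs to one cluster) where the few seed documents keep their known cluster throughout EM. The name `seeded_mixture`, the smoothing constant `alpha`, and the toy data are all assumptions.

```python
import math


def seeded_mixture(counts, n_clusters, seeds, n_iter=30, alpha=0.01):
    """Seeded mixture-of-multinomials clustering (illustrative sketch only).

    counts[d][w]: count of term or tag w in document d.
    seeds: dict mapping a document index to its known cluster (the labeled data).
    Returns resp, where resp[d][z] approximates P(z | d).
    """
    n_docs, n_terms = len(counts), len(counts[0])

    # Seeds start (and stay) one-hot; unlabeled documents start uniform.
    resp = []
    for d in range(n_docs):
        if d in seeds:
            resp.append([1.0 if z == seeds[d] else 0.0 for z in range(n_clusters)])
        else:
            resp.append([1.0 / n_clusters] * n_clusters)

    for _ in range(n_iter):
        # M-step: re-estimate P(z) and smoothed P(w|z) from the responsibilities.
        p_z = [sum(resp[d][z] for d in range(n_docs)) / n_docs
               for z in range(n_clusters)]
        p_w_z = []
        for z in range(n_clusters):
            row = [alpha + sum(resp[d][z] * counts[d][w] for d in range(n_docs))
                   for w in range(n_terms)]
            total = sum(row)
            p_w_z.append([x / total for x in row])

        # E-step: update P(z|d) for unlabeled documents only (the constraint).
        for d in range(n_docs):
            if d in seeds:
                continue
            logp = [math.log(max(p_z[z], 1e-12))
                    + sum(counts[d][w] * math.log(p_w_z[z][w])
                          for w in range(n_terms))
                    for z in range(n_clusters)]
            m = max(logp)  # subtract the max for numerical stability
            unnorm = [math.exp(v - m) for v in logp]
            total = sum(unnorm)
            resp[d] = [u / total for u in unnorm]
    return resp


# Toy data (made up): documents 0 and 2 are the seeds for clusters 0 and 1.
toy_counts = [[5, 4, 0, 0], [4, 6, 0, 0], [0, 0, 5, 5], [0, 0, 6, 3], [3, 3, 0, 1]]
resp = seeded_mixture(toy_counts, n_clusters=2, seeds={0: 0, 2: 1})
labels = [row.index(max(row)) for row in resp]
```

Because the seeds break the symmetry between clusters, this sketch needs no random initialisation, but it assumes that every cluster has at least one seed.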

Page 7: Clustering tagged documents with  labeled and unlabeled documents

Experiments - Data set A (CiteULike)

Page 8: Clustering tagged documents with  labeled and unlabeled documents

Experiments (Data set A)

Page 9: Clustering tagged documents with  labeled and unlabeled documents

Experiments - Data set B (CiteULike)

Page 10: Clustering tagged documents with  labeled and unlabeled documents

Experiments (Data set B)

Page 11: Clustering tagged documents with  labeled and unlabeled documents

Conclusions

• The "tags as words" representation scheme performs more stably than the "words + tags" representation scheme.

• Unsupervised learning methods fail to function properly on the data set with noisy information, whereas Constrained-PLSA functions properly and remains stable even when only a small amount of labeled data is available.

Page 12: Clustering tagged documents with  labeled and unlabeled documents

Comments

• Advantages

- Constrained-PLSA outperforms the other methods

• Disadvantage

- too much manual processing in the experiments

• Applications

- text mining

- tagged document clustering