Efficient Topic-based Unsupervised Name Disambiguation
Yang Song, Jian Huang, Isaac G. Councill, Jia Li, C. Lee Giles
JCDL 2007
Advisor: Hsin-Hsi Chen
Reporter: Y.H Chang
2008-03-21
2008/03/21 Y.H Chang 2
Outline
Introduction
Related Work
Method
  Topic-based PLSA (Probabilistic Latent Semantic Analysis)
  Topic-based LDA (Latent Dirichlet Allocation)
  Clustering
Experiment
Conclusion
Introduction
Name ambiguity: sharing the same name, misspellings, name abbreviations.
Searching Google for "Yang Song": the first result page shows five different people's home pages.
In this paper, we focus on the problem of disambiguating person names within web pages and scientific documents.
Introduction
Method overview:
1. Learn a topic-name matrix by PLSA and LDA (feature set).
2. Disambiguate topics with an agglomerative clustering method.
3. Within each similar topic, generate a name-name matrix.
4. Disambiguate people with another agglomerative clustering method.
Outline
Introduction
Related Work
Method
  Topic-based PLSA
  Topic-based LDA
  Clustering
Experiment
Conclusion
Related Work
[19] G. S. Mann and D. Yarowsky. Unsupervised personal name disambiguation. 2003. (transitivity problem)
[9] H. Han, H. Zha, and C. L. Giles. Name disambiguation in author citations using a k-way spectral clustering method. 2005. (complexity O(N²))
[12] J. Huang, S. Ertekin, and C. L. Giles. Efficient name disambiguation for large-scale databases. 2006.
[2] I. Bhattacharya and L. Getoor. A latent Dirichlet model for unsupervised entity resolution. 2006.
The aforementioned work mainly tackled the name disambiguation problem using the metadata records of the authors. This paper solves the problem in a novel way, by accounting for the topic distribution of the authors and adopting unsupervised methods.
Outline
Introduction
Related Work
Method
  Topic-based PLSA
  Topic-based LDA
  Clustering
Experiment
Conclusion
Method
Learn a topic-name matrix by PLSA and LDA (feature set), then disambiguate topics with an agglomerative clustering method.
PLSA
From a statistical point of view, Hofmann (1999) presented an alternative to LSA: Probabilistic Latent Semantic Analysis/Indexing (PLSA/PLSI), which discovers sets of latent variables.
The model is described as an aspect model, assuming the existence of hidden factors underlying the co-occurrences among two sets of objects.
PLSA
The goal of model fitting for PLSA is to estimate the parameters P(z), P(a|z), P(z|d), and P(w|z), given a set of observations (d, a, w). The standard way to estimate these probability values is the Expectation-Maximization (EM) algorithm.
Notation: z is a topic (K topics in total), a is a person's name, w is a word, and d is a document.
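The EM loop can be sketched as follows. This is a minimal word-only PLSA over a document-word count matrix, using the conditional formulation P(z|d)P(w|z); the paper's full model additionally estimates P(a|z) for names in the same fashion. The function name `plsa_em` and its interface are illustrative, not from the paper.

```python
import numpy as np

def plsa_em(counts, K, n_iters=50, seed=0):
    """Fit a minimal PLSA model to a document-word count matrix.

    counts: (D, W) array, counts[d, w] = occurrences of word w in doc d.
    Returns P(z|d) of shape (D, K) and P(w|z) of shape (K, W).
    """
    rng = np.random.default_rng(seed)
    D, W = counts.shape
    p_z_d = rng.random((D, K)); p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    p_w_z = rng.random((K, W)); p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    for _ in range(n_iters):
        # E-step: responsibilities P(z|d,w) ∝ P(z|d) P(w|z)
        post = p_z_d[:, :, None] * p_w_z[None, :, :]        # (D, K, W)
        post /= post.sum(axis=1, keepdims=True) + 1e-12
        # M-step: re-estimate parameters from expected counts
        weighted = counts[:, None, :] * post                 # (D, K, W)
        p_w_z = weighted.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + 1e-12
        p_z_d = weighted.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + 1e-12
    return p_z_d, p_w_z
```

Each iteration alternates a closed-form E-step and M-step, so the data log-likelihood is non-decreasing.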
PLSA
PLSA: Predicting New Name Appearances
Additionally, there is no natural way to assign probability to new documents.
Therefore, to predict the topics of new documents (with potentially new names) after training, the estimated P(w|z) parameters are used to estimate P(a|z) for new names a in test document dnew through a “folding-in” process.
Specifically, the E-step is the same as equation (4); however, the M-step maintains the original P(w|z) and only updates P(a|z) as well as P(z|d).
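A minimal sketch of the folding-in step, under the same word-only simplification as before: P(w|z) is held fixed and only the new document's topic mixture is re-estimated (the paper's version also updates P(a|z) for new names). The function name `fold_in` is illustrative.

```python
import numpy as np

def fold_in(new_counts, p_w_z, n_iters=50, seed=0):
    """Estimate P(z|d_new) for an unseen document, keeping P(w|z) fixed.

    new_counts: (W,) word counts of the new document.
    p_w_z: (K, W) trained topic-word distributions (held fixed).
    """
    rng = np.random.default_rng(seed)
    K = p_w_z.shape[0]
    p_z_d = rng.random(K); p_z_d /= p_z_d.sum()
    for _ in range(n_iters):
        # E-step: P(z|d_new, w) ∝ P(z|d_new) P(w|z), same form as in training
        post = p_z_d[:, None] * p_w_z                      # (K, W)
        post /= post.sum(axis=0, keepdims=True) + 1e-12
        # Restricted M-step: only the new document's mixture is updated
        p_z_d = (new_counts[None, :] * post).sum(axis=1)
        p_z_d /= p_z_d.sum() + 1e-12
    return p_z_d
```

Because the trained P(w|z) is frozen, folding in one document is much cheaper than refitting the whole model.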
LDA
Blei et al. (2003) introduced a Bayesian hierarchical model, Latent Dirichlet Allocation (LDA), in which each document has its own topic distribution, drawn from a conjugate Dirichlet prior that remains the same for all documents in a collection.
LDA
In our model, names (authors) and words are not directly related: each topic can generate a set of names and a set of words simultaneously with different probabilities, allowing the model more freedom in parameter estimation.
The generative process draws:
  a multinomial distribution φ_z over words for each topic z,
  a multinomial distribution θ_d over topics for each document d,
  and then, for each token i of document d: a topic z_di from θ_d, a name a_di from the multinomial distribution λ_{z_di}, and a word w_di from φ_{z_di}.
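The generative process above can be sketched as a sampling routine; φ, λ, and the Dirichlet prior α are taken as given, and the function and parameter names are illustrative.

```python
import numpy as np

def generate_document(n_tokens, alpha, phi, lam, rng):
    """Sample one document from the topic-name-word model described above.

    phi: (K, W) word distribution per topic; lam: (K, A) name distribution
    per topic; alpha: (K,) Dirichlet prior on the document's topic mixture.
    Returns (topics, names, words) for the document's tokens.
    """
    K = phi.shape[0]
    theta = rng.dirichlet(alpha)                 # document topic mixture θ_d
    topics = rng.choice(K, size=n_tokens, p=theta)   # z_di ~ θ_d
    names = np.array([rng.choice(lam.shape[1], p=lam[z]) for z in topics])
    words = np.array([rng.choice(phi.shape[1], p=phi[z]) for z in topics])
    return topics, names, words
```

Note that the name and the word of a token share the same topic draw z_di, which is exactly what lets one topic generate both a set of names and a set of words.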
LDA
In the following section, we apply the Gibbs sampling framework to get around the intractability problem of parameter estimation.
Gibbs sampling for the LDA model
Note that in our case, we do not estimate the hyperparameters α, β, and λ. For simplicity and performance, they are fixed at 50/K, 0.01, and 0.1, respectively.
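A standard collapsed Gibbs sampler for the word part of such a model, with the hyperparameters fixed as on this slide (α = 50/K, β = 0.01), might look like the sketch below; the name part would add analogous name-topic counts smoothed by λ = 0.1. This is a generic LDA sampler, not the paper's exact code.

```python
import numpy as np

def gibbs_lda(docs, K, W, n_iters=200, seed=0):
    """Collapsed Gibbs sampling for the word part of the LDA model.

    docs: list of word-id lists. Hyperparameters fixed as on the slide:
    alpha = 50/K, beta = 0.01. Returns the final topic assignments and
    the topic-word count matrix.
    """
    alpha, beta = 50.0 / K, 0.01
    rng = np.random.default_rng(seed)
    n_dz = np.zeros((len(docs), K))      # document-topic counts
    n_zw = np.zeros((K, W))              # topic-word counts
    n_z = np.zeros(K)                    # tokens assigned to each topic
    z_assign = [rng.integers(K, size=len(d)) for d in docs]
    for d, doc in enumerate(docs):       # initialize counts from random z
        for i, w in enumerate(doc):
            z = z_assign[d][i]
            n_dz[d, z] += 1; n_zw[z, w] += 1; n_z[z] += 1
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = z_assign[d][i]       # remove the current assignment
                n_dz[d, z] -= 1; n_zw[z, w] -= 1; n_z[z] -= 1
                # full conditional: P(z|rest) ∝ (n_dz+α)(n_zw+β)/(n_z+Wβ)
                p = (n_dz[d] + alpha) * (n_zw[:, w] + beta) / (n_z + W * beta)
                z = rng.choice(K, p=p / p.sum())
                z_assign[d][i] = z       # resample and restore counts
                n_dz[d, z] += 1; n_zw[z, w] += 1; n_z[z] += 1
    return z_assign, n_zw
```

Because the multinomial parameters are integrated out, only the count arrays are maintained, which is what makes Gibbs sampling tractable where direct parameter estimation is not.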
Clustering
1. Learn a topic-name matrix by PLSA and LDA (feature set).
2. Disambiguate topics with an agglomerative clustering method.
3. Within each similar topic, generate a name-name matrix.
4. Disambiguate people with another agglomerative clustering method.
Levenshtein distance (defined as Le(x, y)) is used as the measurement, from which the similarity between two names x and y is derived.
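Levenshtein distance has the standard dynamic-programming form below. The normalization of Le(x, y) into a [0, 1] similarity is an assumption here, since the slide does not reproduce the paper's exact formula.

```python
def levenshtein(x, y):
    """Le(x, y): minimum number of single-character edits turning x into y."""
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, 1):
        cur = [i]
        for j, cy in enumerate(y, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (cx != cy))) # substitution
        prev = cur
    return prev[-1]

def name_similarity(x, y):
    """One plausible normalization of Le(x, y) into a [0, 1] similarity."""
    if not x and not y:
        return 1.0
    return 1.0 - levenshtein(x, y) / max(len(x), len(y))
```

Identical names get similarity 1.0, and heavily edited names approach 0, which is the behavior the name-name matrix needs.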
Outline
Introduction
Related Work
Method
  Topic-based PLSA
  Topic-based LDA
  Clustering
Experiment
Conclusion
Experiment
Web appearances of person names: 12 person names (187 different people, including SRI employees and professors) are submitted as queries to the Google search engine, and the first 100 pages are retrieved for each query. Furthermore, to eliminate the bias towards longer documents, only the first 200 words of each example are used.
Author appearances in scientific documents: we obtained the 9 most ambiguous author names from the entire data set, each of which has at least 20 name variations. In the worst case (C. Chen), 103 authors share the same name.
Experiment
Evaluation: pair-level pairwise F1 score F1P and cluster-level pairwise F1 score F1C. F1P is the harmonic mean of pairwise precision pp and pairwise recall pr; likewise, F1C is the harmonic mean of cluster precision cp and cluster recall cr.
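The pair-level score F1P can be sketched as follows; the cluster-level F1C, which compares whole clusters rather than item pairs, is analogous. The function and argument names are illustrative.

```python
from itertools import combinations

def pairwise_f1(pred_clusters, true_clusters):
    """F1P: harmonic mean of pairwise precision pp and pairwise recall pr.

    Each argument is a list of clusters (collections of item ids); a pair
    counts as positive when both items land in the same cluster.
    """
    def pairs(clusters):
        return {frozenset(p) for c in clusters
                for p in combinations(sorted(c), 2)}
    pred, true = pairs(pred_clusters), pairs(true_clusters)
    if not pred or not true:
        return 0.0
    pp = len(pred & true) / len(pred)   # pairwise precision
    pr = len(pred & true) / len(true)   # pairwise recall
    if pp + pr == 0:
        return 0.0
    return 2 * pp * pr / (pp + pr)
```

For example, merging {1, 2} and {3} into one predicted cluster is penalized through precision, while splitting a true cluster is penalized through recall.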
Figure: author-topic relationships in the CiteSeer data set, extracted by the topic-based PLSA model.
Experiment
Experiment
Experiment
As a result, we empirically tested our models on the entire CiteSeer data set of more than 750,000 documents.
PLSA yields 418,500 unique authors in 2,570 minutes, while LDA finishes in 4,390 minutes with 418,775 authors (roughly 1 to 3 days).
Outline
Introduction
Related Work
Method
  Topic-based PLSA
  Topic-based LDA
  Clustering
Experiment
Conclusion
Conclusion
We have proposed a novel framework for unsupervised name disambiguation by leveraging graphical Bayesian models and a hierarchical clustering method.
Although our primary focus in this paper is on person name disambiguation, our general approach should be equally applicable to other entity disambiguation domains.
Potential applications include noun phrase disambiguation, e.g., "tiger" as an animal, "tiger" as a golf player, "tiger" the baseball team, "tiger" the operating system, or "tiger" for the new Java version.