Efficient Topic-based Unsupervised Name Disambiguation
Yang Song, Jian Huang, Isaac G. Councill, Jia Li, C. Lee Giles
JCDL 2007
Advisor: Hsin-Hsi Chen
Reporter: Y.H Chang
2008-03-21
2008/03/21 Y.H Chang 2
Outline
Introduction
Related Work
Method
  Topic-based PLSA (Probabilistic Latent Semantic Analysis)
  Topic-based LDA (Latent Dirichlet Allocation)
  Clustering
Experiment
Conclusion
Introduction
Name ambiguity: sharing the same name, misspellings, name abbreviations.
Searching Google for "Yang Song": the first result page shows five different people's home pages.
In this paper, we focus on the problem of disambiguating person names within web pages and scientific documents.
Introduction
Method overview:
1. Learn a topic-name matrix by PLSA and LDA (feature set).
2. Disambiguate topics with an agglomerative clustering method.
3. Within each similar topic, generate a name-name matrix.
4. Disambiguate people with another agglomerative clustering method.
Outline
Introduction
Related Work
Method
  Topic-based PLSA
  Topic-based LDA
  Clustering
Experiment
Conclusion
Related Work
[19] G. S. Mann and D. Yarowsky. Unsupervised personal name disambiguation. 2003. (transitivity problem)
[9] H. Han, H. Zha, and C. L. Giles. Name disambiguation in author citations using a k-way spectral clustering method. 2005. (complexity O(N²))
[12] J. Huang, S. Ertekin, and C. L. Giles. Efficient name disambiguation for large-scale databases. 2006.
[2] I. Bhattacharya and L. Getoor. A latent Dirichlet model for unsupervised entity resolution. 2006.
The aforementioned work mainly tackled the name disambiguation problem using the metadata records of the authors. This paper solves the problem in a novel way, by accounting for the topic distribution of the authors and adopting unsupervised methods.
Outline
Introduction
Related Work
Method
  Topic-based PLSA
  Topic-based LDA
  Clustering
Experiment
Conclusion
Method
Learn a topic-name matrix by PLSA and LDA (feature set), then disambiguate topics with an agglomerative clustering method.
PLSA
From a statistical point of view, Hofmann (1999) presented an alternative to LSA: Probabilistic Latent Semantic Analysis/Indexing (PLSA/PLSI), which discovers sets of latent variables.
The model is described as an aspect model, assuming the existence of hidden factors underlying the co-occurrences among two sets of objects.
PLSA
The goal of model fitting for PLSA is to estimate the parameters P(z), P(a|z), P(z|d), and P(w|z), given a set of observations (d, a, w). The standard way to estimate these probability values is the Expectation-Maximization (EM) algorithm.
Notation: z is a topic (K topics in total), a is a person's name, w is a word, and d is a document.
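The EM loop can be sketched as follows. This is a minimal word-only PLSA over a document-word count matrix, using the conditional formulation P(z|d)P(w|z); the paper's full model additionally estimates P(a|z) for names in the same fashion. The function name `plsa_em` and its interface are illustrative, not from the paper.

```python
import numpy as np

def plsa_em(counts, K, n_iters=50, seed=0):
    """Fit a minimal PLSA model to a document-word count matrix.

    counts: (D, W) array, counts[d, w] = occurrences of word w in doc d.
    Returns P(z|d) of shape (D, K) and P(w|z) of shape (K, W).
    """
    rng = np.random.default_rng(seed)
    D, W = counts.shape
    p_z_d = rng.random((D, K)); p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    p_w_z = rng.random((K, W)); p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    for _ in range(n_iters):
        # E-step: responsibilities P(z|d,w) ∝ P(z|d) P(w|z)
        post = p_z_d[:, :, None] * p_w_z[None, :, :]        # (D, K, W)
        post /= post.sum(axis=1, keepdims=True) + 1e-12
        # M-step: re-estimate parameters from expected counts
        weighted = counts[:, None, :] * post                 # (D, K, W)
        p_w_z = weighted.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + 1e-12
        p_z_d = weighted.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + 1e-12
    return p_z_d, p_w_z
```

Each iteration alternates a closed-form E-step and M-step, so the data log-likelihood is non-decreasing.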
PLSA
PLSA: Predicting New Name Appearances
Additionally, there is no natural way to assign probability to new documents.
Therefore, to predict the topics of new documents (with potentially new names) after training, the estimated P(w|z) parameters are used to estimate P(a|z) for new names a in test document dnew through a “folding-in” process.
Specifically, the E-step is the same as equation (4); however, the M-step maintains the original P(w|z) and only updates P(a|z) as well as P(z|d).
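A minimal sketch of the folding-in step, under the same word-only simplification as before: P(w|z) is held fixed and only the new document's topic mixture is re-estimated (the paper's version also updates P(a|z) for new names). The function name `fold_in` is illustrative.

```python
import numpy as np

def fold_in(new_counts, p_w_z, n_iters=50, seed=0):
    """Estimate P(z|d_new) for an unseen document, keeping P(w|z) fixed.

    new_counts: (W,) word counts of the new document.
    p_w_z: (K, W) trained topic-word distributions (held fixed).
    """
    rng = np.random.default_rng(seed)
    K = p_w_z.shape[0]
    p_z_d = rng.random(K); p_z_d /= p_z_d.sum()
    for _ in range(n_iters):
        # E-step: P(z|d_new, w) ∝ P(z|d_new) P(w|z), same form as in training
        post = p_z_d[:, None] * p_w_z                      # (K, W)
        post /= post.sum(axis=0, keepdims=True) + 1e-12
        # Restricted M-step: only the new document's mixture is updated
        p_z_d = (new_counts[None, :] * post).sum(axis=1)
        p_z_d /= p_z_d.sum() + 1e-12
    return p_z_d
```

Because the trained P(w|z) is frozen, folding in one document is much cheaper than refitting the whole model.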
LDA
Blei et al. (2003) introduced a Bayesian hierarchical model, Latent Dirichlet Allocation (LDA), in which each document has its own topic distribution, drawn from a conjugate Dirichlet prior that remains the same for all documents in a collection.
LDA
In our model, names (authors) and words are not directly related: each topic can generate a set of names and a set of words simultaneously with different probabilities, allowing the model more freedom in parameter estimation.
The generative process draws:
  a multinomial distribution φ_z over words for each topic z,
  a multinomial distribution θ_d over topics for each document d,
  and then, for each token i of document d: a topic z_di from θ_d, a name a_di from the multinomial distribution λ_{z_di}, and a word w_di from φ_{z_di}.
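The generative process above can be sketched as a sampling routine; φ, λ, and the Dirichlet prior α are taken as given, and the function and parameter names are illustrative.

```python
import numpy as np

def generate_document(n_tokens, alpha, phi, lam, rng):
    """Sample one document from the topic-name-word model described above.

    phi: (K, W) word distribution per topic; lam: (K, A) name distribution
    per topic; alpha: (K,) Dirichlet prior on the document's topic mixture.
    Returns (topics, names, words) for the document's tokens.
    """
    K = phi.shape[0]
    theta = rng.dirichlet(alpha)                 # document topic mixture θ_d
    topics = rng.choice(K, size=n_tokens, p=theta)   # z_di ~ θ_d
    names = np.array([rng.choice(lam.shape[1], p=lam[z]) for z in topics])
    words = np.array([rng.choice(phi.shape[1], p=phi[z]) for z in topics])
    return topics, names, words
```

Note that the name and the word of a token share the same topic draw z_di, which is exactly what lets one topic generate both a set of names and a set of words.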
LDA
In the following section, we apply the Gibbs sampling framework to get around the intractability problem of parameter estimation.
Gibbs sampling for the LDA model
Note that in our case, we do not estimate the hyperparameters α, β, and λ. For simplicity and performance, they are fixed at 50/K, 0.01, and 0.1, respectively.
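A standard collapsed Gibbs sampler for the word part of such a model, with the hyperparameters fixed as on this slide (α = 50/K, β = 0.01), might look like the sketch below; the name part would add analogous name-topic counts smoothed by λ = 0.1. This is a generic LDA sampler, not the paper's exact code.

```python
import numpy as np

def gibbs_lda(docs, K, W, n_iters=200, seed=0):
    """Collapsed Gibbs sampling for the word part of the LDA model.

    docs: list of word-id lists. Hyperparameters fixed as on the slide:
    alpha = 50/K, beta = 0.01. Returns the final topic assignments and
    the topic-word count matrix.
    """
    alpha, beta = 50.0 / K, 0.01
    rng = np.random.default_rng(seed)
    n_dz = np.zeros((len(docs), K))      # document-topic counts
    n_zw = np.zeros((K, W))              # topic-word counts
    n_z = np.zeros(K)                    # tokens assigned to each topic
    z_assign = [rng.integers(K, size=len(d)) for d in docs]
    for d, doc in enumerate(docs):       # initialize counts from random z
        for i, w in enumerate(doc):
            z = z_assign[d][i]
            n_dz[d, z] += 1; n_zw[z, w] += 1; n_z[z] += 1
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = z_assign[d][i]       # remove the current assignment
                n_dz[d, z] -= 1; n_zw[z, w] -= 1; n_z[z] -= 1
                # full conditional: P(z|rest) ∝ (n_dz+α)(n_zw+β)/(n_z+Wβ)
                p = (n_dz[d] + alpha) * (n_zw[:, w] + beta) / (n_z + W * beta)
                z = rng.choice(K, p=p / p.sum())
                z_assign[d][i] = z       # resample and restore counts
                n_dz[d, z] += 1; n_zw[z, w] += 1; n_z[z] += 1
    return z_assign, n_zw
```

Because the multinomial parameters are integrated out, only the count arrays are maintained, which is what makes Gibbs sampling tractable where direct parameter estimation is not.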
Clustering
1. Learn a topic-name matrix by PLSA and LDA (feature set).
2. Disambiguate topics with an agglomerative clustering method.
3. Within each similar topic, generate a name-name matrix.
4. Disambiguate people with another agglomerative clustering method.
Levenshtein distance (defined as Le(x, y)) is used as the measurement, from which the similarity between two names x and y is derived.
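Levenshtein distance has the standard dynamic-programming form below. The normalization of Le(x, y) into a [0, 1] similarity is an assumption here, since the slide does not reproduce the paper's exact formula.

```python
def levenshtein(x, y):
    """Le(x, y): minimum number of single-character edits turning x into y."""
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, 1):
        cur = [i]
        for j, cy in enumerate(y, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (cx != cy))) # substitution
        prev = cur
    return prev[-1]

def name_similarity(x, y):
    """One plausible normalization of Le(x, y) into a [0, 1] similarity."""
    if not x and not y:
        return 1.0
    return 1.0 - levenshtein(x, y) / max(len(x), len(y))
```

Identical names get similarity 1.0, and heavily edited names approach 0, which is the behavior the name-name matrix needs.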
Outline
Introduction
Related Work
Method
  Topic-based PLSA
  Topic-based LDA
  Clustering
Experiment
Conclusion
Experiment
Web appearances of person names: 12 person names (187 different people, including SRI employees and professors) are submitted as queries to the Google search engine, and the first 100 pages are retrieved for each query. Furthermore, to eliminate the bias towards longer documents, only the first 200 words of each example are used.
Author appearances in scientific documents: we obtained the 9 most ambiguous author names from the entire data set, each of which has at least 20 name variations. In the worst case (C. Chen), 103 authors share the same name.
Experiment
Evaluation: pair-level pairwise F1 score F1P and cluster-level pairwise F1 score F1C. F1P is the harmonic mean of pairwise precision pp and pairwise recall pr; likewise, F1C is the harmonic mean of cluster precision cp and cluster recall cr.
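The pair-level score F1P can be sketched as follows; the cluster-level F1C, which compares whole clusters rather than item pairs, is analogous. The function and argument names are illustrative.

```python
from itertools import combinations

def pairwise_f1(pred_clusters, true_clusters):
    """F1P: harmonic mean of pairwise precision pp and pairwise recall pr.

    Each argument is a list of clusters (collections of item ids); a pair
    counts as positive when both items land in the same cluster.
    """
    def pairs(clusters):
        return {frozenset(p) for c in clusters
                for p in combinations(sorted(c), 2)}
    pred, true = pairs(pred_clusters), pairs(true_clusters)
    if not pred or not true:
        return 0.0
    pp = len(pred & true) / len(pred)   # pairwise precision
    pr = len(pred & true) / len(true)   # pairwise recall
    if pp + pr == 0:
        return 0.0
    return 2 * pp * pr / (pp + pr)
```

For example, merging {1, 2} and {3} into one predicted cluster is penalized through precision, while splitting a true cluster is penalized through recall.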
Figure: author-topic relationships in the CiteSeer data set, extracted by the topic-based PLSA model.
Experiment
Experiment
Experiment
As a result, we empirically tested our models on the entire CiteSeer data set of more than 750,000 documents.
PLSA yields 418,500 unique authors in 2,570 minutes, while LDA finishes in 4,390 minutes with 418,775 authors (roughly 1 to 3 days).
Outline
Introduction
Related Work
Method
  Topic-based PLSA
  Topic-based LDA
  Clustering
Experiment
Conclusion
Conclusion
We have proposed a novel framework for unsupervised name disambiguation by leveraging graphical Bayesian models and a hierarchical clustering method.
Although our primary focus in this paper is on person name disambiguation, our general approach should be equally applicable to other entity disambiguation domains.
Potential applications include noun phrase disambiguation, e.g., "tiger" as an animal, "tiger" as a golf player, "tiger" the baseball team, "tiger" the operating system, or "tiger" for the new Java version.