namesake

19
Presenter: Kuan-Hua Huo (霍冠樺) Date: 2010/04/13 Lili Jiang1, Jianyong Wang2, Ning An3, Shengyuan Wang2, Jian Zhan2, Lian Li1 1School of Information Science and Engineering, Lanzhou University, Lanzhou, Gansu, China 2Department of Computer Science and Technology, Tsinghua University, Beijing, China 3Oracle Corporation, Nashua, NH, USA [email protected] {jianyong,wwssyy,zhanjian}@tsinghua.edu.cn;[email protected];[email protected] Two Birds with One Stone: A Graph-based Framework for Disambiguating and Tagging People Names in Web Search WWW 2009 MADRID! Poster Sessions: Friday, April 24, 2009 1

Upload: avelin-huo

Post on 04-Jul-2015

821 views

Category:

Technology


0 download

TRANSCRIPT

Page 1: Namesake

Presenter: Kuan-Hua Huo (霍冠樺)

Date: 2010/04/13

Lili Jiang1, Jianyong Wang2, Ning An3, Shengyuan Wang2, Jian Zhan2, Lian Li1

1School of Information Science and Engineering, Lanzhou University, Lanzhou, Gansu, China

2Department of Computer Science and Technology, Tsinghua University, Beijing, China

3Oracle Corporation, Nashua, NH, USA

[email protected]

{jianyong,wwssyy,zhanjian}@tsinghua.edu.cn;[email protected];[email protected]

Two Birds with One Stone: A Graph-based Framework

for Disambiguating and Tagging People Names in

Web Search

WWW 2009 MADRID! Poster Sessions: Friday, April 24, 2009 1

Page 2: Namesake

22

Outline

• INTRODUCTION

• THE FRAMEWORK FOR WEB PEOPLE NAME DISAMBIGUATION AND TAGGING

– Graph Modeling

– Clustering Algorithm

– Tagging a Namesake

• EXPERIMENTAL EVALUATION

• CONCLUSION

Page 3: Namesake

INTRODUCTION (1/2)

• We have observed that people tag information, such as

– locations and

– e-mail addresses,

– appears to be informative,

– and a combination of such tags can almost identify a unique target people.

3

Page 4: Namesake

INTRODUCTION (2/2)

• The contributions of this work include:

– a) A novel weighted graph representation was proposed to model the relationships between tags;

– b) An effective graph clustering algorithm was devised to disambiguate and tag people names;

– c) An extensive performance study was conducted, which shows that the proposed framework achieves much better performance than previous approaches.

4

Page 5: Namesake

THE FRAMEWORK FOR WEB PEOPLE NAME DISAMBIGUATION AND TAGGING

• Graph Modeling

• Clustering Algorithm

• Tagging a Namesake

5

Page 6: Namesake

Graph Modeling (1/6)

• Document corpus D = {d1, d2, . . . , dk} be the top kreturned results from a search engine.

• We extract seven types of tags T = {people name, organization, location, email address, phone number, date, occupation} from chunk windows around the query name in each cleaned document

• A = S7 i=1, is a union of seven tag sets where Ti contains the non-repetitive tags with the type

6

Page 7: Namesake

Graph Modeling (2/6)

• Regarding a people name, the proposed framework views the tag corpus A as an undirected labeled graph G=(V,E),

– The node set V = {v1, v2, . . .} is a set of unique tags.

– Each edge (vi, vj) 2 E represents that the tags cor-responding to viand vj co-occur in a same document in D

7

Page 8: Namesake

Graph Modeling (3/6)

• Bridge-tags: the tags occurring in multiple documents.

• The maximal clique subgraph containing all tags in a single document as a micro-cluster.

8

• Figure 1 shows an example of tag graph.

– Nodes 3, 4 and 5: bridge-tag.

– There are in total four micro-clusters.

Page 9: Namesake

Graph Modeling (4/6)

• Assign different type weights to nodes with different tag types

– Node Weighting

– Edge Weighting

• Intuitively, an ideal tag type should have the following property that each of its tags is uniquely associated with one namesake, while the more the number of distinct tags of one type for each namesake w.r.t. a query name, the less this tag type contributes in name disambiguation. 9

Page 10: Namesake

Graph Modeling (5/6)

• Node Weighting

– The higher the number of unique tags of a certain type, the smaller the weight of each node with this tag type.

– v denotes any tag with the type ti2 T.

– |A| is the number of non-repetitive tags in D

– |Ti| is the number of unique tags of the type “ti”.

10

Page 11: Namesake

Graph Modeling (6/6)

• Edge Weighting

– Assuming the conditional independence between any two unique tags, we have the joint probability for two connected nodes v and w, p(vw) = p(v)p(w).

– Then the edge between v and w, is weighted as

– where n is the number of documents in which tags v and w co-occur, pi(v) is the probability distribution obtained from the frequency of tag v divided by the total number of all tag occurrences in the ith document di. 11

Page 12: Namesake

Clustering Algorithm (1/2)

• We develop a two-step clustering approach for the problem of name disambiguation.

– The first step: To identify the complete set of micro-clusters (i.e., the maximal clique subgraphs).

– The second step: To use a single link clustering method to further cluster the set of micro-clusters generated from the previous step into a final set of macro-clusters.

12

Page 13: Namesake

Clustering Algorithm (2/2)

• Each micro-cluster M is stored as a vector of tags

• The similarity between any two micro-clusters, ConnectivityStrength, is estimated using cosine distance.

• The weight of any tag v in is defined by the following Cohesion formula.

13

Page 14: Namesake

Tagging a Namesake

• Tags with higher Cohesion weight in a cluster are selected to describe the corresponding namesake effectively.

14

Page 15: Namesake

EXPERIMENTAL EVALUATION (1/2)

• The measures used in the experiments are FEB (the Extended B-cubed measure based on Precision and Recall) and FPI (the harmonic mean of Purity and Inverse Purity) [1].

• The threshold for ConnectivityStrength used in clustering was empirically estimated based on the training data.

15

Page 16: Namesake

EXPERIMENTAL EVALUATION (2/2)

• The experimental results show that the approach achieves much better overall performance (measured by FEB and FPI ) in name disambiguation than the state-of-the-art systems.

• In addition, it can output a set of selected tags to describe each cluster (corresponding to a namesake).

16

Page 17: Namesake

CONCLUSION (1/2)

• A novel weighted-graph based framework

• Type weights for tags and a new way (i.e., Cohesion) to measure the importance of a tag in an unsupervised clustering algorithm

17

Page 18: Namesake

CONCLUSION (2/2)

• Experimental results show that the proposed ap-proach outperforms all the solutions in a recent Web people search task.

• In the future, we plan to apply the proposed framework in a real Web people search system.

18

Page 19: Namesake

19