Finding Authoritative People from the Web

Masanori Harada, Shin-ya Sato, Kazuhiro Kazama {harada,sato,kazama}@ingrid.org
NTT Network Innovation Labs.


Contents
• Motivation
  – Why study finding people?
  – Examples
• Approach
  – Extract personal names on the web
  – Find relevant people using a search engine
• Results
  – Performance evaluation
• Summary and future plans

Background
As the web is connected to the real world, we can:
1. Find real-world things by searching the web.
2. Understand the real world by investigating the web (and vice versa).
[Figure: the real world and the web, linked by connections; labels: searching, connections]

Objective
Find authoritative people for all sorts of topics by extending a web search engine.
• Why find people?
  – Once people have been found, many other things (e.g. books) can be retrieved using digital libraries
• What is authoritative?
  – People mentioned in many web pages with regard to a queried topic

Screenshot
[Screenshot of the search interface; labeled areas: Relevant personal names, Relationships, Relevant web pages]

Example (1): subject to people
• “digital libraries” (1007 pages)
  1. Shigeo Sugimoto (Univ. Library Information Science)
  2. Koichi Tabata (Univ. Library Information Science)
  3. Jun Adachi (National Institute of Informatics)
  4. Takeo Yamamoto (National Institute of Informatics)
  5. Hiroyuki Taya (National Diet Library)
• Possible application: book finder
  – Using library catalogs, it could suggest relevant books written by these authoritative people

Example (2): thing to people
• “Spirited Away” (35,936 pages)
  1. Hayao Miyazaki (director)
  2. Bunta Sugawara (voice actor)
  3. Mari Natsuki (voice actress)
  4. Yumi Kimura (singer of theme song)
  5. Joe Hisaishi (composer)
• Possible application: movie recommender
  – Using movie databases, it could suggest movies which share key people for any queried topic

Example (3): person to people
• “Masanori Harada” (205 pages)
  1. Masanori Harada (me)
  2. Shin-ya Sato (co-author of this paper)
  3. Kazuhiro Kazama (co-author of this paper)
  4. Kent Tamura (web search researcher)
  5. Isao Asai (web search researcher)
• Possible application: social networking
  – Unlike social networking services like orkut, there is no need to enter relationships manually

NEXAS
Named entity extraction and association search
  – Associate an entity and a web page by extracting names identifying the entity.
  – Find entities associated with top-ranked web pages retrieved for a query.
[Figure: pages returned by a web search, ordered from more relevant (authoritative) to less relevant, and entities A, B, and C labeled more relevant (authoritative), less relevant, or irrelevant]
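
The two NEXAS steps can be sketched as a lookup from ranked pages to the names extracted from them. This is only an illustration: the function name `associated_entities`, the `page_names` mapping, and the example URLs are assumptions, not part of the original system.

```python
from collections import defaultdict


def associated_entities(ranked_pages, page_names, top_t):
    """Collect names found on the top_t ranked pages.

    ranked_pages: page URLs in search-result order (best first).
    page_names:   hypothetical mapping from page URL to the set of
                  personal names extracted from that page.
    Returns {name: [pages that mention it, in rank order]}.
    """
    found = defaultdict(list)
    for page in ranked_pages[:top_t]:
        for name in sorted(page_names.get(page, ())):
            found[name].append(page)
    return dict(found)


# Made-up example data.
page_names = {
    "http://a.example/1": {"Hayao Miyazaki", "Joe Hisaishi"},
    "http://b.example/2": {"Hayao Miyazaki"},
    "http://c.example/3": {"Ayumi Hamasaki"},
}
ranked = ["http://a.example/1", "http://b.example/2", "http://c.example/3"]
result = associated_entities(ranked, page_names, top_t=2)
for name, pages in result.items():
    print(name, len(pages))
```

With `top_t=2`, the name that appears only on the third-ranked page is never associated with the query.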

Extracting personal names
• Web data
  – 52 million Japanese web pages collected in July 2003
• Japanese personal name extractor
  – Extracts only full names
    • Assumes a full name can identify a person
  – Large name dictionaries enable accurate extraction
    • Precision: 93.5%, Recall: 85.3%

Personal names on the web
• Personal names appear frequently on the web
  – 6.6M unique names extracted from 52M web pages
  – 1/4 of web pages contain full names
  – Celebrities appear more than 10 thousand times
    • singers, actors, sports stars, novelists, politicians, etc.
  – Most names appear only a few times
• Name frequency indicates popularity
  – But the number of pages is easily affected by automatically generated text and spam
  – The number of servers is less affected
  – Authoritative people are mentioned on many servers
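
The difference between counting pages and counting servers can be sketched as follows. Treating the URL hostname as the "server" is an assumption made for this sketch, and the names and URLs are invented.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def name_frequencies(mentions):
    """Count distinct pages and distinct servers per name.

    mentions: iterable of (name, page_url) pairs.
    Returns {name: (number_of_pages, number_of_servers)}.
    The server is approximated by the URL hostname (an assumption).
    """
    pages = defaultdict(set)
    servers = defaultdict(set)
    for name, url in mentions:
        pages[name].add(url)
        servers[name].add(urlsplit(url).hostname)
    return {name: (len(pages[name]), len(servers[name])) for name in pages}


# Made-up data: "Name A" is repeated by one server, "Name B" is
# mentioned once each on two independent servers.
mentions = [
    ("Name A", "http://generated.example/p1"),
    ("Name A", "http://generated.example/p2"),
    ("Name A", "http://generated.example/p3"),
    ("Name B", "http://one.example/"),
    ("Name B", "http://two.example/"),
]
freq = name_frequencies(mentions)
print(freq)
```

Here "Name A" wins on page count but "Name B" wins on server count, which is the effect the slide describes.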

Procedure for finding people
1. Find web pages using a full-text search engine
2. List personal names extracted from the top T relevant and authoritative web pages
3. Calculate relevance scores and output the top k relevant people, who
  – Appear frequently on top-ranked web pages
  – Do not appear frequently on irrelevant web pages

Calculating relevance score
Scoring functions: df, sf, dfidf, and sfidf
  df = document frequency within the top T pages
  sf = server frequency within the top T pages + 0.01 df
  idf = log(N / f_p)
    N = number of pages in the collection
    f_p = document frequency of the person's name in the collection
• sf: alleviates the effects of generated text and spam
• idf: weight that moderates generally famous people
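
A minimal sketch of the four scoring functions follows. The slide defines df, sf, and idf; computing dfidf and sfidf as the products df × idf and sf × idf is an assumption based on their names, as are the natural logarithm, the hostname-as-server approximation, and all function names and data below.

```python
import math
from collections import defaultdict
from urllib.parse import urlsplit


def score_people(top_pages, page_names, collection_df, n_pages):
    """Score every name found on the top T pages.

    top_pages:     URLs of the top T search results.
    page_names:    hypothetical mapping URL -> set of extracted names.
    collection_df: hypothetical mapping name -> document frequency
                   in the whole collection (f_p on the slide).
    n_pages:       N, the number of pages in the collection.
    """
    pages = defaultdict(set)
    servers = defaultdict(set)
    for url in top_pages:
        for name in page_names.get(url, ()):
            pages[name].add(url)
            servers[name].add(urlsplit(url).hostname)
    scores = {}
    for name in pages:
        df = len(pages[name])
        sf = len(servers[name]) + 0.01 * df
        idf = math.log(n_pages / collection_df[name])
        # Assumption: dfidf and sfidf are simple products.
        scores[name] = {"df": df, "sf": sf,
                        "dfidf": df * idf, "sfidf": sf * idf}
    return scores


def top_k(scores, function, k):
    """Names with the k highest scores; ties broken alphabetically."""
    return sorted(scores, key=lambda n: (-scores[n][function], n))[:k]


# Made-up data for illustration only.
page_names = {
    "http://gen.example/1": {"Generated Name"},
    "http://gen.example/2": {"Generated Name"},
    "http://gen.example/3": {"Generated Name"},
    "http://uni.example/lab": {"Topic Expert", "General Celebrity"},
    "http://news.example/item": {"Topic Expert", "General Celebrity"},
}
collection_df = {"Generated Name": 3, "Topic Expert": 10,
                 "General Celebrity": 500}
scores = score_people(list(page_names), page_names, collection_df, 1000)
print(top_k(scores, "df", 1), top_k(scores, "sfidf", 1))
```

In this invented data, df ranks the name repeated by a single server first, while sfidf ranks the topic expert first: the server count discounts the repeated pages and idf discounts the generally famous name.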

Performance evaluation
• Compared 4 scoring functions for varying T
• Precision: average of the scores of the top k people
  – Each person was judged relevant (score: 1) or not (score: 0) by searching for the personal name using Google
• 45 simple topics were used
  – 15 musical instruments: is the person a player or not?
  – 15 sports: is the person a player or not?
  – 15 information technologies: is the person an expert or not?
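
The precision measure can be sketched as the mean of the 0/1 judgments for the top k people. Averaging that value over topics is an assumption about how the per-topic results were combined, and the judgments below are invented.

```python
def precision_at_k(judgments, k):
    """Average of the 0/1 relevance scores of the top k ranked people."""
    top = judgments[:k]
    return sum(top) / len(top) if top else 0.0


def mean_precision(judgments_by_topic, k):
    """Mean precision at k over all topics (assumed aggregation)."""
    values = [precision_at_k(j, k) for j in judgments_by_topic.values()]
    return sum(values) / len(values)


# Made-up judgments: 1 = relevant, 0 = not relevant, in rank order.
judgments_by_topic = {
    "violin": [1, 1, 0, 1, 0],
    "soccer": [1, 0, 0, 0, 1],
}
p2 = mean_precision(judgments_by_topic, 2)
p5 = mean_precision(judgments_by_topic, 5)
print(p2, p5)
```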

Precision of scoring functions
• sf is very effective
• idf is fairly effective
[Figure: precision (0.4 to 1) versus number of people evaluated (1 to 10) for the scoring functions sfidf, sf, dfidf, and df]

Precision with varying T
• More web pages do not necessarily yield better results
[Figure: precision (0.4 to 1) versus number of people evaluated (1 to 10) using sfidf, for T = 100, 200, 500, 1000, 2000, and 5000; legend order on the slide: T=500, T=1000, T=200, T=2000, T=5000, T=100]

Future work
• Apply to other languages, especially English
  – Can we distinguish different people named “John Smith”?
• Find books, companies, shops, etc.
  – By extracting ISBNs, domain names, places, etc.
• Analyze co-occurrences as a social network
Online demo is available at http://valhalla.ingrid.org:28080/ throughout the conference
* Japanese fonts and the Java2 plug-in are required

Precision vs. result size
• Too-specific or too-general topics are difficult.
[Figure: precision of the top 10 people (0 to 1) versus number of web pages retrieved for a topic (10 to 1,000,000), using sfidf with T=1000; series: sports, info. tech., music inst.; labeled points: “databases” and “compiler theory”]

Popular Japanese names
• singers, actors, sports stars, novelists, politicians, etc.

Table: Top 10 most frequent names
name              occupation  no. pages  no. servers
Ayumi Hamasaki    singer         80,024       10,317
Hikaru Utada      singer         64,900        9,620
Hideki Kuriyama   baseball       58,090          317
Osamu Tezuka      comic          49,155        7,960
Aya Matsuura      singer         49,091        5,389
Maki Goto         singer         48,432        4,600
Aya Ueto          singer         41,462        3,346
Mai Kuraki        singer         38,314        5,035
Hidetoshi Nakata  soccer         38,037        3,945
Ryoko Hirosue     actress        36,926        5,846

Related work
• ReferralWeb [Kautz 1997]
  – Finds experts around the user by searching the web
  – Tested only with computer science topics
• Web Question Answering [Kwok 2001][Brill 2001]
  – Retrieves one exact answer to a long, complete natural language question
• Our contributions
  – Observed the distribution of personal names on the web
  – Extended a web search engine so that it accurately finds relevant people for all sorts of queries

Common failures
• Too specific topics
• Too general topics
• Name extraction errors
  – Falsely extracted non-names
  – Missed (not extracted) names
• Historical/fictional characters
• Celebrities
  – Popular names often appear without regard to a specific topic

Numbers
• Number of web pages: 52,302,805
• Number of web servers: 664,139
• Number of pages w/ names: 13,922,012
• Number of name occurrences: 117,091,977
• Number of unique names: 6,161,805
• Total size of web pages: 450GB
• Size of inverted index: 113GB
• Size of dictionaries
  – Family names: 21,141
  – Personal names: 12,130
  – Full names: 19,675
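
A few ratios follow directly from these counts. The short calculation below is only arithmetic on the figures listed on this slide; the variable names are chosen for this sketch.

```python
# Counts copied from the slide.
pages = 52_302_805
servers = 664_139
pages_with_names = 13_922_012
name_occurrences = 117_091_977
unique_names = 6_161_805

# Derived ratios.
share_with_names = pages_with_names / pages
occurrences_per_name = name_occurrences / unique_names
pages_per_server = pages / servers

print(f"pages containing a name: {share_with_names:.1%}")
print(f"occurrences per unique name: {occurrences_per_name:.1f}")
print(f"pages per server: {pages_per_server:.1f}")
```

Roughly a quarter of the pages contain at least one extracted name, and a unique name occurs about 19 times on average.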

Topics for the experiment
• Musical instruments
  – violin, cello, trumpet, clarinet, harp, percussion, synthesizer, ocarina, accordion, contrabass, pipe organ, and marimba
• Sports
  – soccer, baseball, marathon, swimming, rugby, volleyball, basketball, boxing, badminton, ice hockey, speed skating, fencing, lacrosse, pole vault, and discus throw
• Information technology terms
  – databases, java, information retrieval, XML, IPv6, speech recognition, P2P, data mining, machine translation, complexity theory, web search engines, probabilistic reasoning, simulated annealing, compiler theory, and randomized algorithms

Name extraction method
• Procedure
  1. Remove HTML tags.
  2. Using a morphological analyzer, split each sentence into morphemes and assign part-of-speech tags.
  3. Extract <family name><personal name> sequences.
• Performance improved by enriching dictionaries
  – 17k family names + 12k personal names + 2k popular full names: Precision 78.4%, Recall 75.0%
  – 21k family names + 40k personal names + 19k popular full names: Precision 93.5%, Recall 85.3%
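
Steps 1 and 3 of the procedure can be sketched with the Python standard library. Step 2 (morphological analysis) is assumed to be done by an external analyzer; the tag labels `"family"` and `"personal"`, the helper names, and the placeholder name 山田太郎 are assumptions for this sketch, not the authors' implementation.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects the text content of an HTML document (step 1)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags(html):
    """Remove HTML tags and return the remaining text."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


def extract_full_names(tagged):
    """Step 3: join each family-name morpheme that is immediately
    followed by a personal-name morpheme.

    tagged: list of (surface, tag) pairs as a morphological analyzer
            might return them; the tag labels are hypothetical.
    """
    names = []
    for (surface, tag), (nxt, nxt_tag) in zip(tagged, tagged[1:]):
        if tag == "family" and nxt_tag == "personal":
            names.append(surface + nxt)
    return names


text = strip_tags("<p><b>山田太郎</b>が発表した</p>")
# Hand-written stand-in for the analyzer output of step 2.
tagged = [("山田", "family"), ("太郎", "personal"), ("が", "particle"),
          ("発表", "noun"), ("した", "verb")]
names = extract_full_names(tagged)
print(text, names)
```

Because only family-name and personal-name pairs are joined, a family name appearing alone is not extracted, which matches the slide's statement that only full names are extracted.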

The same-name problem
• Not very serious when we query a subject
  – Different people having the same name rarely exist in a specific area
• Japanese family/personal names are diverse
  – Still, quite a few people share the same name
• Solutions (under consideration)
  – Classify web pages by topic
  – Analyze a social network around the name (different people have different friends)
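
The "different people have different friends" idea could be explored by grouping pages that mention the same name according to the other names they share. The slide only lists this as a solution under consideration, so the greedy grouping, the Jaccard similarity, the threshold, and the data below are all assumptions for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_pages(page_cooccurring, threshold):
    """Greedily group pages that mention the same name.

    page_cooccurring: mapping page -> set of other names on that page.
    A page joins the first group whose accumulated names overlap with
    its own by at least `threshold`; otherwise it starts a new group.
    """
    groups = []  # list of (accumulated names, pages)
    for page, others in page_cooccurring.items():
        for names, pages in groups:
            if jaccard(names, others) >= threshold:
                names |= others  # in-place update of the group's set
                pages.append(page)
                break
        else:
            groups.append((set(others), [page]))
    return [pages for _, pages in groups]


# Made-up pages that all mention the same full name.
page_cooccurring = {
    "p1": {"Person A", "Person B"},
    "p2": {"Person A", "Person B", "Person C"},
    "p3": {"Person X", "Person Y"},
}
groups = group_pages(page_cooccurring, threshold=0.3)
print(groups)
```

Pages p1 and p2 share co-occurring names and end up in one group, while p3 has no overlap and is treated as a different person with the same name.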

Popularity and frequency
• Personal name frequencies indicate web users' interest in celebrities.
  – The number of pages is prone to be affected by automatically generated pages and spam.
  – The number of servers is better for finding popular people.

Top 10 by the number of pages
  80,024  Ayumi Hamasaki (singer)
  64,900  Hikaru Utada (singer)
  58,090  Hideki Kuriyama (baseball)
  49,155  Osamu Tezuka (comic artist)
  49,091  Aya Matsuura (singer)
  48,432  Maki Goto (singer)
  41,462  Aya Ueto (singer)
  38,314  Mai Kuraki (singer)
  38,037  Hidetoshi Nakata (soccer)
  36,926  Ryoko Hirosue (actress)

Top 10 by the number of servers
  10,317  Ayumi Hamasaki (singer)
   9,620  Hikaru Utada (singer)
   8,325  Ieyasu Tokugawa (shogun)
   7,960  Osamu Tezuka (comic artist)
   7,731  Hideyoshi Toyotomi (general)
   7,627  Nobunaga Oda (general)
   7,426  Ringo Shiina (singer)
   6,678  Soseki Natsume (novelist)
   6,561  Kenji Miyazawa (poet)
   6,287  Hayao Miyazaki (anime director)