iterative set expansion of named entities using the web richard c. wang and william w. cohen...

21
Iterative Set Expansion of Named Entities using the Web Richard C. Wang and William W. Cohen Language Technologies Institute Carnegie Mellon University Pittsburgh, PA 15213 USA

Post on 20-Dec-2015

214 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Iterative Set Expansion of Named Entities using the Web Richard C. Wang and William W. Cohen Language Technologies Institute Carnegie Mellon University

Iterative Set Expansion of Named Entities using the Web

Richard C. Wang and William W. Cohen

Language Technologies InstituteCarnegie Mellon UniversityPittsburgh, PA 15213 USA

Page 2: Iterative Set Expansion of Named Entities using the Web Richard C. Wang and William W. Cohen Language Technologies Institute Carnegie Mellon University

2 / 21Language Technologies Institute, Carnegie Mellon University

Iterative Set Expansion of Named EntitiesRichard C. Wang

Outline Introduction to Set Expansion

SE System – SEAL Current Issue with SEAL Proposed Solution

Iterative SEAL (iSEAL) Evaluation Setting Experimental Results Conclusion

Page 3: Iterative Set Expansion of Named Entities using the Web Richard C. Wang and William W. Cohen Language Technologies Institute Carnegie Mellon University

3 / 21Language Technologies Institute, Carnegie Mellon University

Iterative Set Expansion of Named EntitiesRichard C. Wang

Set Expansion (SE)

For example, Given a query (seeds):

{ survivor, amazing race }

The answer is: { american idol, big brother, etc. }

A well-known example of a SE system is Google Sets™ http://labs.google.com/sets

Page 4: Iterative Set Expansion of Named Entities using the Web Richard C. Wang and William W. Cohen Language Technologies Institute Carnegie Mellon University

4 / 21Language Technologies Institute, Carnegie Mellon University

Iterative Set Expansion of Named EntitiesRichard C. Wang

SE System: SEAL (Wang & Cohen, ICDM 2007)

Features Independent of human/markup language

Support seeds in English, Chinese, Japanese, Korean, ... Accept documents in HTML, XML, SGML, TeX, WikiML, …

Does not require pre-annotated training data Utilize readily-available corpus: World Wide Web

Based on two research contributions Automatically construct wrappers for extracting cand

idate items Rank candidates using random walk

Try it out for yourself at www.BooWa.com

Page 5: Iterative Set Expansion of Named Entities using the Web Richard C. Wang and William W. Cohen Language Technologies Institute Carnegie Mellon University

5 / 21Language Technologies Institute, Carnegie Mellon University

Iterative Set Expansion of Named EntitiesRichard C. Wang

SEAL’s PipelineCanonNikonOlympus

PentaxSonyKodakMinoltaPanasonicCasioLeicaFujiSamsung…

Fetcher: Download web pages containing all seeds Extractor: Construct wrappers for extracting candidate items Ranker: Rank candidate items using Random Walk

Page 6: Iterative Set Expansion of Named Entities using the Web Richard C. Wang and William W. Cohen Language Technologies Institute Carnegie Mellon University

6 / 21Language Technologies Institute, Carnegie Mellon University

Iterative Set Expansion of Named EntitiesRichard C. Wang

contain

extract

extract

contain

How to Build a Graph?

A graph consists of a fixed set of… Node Types: { document, wrapper, item } Labeled Directed Edges: { contain, extract }

Each edge asserts that a binary relation r holds Each edge has an inverse relation r-1 (graph is cyclic)

curryauto.com

Wrapper #3

Wrapper #2

Wrapper #1

Wrapper #4

“honda”26.1%

“acura”34.6%

“chevrolet”22.5%

“bmw”8.4%

“volvo”8.4%

northpointcars.com

Page 7: Iterative Set Expansion of Named Entities using the Web Richard C. Wang and William W. Cohen Language Technologies Institute Carnegie Mellon University

7 / 21Language Technologies Institute, Carnegie Mellon University

Iterative Set Expansion of Named EntitiesRichard C. Wang

Limitation of SEAL

Performance drops significantly when given more than 5 seeds The Fetcher downloads web pages that contain all seeds However, not many pages has more than 5 seeds

Preliminary Study on Seed Sizes

75%

76%

77%

78%

79%

80%

81%

82%

83%

84%

85%

2 3 4 5 6# Seeds (Seed Size)

Me

an

Av

era

ge

Pre

cis

ion

RW

Preliminary Study on Seed Sizes

75%

76%

77%

78%

79%

80%

81%

82%

83%

84%

85%

2 3 4 5 6# Seeds (Seed Size)

Me

an

Av

era

ge

Pre

cis

ion

RW

PRBS

WL

Evaluated using Mean Average Precision on 36 datasets

For each dataset, we randomly pick n seeds (and repeat 3 times)

Page 8: Iterative Set Expansion of Named Entities using the Web Richard C. Wang and William W. Cohen Language Technologies Institute Carnegie Mellon University

8 / 21Language Technologies Institute, Carnegie Mellon University

Iterative Set Expansion of Named EntitiesRichard C. Wang

Motivation

1. Can SEAL be made to handle many seeds?

2. Can SEAL bootstrap given only a few seeds?

3. How well does SEAL’s ranker perform?

Page 9: Iterative Set Expansion of Named Entities using the Web Richard C. Wang and William W. Cohen Language Technologies Institute Carnegie Mellon University

9 / 21Language Technologies Institute, Carnegie Mellon University

Iterative Set Expansion of Named EntitiesRichard C. Wang

Proposed Solution: Iterative SEAL iSEAL makes several calls to SEAL

In each call (iteration) Expands a few seeds Aggregates statistics

We evaluated iSEAL using…Two iterative processesTwo seeding strategiesFive ranking methods

Page 10: Iterative Set Expansion of Named Entities using the Web Richard C. Wang and William W. Cohen Language Technologies Institute Carnegie Mellon University

10 / 21Language Technologies Institute, Carnegie Mellon University

Iterative Set Expansion of Named EntitiesRichard C. Wang

Iterative Process & Seeding Strategy Iterative Processes

1. Supervised At every iteration, seeds are obtained from a reliable source

(e.g. human)

2. Bootstrapping At every iteration, seeds are selected from candidate items

(except the 1st iteration)

Seeding Strategies1. Fixed Seed Size

Uses 2 seeds at every iteration

2. Increasing Seed Size Starts with 2 seeds, then 3 seeds for next iteration, and fixe

d at 4 seeds afterwards

Preliminary Study on Seed Sizes

75%

76%

77%

78%

79%

80%

81%

82%

83%

84%

85%

2 3 4 5 6# Seeds (Seed Size)

Me

an

Av

era

ge

Pre

cis

ion

RW

PRBS

WL

Page 11: Iterative Set Expansion of Named Entities using the Web Richard C. Wang and William W. Cohen Language Technologies Institute Carnegie Mellon University

11 / 21Language Technologies Institute, Carnegie Mellon University

Iterative Set Expansion of Named EntitiesRichard C. Wang

Ranking Methods1. Random Walk with Restart

H. Tong, C. Faloutsos, and J.-Y. Pan. Fast random walk with restart and its application. In ICDM, 2006.

2. PageRank L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ra

nking: Bringing order to the web. 1998.

3. Bayesian Sets Z. Ghahramani and K. A. Heller. Bayesian sets. In NIPS, 2005.

4. Wrapper Length Weights each item based on the length of common contextual string of th

at item and the seeds

5. Wrapper Frequency Weights each item based on the number of wrappers that extract the item

Page 12: Iterative Set Expansion of Named Entities using the Web Richard C. Wang and William W. Cohen Language Technologies Institute Carnegie Mellon University

12 / 21Language Technologies Institute, Carnegie Mellon University

Iterative Set Expansion of Named EntitiesRichard C. Wang

Evaluation Datasets

Page 13: Iterative Set Expansion of Named Entities using the Web Richard C. Wang and William W. Cohen Language Technologies Institute Carnegie Mellon University

13 / 21Language Technologies Institute, Carnegie Mellon University

Iterative Set Expansion of Named EntitiesRichard C. Wang

Evaluation Metric / Procedure

Evaluation metric: Mean Average Precision Contains recall and precision-oriented aspects Sensitive to the entire ranking

Evaluation procedure: For every combination of iterative process,

seeding strategy, and ranking methods1. Perform 10 iterative expansions for each of the 36 datasets

(and repeat 3 times)

2. At every iteration, compute and report MAP

Page 14: Iterative Set Expansion of Named Entities using the Web Richard C. Wang and William W. Cohen Language Technologies Institute Carnegie Mellon University

14 / 21Language Technologies Institute, Carnegie Mellon University

Iterative Set Expansion of Named EntitiesRichard C. Wang

Fixed Seed Size (Supervised)

Initial Seeds

Page 15: Iterative Set Expansion of Named Entities using the Web Richard C. Wang and William W. Cohen Language Technologies Institute Carnegie Mellon University

15 / 21Language Technologies Institute, Carnegie Mellon University

Iterative Set Expansion of Named EntitiesRichard C. Wang

Fixed Seed Size (Supervised)

89%

90%

91%

92%

93%

94%

95%

96%

97%

98%

1 2 3 4 5 6 7 8 9 10

# Iterations (Cumulative Expansions)

Me

an

Av

era

ge

Pre

cis

ion

RW

PR

BS

WL

WF

Page 16: Iterative Set Expansion of Named Entities using the Web Richard C. Wang and William W. Cohen Language Technologies Institute Carnegie Mellon University

16 / 21Language Technologies Institute, Carnegie Mellon University

Iterative Set Expansion of Named EntitiesRichard C. Wang

Fixed Seed Size (Bootstrap)

Initial Seeds

Page 17: Iterative Set Expansion of Named Entities using the Web Richard C. Wang and William W. Cohen Language Technologies Institute Carnegie Mellon University

17 / 21Language Technologies Institute, Carnegie Mellon University

Iterative Set Expansion of Named EntitiesRichard C. Wang

Fixed Seed Size (Bootstrap)

86%

87%

88%

89%

90%

91%

92%

1 2 3 4 5 6 7 8 9 10

# Iterations (Cumulative Expansions)

Me

an

Av

era

ge

Pre

cis

ion

RW

PR

BS

WL

WF

Page 18: Iterative Set Expansion of Named Entities using the Web Richard C. Wang and William W. Cohen Language Technologies Institute Carnegie Mellon University

18 / 21Language Technologies Institute, Carnegie Mellon University

Iterative Set Expansion of Named EntitiesRichard C. Wang

Increasing Seed Size (Bootstrap)

Initial Seeds

Used Seeds

Page 19: Iterative Set Expansion of Named Entities using the Web Richard C. Wang and William W. Cohen Language Technologies Institute Carnegie Mellon University

19 / 21Language Technologies Institute, Carnegie Mellon University

Iterative Set Expansion of Named EntitiesRichard C. Wang

Increasing Seed Size (Bootstrapping)

89%

90%

91%

92%

93%

94%

1 2 3 4 5 6 7 8 9 10

# Iterations (Cumulative Expansions)

Me

an

Av

era

ge

Pre

cis

ion

RW

PR

BS

WL

WF

Page 20: Iterative Set Expansion of Named Entities using the Web Richard C. Wang and William W. Cohen Language Technologies Institute Carnegie Mellon University

20 / 21Language Technologies Institute, Carnegie Mellon University

Iterative Set Expansion of Named EntitiesRichard C. Wang

Conclusion

1. Can SEAL be made to handle many seeds? Yes, by Fixed Seed Size (Supervised).

2. Can SEAL bootstrap given only a few seeds? Yes, by Increasing Seed Size (Bootstrapping).

3. How well does SEAL’s ranker perform?1. In supervised, RW is comparable to the best (BS)

2. In bootstrapping, RW outperforms others Robust to noisy seeds

Page 21: Iterative Set Expansion of Named Entities using the Web Richard C. Wang and William W. Cohen Language Technologies Institute Carnegie Mellon University

21 / 21Language Technologies Institute, Carnegie Mellon University

Iterative Set Expansion of Named EntitiesRichard C. Wang

The End – Thank You!

Try out Boo!Wa! at www.BooWa.comA SEAL-based list extractor for many languagesSend any feedback to: [email protected]