iterative set expansion of named entities using the web richard c. wang and william w. cohen...
Post on 20-Dec-2015
214 views
TRANSCRIPT
Iterative Set Expansion of Named Entities using the Web
Richard C. Wang and William W. Cohen
Language Technologies InstituteCarnegie Mellon UniversityPittsburgh, PA 15213 USA
2 / 21Language Technologies Institute, Carnegie Mellon University
Iterative Set Expansion of Named EntitiesRichard C. Wang
Outline Introduction to Set Expansion
SE System – SEAL Current Issue with SEAL Proposed Solution
Iterative SEAL (iSEAL) Evaluation Setting Experimental Results Conclusion
3 / 21Language Technologies Institute, Carnegie Mellon University
Iterative Set Expansion of Named EntitiesRichard C. Wang
Set Expansion (SE)
For example, Given a query (seeds):
{ survivor, amazing race }
The answer is: { american idol, big brother, etc. }
A well-known example of a SE system is Google Sets™ http://labs.google.com/sets
4 / 21Language Technologies Institute, Carnegie Mellon University
Iterative Set Expansion of Named EntitiesRichard C. Wang
SE System: SEAL (Wang & Cohen, ICDM 2007)
Features Independent of human/markup language
Support seeds in English, Chinese, Japanese, Korean, ... Accept documents in HTML, XML, SGML, TeX, WikiML, …
Does not require pre-annotated training data Utilize readily-available corpus: World Wide Web
Based on two research contributions Automatically construct wrappers for extracting cand
idate items Rank candidates using random walk
Try it out for yourself at www.BooWa.com
5 / 21Language Technologies Institute, Carnegie Mellon University
Iterative Set Expansion of Named EntitiesRichard C. Wang
SEAL’s PipelineCanonNikonOlympus
PentaxSonyKodakMinoltaPanasonicCasioLeicaFujiSamsung…
Fetcher: Download web pages containing all seeds Extractor: Construct wrappers for extracting candidate items Ranker: Rank candidate items using Random Walk
6 / 21Language Technologies Institute, Carnegie Mellon University
Iterative Set Expansion of Named EntitiesRichard C. Wang
contain
extract
extract
contain
How to Build a Graph?
A graph consists of a fixed set of… Node Types: { document, wrapper, item } Labeled Directed Edges: { contain, extract }
Each edge asserts that a binary relation r holds Each edge has an inverse relation r-1 (graph is cyclic)
curryauto.com
Wrapper #3
Wrapper #2
Wrapper #1
Wrapper #4
“honda”26.1%
“acura”34.6%
“chevrolet”22.5%
“bmw”8.4%
“volvo”8.4%
northpointcars.com
7 / 21Language Technologies Institute, Carnegie Mellon University
Iterative Set Expansion of Named EntitiesRichard C. Wang
Limitation of SEAL
Performance drops significantly when given more than 5 seeds The Fetcher downloads web pages that contain all seeds However, not many pages has more than 5 seeds
Preliminary Study on Seed Sizes
75%
76%
77%
78%
79%
80%
81%
82%
83%
84%
85%
2 3 4 5 6# Seeds (Seed Size)
Me
an
Av
era
ge
Pre
cis
ion
RW
Preliminary Study on Seed Sizes
75%
76%
77%
78%
79%
80%
81%
82%
83%
84%
85%
2 3 4 5 6# Seeds (Seed Size)
Me
an
Av
era
ge
Pre
cis
ion
RW
PRBS
WL
Evaluated using Mean Average Precision on 36 datasets
For each dataset, we randomly pick n seeds (and repeat 3 times)
8 / 21Language Technologies Institute, Carnegie Mellon University
Iterative Set Expansion of Named EntitiesRichard C. Wang
Motivation
1. Can SEAL be made to handle many seeds?
2. Can SEAL bootstrap given only a few seeds?
3. How well does SEAL’s ranker perform?
9 / 21Language Technologies Institute, Carnegie Mellon University
Iterative Set Expansion of Named EntitiesRichard C. Wang
Proposed Solution: Iterative SEAL iSEAL makes several calls to SEAL
In each call (iteration) Expands a few seeds Aggregates statistics
We evaluated iSEAL using…Two iterative processesTwo seeding strategiesFive ranking methods
10 / 21Language Technologies Institute, Carnegie Mellon University
Iterative Set Expansion of Named EntitiesRichard C. Wang
Iterative Process & Seeding Strategy Iterative Processes
1. Supervised At every iteration, seeds are obtained from a reliable source
(e.g. human)
2. Bootstrapping At every iteration, seeds are selected from candidate items
(except the 1st iteration)
Seeding Strategies1. Fixed Seed Size
Uses 2 seeds at every iteration
2. Increasing Seed Size Starts with 2 seeds, then 3 seeds for next iteration, and fixe
d at 4 seeds afterwards
Preliminary Study on Seed Sizes
75%
76%
77%
78%
79%
80%
81%
82%
83%
84%
85%
2 3 4 5 6# Seeds (Seed Size)
Me
an
Av
era
ge
Pre
cis
ion
RW
PRBS
WL
11 / 21Language Technologies Institute, Carnegie Mellon University
Iterative Set Expansion of Named EntitiesRichard C. Wang
Ranking Methods1. Random Walk with Restart
H. Tong, C. Faloutsos, and J.-Y. Pan. Fast random walk with restart and its application. In ICDM, 2006.
2. PageRank L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ra
nking: Bringing order to the web. 1998.
3. Bayesian Sets Z. Ghahramani and K. A. Heller. Bayesian sets. In NIPS, 2005.
4. Wrapper Length Weights each item based on the length of common contextual string of th
at item and the seeds
5. Wrapper Frequency Weights each item based on the number of wrappers that extract the item
12 / 21Language Technologies Institute, Carnegie Mellon University
Iterative Set Expansion of Named EntitiesRichard C. Wang
Evaluation Datasets
13 / 21Language Technologies Institute, Carnegie Mellon University
Iterative Set Expansion of Named EntitiesRichard C. Wang
Evaluation Metric / Procedure
Evaluation metric: Mean Average Precision Contains recall and precision-oriented aspects Sensitive to the entire ranking
Evaluation procedure: For every combination of iterative process,
seeding strategy, and ranking methods1. Perform 10 iterative expansions for each of the 36 datasets
(and repeat 3 times)
2. At every iteration, compute and report MAP
14 / 21Language Technologies Institute, Carnegie Mellon University
Iterative Set Expansion of Named EntitiesRichard C. Wang
Fixed Seed Size (Supervised)
Initial Seeds
15 / 21Language Technologies Institute, Carnegie Mellon University
Iterative Set Expansion of Named EntitiesRichard C. Wang
Fixed Seed Size (Supervised)
89%
90%
91%
92%
93%
94%
95%
96%
97%
98%
1 2 3 4 5 6 7 8 9 10
# Iterations (Cumulative Expansions)
Me
an
Av
era
ge
Pre
cis
ion
RW
PR
BS
WL
WF
16 / 21Language Technologies Institute, Carnegie Mellon University
Iterative Set Expansion of Named EntitiesRichard C. Wang
Fixed Seed Size (Bootstrap)
Initial Seeds
17 / 21Language Technologies Institute, Carnegie Mellon University
Iterative Set Expansion of Named EntitiesRichard C. Wang
Fixed Seed Size (Bootstrap)
86%
87%
88%
89%
90%
91%
92%
1 2 3 4 5 6 7 8 9 10
# Iterations (Cumulative Expansions)
Me
an
Av
era
ge
Pre
cis
ion
RW
PR
BS
WL
WF
18 / 21Language Technologies Institute, Carnegie Mellon University
Iterative Set Expansion of Named EntitiesRichard C. Wang
Increasing Seed Size (Bootstrap)
Initial Seeds
Used Seeds
19 / 21Language Technologies Institute, Carnegie Mellon University
Iterative Set Expansion of Named EntitiesRichard C. Wang
Increasing Seed Size (Bootstrapping)
89%
90%
91%
92%
93%
94%
1 2 3 4 5 6 7 8 9 10
# Iterations (Cumulative Expansions)
Me
an
Av
era
ge
Pre
cis
ion
RW
PR
BS
WL
WF
20 / 21Language Technologies Institute, Carnegie Mellon University
Iterative Set Expansion of Named EntitiesRichard C. Wang
Conclusion
1. Can SEAL be made to handle many seeds? Yes, by Fixed Seed Size (Supervised).
2. Can SEAL bootstrap given only a few seeds? Yes, by Increasing Seed Size (Bootstrapping).
3. How well does SEAL’s ranker perform?1. In supervised, RW is comparable to the best (BS)
2. In bootstrapping, RW outperforms others Robust to noisy seeds
21 / 21Language Technologies Institute, Carnegie Mellon University
Iterative Set Expansion of Named EntitiesRichard C. Wang
The End – Thank You!
Try out Boo!Wa! at www.BooWa.comA SEAL-based list extractor for many languagesSend any feedback to: [email protected]