Bootstrapping Information Extraction from Semi-Structured Web Pages
Andy Carlson (Machine Learning Department, Carnegie Mellon)
Charles Schafer (Google Pittsburgh)
ECML/PKDD 2008
Semi-Structured Web Pages: Vacation Rentals
Semi-Structured Web Pages: Nobel Prize Winners
Semi-Structured Web Pages: Museum Collections
Structured Data
Structured data enables better search interfaces
Supervised Information Extraction
Supervised IE allows a user to annotate pages and train a ‘wrapper’ for the site.
Bootstrapping IE from Semi-Structured Web Pages
Assume that we have wrappers for a number of sites in a domain and thus many records from those sites.
Can we use what we’ve learned to automatically wrap a new site in the same domain?
From unlabeled pages to DOM trees
[Figure: unlabeled pages from the new site, each parsed into a DOM tree with <html>, <body>, <h1>, <h4>, <div>, <img>, and text nodes.]
From DOM trees to template tree
[Figure: tree alignment merges the DOM trees of individual pages, which differ in their <div>, <img>, and text nodes, into a single template tree.]
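A minimal sketch of how two DOM trees might be merged into a template tree. The slides do not spell out the alignment algorithm, so this is a simple stand-in: it aligns the child tag sequences with Python's `difflib.SequenceMatcher` and keeps unmatched children in the template. The `(tag, children)` tuple representation and the name `merge_trees` are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def merge_trees(a, b):
    """Merge two (tag, children) trees that share a root tag into one template."""
    tag_a, kids_a = a
    tag_b, kids_b = b
    if tag_a != tag_b:
        raise ValueError("roots must have the same tag")
    tags_a = [kid[0] for kid in kids_a]
    tags_b = [kid[0] for kid in kids_b]
    merged = []
    matcher = SequenceMatcher(a=tags_a, b=tags_b, autojunk=False)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            # Aligned children: merge their subtrees recursively.
            for kid_a, kid_b in zip(kids_a[i1:i2], kids_b[j1:j2]):
                merged.append(merge_trees(kid_a, kid_b))
        else:
            # Unaligned children appear on only one page; keep them all.
            merged.extend(kids_a[i1:i2])
            merged.extend(kids_b[j1:j2])
    return (tag_a, merged)


# Two toy pages (hypothetical), loosely modeled on the trees in the figure.
page_1 = ("html", [("body", [("h1", []), ("h4", []),
                             ("div", [("text", [])])])])
page_2 = ("html", [("body", [("h1", []), ("h4", []),
                             ("div", [("img", []), ("text", [])]),
                             ("div", [("text", [])])])])

template = merge_trees(page_1, page_2)
```

Here the template keeps the second `<div>` and the `<img>` that appear on only one of the two pages.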
Supervised setting: Labels from user annotations
Learn labels from user annotations
[Figure: the generalized template, with its data fields labeled from user annotations, becomes a generalized extraction template.]
Bootstrapping setting: Labels from classifiers
Label data fields with classifiers
[Figure: the generalized template becomes a generalized extraction template. Example data fields from the new site: values preceded by "Bedrooms:" (2, 3, 5, 4, 2, 1) and city names (Boston, Las Vegas, New York, Miami, Palm Springs, New York).]
Framing the classification problem
[Figure: a field from the new site (Boston, Las Vegas, New York, Miami, Palm Springs, New York) is compared with fields from training Sites A, B, and C, which are labeled either City or Other.]
City fields from the training sites:
- Boston, Houston, Atlanta, Topeka, Philadelphia, New Haven
- Baltimore, San Jose, Topeka, Seattle, Las Vegas, Yorktown
- Atlanta, Las Vegas, Billings, Great Falls, Missoula, Bozeman
Other fields from the training sites:
- Canoe, Grill, DVD Player, Heated Pool, Deck, Gas Grill
- 3, 3, 6, 4, 4, 5
- 1/1/09, 6/9/08, 7/13/08, 7/20/08, 9/13/08, 5/15/08
- 1.5, 2.5, 3, 2.5, 3.5, 2
- 717-0474, 835-7694, 845-0923, 934-9720, 663-1111, 646-0957
- $78, $36, $14, $99, $13, $64
Contexts shown in the figure: "Amenities:", "Bedrooms:" (once as "Bedroom:"), "Description:"
Comparing fields: Feature types
Content:
Tokens
- Split on tokens because many data types have a characteristic vocabulary, but token order is not important.
Character 3-grams
- Useful for matching “fulltime” and “full-time”
Token types (all digits, all caps, etc.)
- Helpful for addresses, unique IDs, and other fields with a mix of token types
Context:
Precontext character 3-grams
- Sites vary their wordings, but often use variants of the same words
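The feature types above could be extracted roughly as follows. This is a sketch, not the authors' code: the function names, the tokenization regex, and the token-type labels are assumptions chosen for illustration.

```python
import re
from collections import Counter


def tokens(value):
    """Lowercased word tokens; order is discarded later by counting."""
    return re.findall(r"\w+", value.lower())


def char_trigrams(value):
    """Overlapping character 3-grams of the lowercased string."""
    s = value.lower()
    return [s[i:i + 3] for i in range(len(s) - 2)]


def token_type(tok):
    """Coarse shape of one whitespace-delimited token (labels are hypothetical)."""
    if tok.isdigit():
        return "ALL_DIGITS"
    if tok.isalpha() and tok.isupper():
        return "ALL_CAPS"
    if tok.isalpha() and tok[0].isupper():
        return "CAPITALIZED"
    if tok.isalpha():
        return "LOWERCASE"
    return "MIXED"


def token_types(value):
    return [token_type(tok) for tok in value.split()]


def field_distribution(values, extractor):
    """Count features over every value of one data field."""
    counts = Counter()
    for value in values:
        counts.update(extractor(value))
    return counts


# "fulltime" and "full-time" share no token but do share 3-grams.
shared = set(char_trigrams("fulltime")) & set(char_trigrams("full-time"))
```

The same `char_trigrams` function can be applied to the text preceding a field (for example "Bedrooms:" and "Bedroom:") to obtain precontext features.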
Naïve classification attempt
Logistic Regression:
• Each data field from training sites is a labeled instance for each schema column
• Use features we just described
Problems:
• Tens of training instances
• Tens of thousands of features
• Serious overfitting
Coarser Features: Distributional similarity
Treat each field as a distribution of values
Compute distributional similarity for each feature type:
Smooth and normalize to Skew Similarity
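The formula from this slide did not survive in the transcript, so the following is only one plausible reading. It uses the standard skew divergence, KL(p || αq + (1 − α)p), in which mixing a little of p into q acts as the smoothing, and then rescales the result to a similarity in [0, 1]. The value α = 0.99 and the particular normalization are assumptions, not taken from the talk.

```python
import math
from collections import Counter


def normalize(counts):
    """Turn feature counts for one field into a probability distribution."""
    total = sum(counts.values())
    return {key: value / total for key, value in counts.items()}


def skew_divergence(p, q, alpha=0.99):
    """KL(p || alpha*q + (1-alpha)*p); finite even when q lacks a feature of p."""
    divergence = 0.0
    for key, p_k in p.items():
        mixed = alpha * q.get(key, 0.0) + (1.0 - alpha) * p_k
        divergence += p_k * math.log(p_k / mixed)
    return divergence


def skew_similarity(p, q, alpha=0.99):
    """Assumed normalization: 1 for identical fields, 0 for disjoint ones."""
    max_divergence = -math.log(1.0 - alpha)  # value reached with no overlap
    return 1.0 - skew_divergence(p, q, alpha) / max_divergence


cities_new = normalize(Counter({"new": 2, "york": 1}))
cities_train = normalize(Counter({"new": 1, "haven": 1}))
amenities = normalize(Counter({"grill": 2, "deck": 1}))

sim_city = skew_similarity(cities_new, cities_train)
sim_other = skew_similarity(cities_new, amenities)
```

One such similarity would be computed per feature type (tokens, character 3-grams, token types, precontext 3-grams), giving a handful of numbers per pair of fields.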
Smarter classification attempt
Stacked Skews model:
• Each field from each training site is a labeled instance
• Features are distributional similarity for each feature type
• Train linear regression model
• Inspired by database schema matching by [Madhavan et al. 2005]
Now:
• Tens of training instances
• One feature per feature type, just a handful
• Appropriately sized learning problem
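A small sketch of the learning problem described above, assuming four similarity features per field (one per feature type) and a 0/1 target for "belongs to this schema column". The training numbers are made up for illustration, and plain gradient descent stands in for whatever regression solver was actually used.

```python
def predict(w, b, x):
    """Linear score for one field's similarity features."""
    return b + sum(w_j * x_j for w_j, x_j in zip(w, x))


def train_linear_regression(X, y, lr=0.1, epochs=2000):
    """Least-squares fit by batch gradient descent (a stand-in solver)."""
    n, n_features = len(X), len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x_i, y_i in zip(X, y):
            error = predict(w, b, x_i) - y_i
            for j, x_ij in enumerate(x_i):
                grad_w[j] += error * x_ij
            grad_b += error
        w = [w_j - lr * g / n for w_j, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Made-up features: [token, 3-gram, token-type, precontext] similarities.
TRAIN_X = [
    [0.9, 0.8, 0.9, 0.7],  # field that matches the column
    [0.8, 0.9, 0.8, 0.9],  # field that matches the column
    [0.1, 0.2, 0.6, 0.1],  # other field
    [0.2, 0.1, 0.5, 0.2],  # other field
    [0.0, 0.1, 0.3, 0.0],  # other field
]
TRAIN_Y = [1.0, 1.0, 0.0, 0.0, 0.0]

w, b = train_linear_regression(TRAIN_X, TRAIN_Y)

# Hypothetical candidate fields from a new site; label the best-scoring one.
candidates = {
    "field_1": [0.85, 0.85, 0.9, 0.8],
    "field_2": [0.1, 0.15, 0.4, 0.1],
}
best_field = max(candidates, key=lambda name: predict(w, b, candidates[name]))
```

With only a few weights to fit, tens of training instances are enough, which is the point of replacing the raw features with per-type similarities.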
Related work
Unsupervised wrapper induction typically doesn’t label data fields
- e.g. [Chang & Kuo, 2004] [Zhai & Liu, 2005]
DeLa system of [Wang & Lochovsky, 2003]
- Heuristic rule-based mapping of fields to labels
- Requires explicit prompts of extracted fields
[Golgher et al., 2001]
- Finds exact matches of data values and looks for consistent context
Evaluation: Vacation rentals
Schema: Title, Bedrooms, Bathrooms, Sleeps, Property Type, Description, Address
Evaluation: Job listings
Schema: Title, Company, Location, Date Posted, Job Type, ID
Results
Accuracy by schema column
• Significantly outperforms logistic regression baseline.
• With a small, fixed investment of human effort, we can create wrappers for hundreds of sites in a domain.
Thank You
Results by Schema Column
Results by Web Site
Feature Type Ablation Study Results