Bootstrapping Information Extraction from Semi-Structured Web Pages Andy Carlson (Machine Learning Department, Carnegie Mellon) Charles Schafer (Google Pittsburgh) ECML/PKDD 2008


Page 1:

Bootstrapping Information Extraction from Semi-Structured Web Pages

Andy Carlson (Machine Learning Department, Carnegie Mellon)

Charles Schafer (Google Pittsburgh)

ECML/PKDD 2008

Page 2:

Semi-Structured Web Pages: Vacation Rentals

Page 3:

Semi-Structured Web Pages: Nobel Prize Winners

Page 4:

Semi-Structured Web Pages: Museum Collections

Page 5:

Structured Data

Page 6:

Structured data enables better search interfaces

Page 7:

Supervised Information Extraction

Supervised IE allows a user to annotate pages and train a ‘wrapper’ for the site.

Page 8:

Bootstrapping IE from Semi-Structured Web Pages

Assume that we have wrappers for a number of sites in a domain and thus many records from those sites.

Can we use what we’ve learned to automatically wrap a new site in the same domain?

Page 9:

From unlabeled pages to DOM trees

[Figure: each unlabeled page from the new site is parsed into a DOM tree of <html>, <body>, <h1>, <h4>, <div>, and <img> elements with text nodes.]
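As a rough illustration of this step, the sketch below turns one page into a nested tree using Python's standard `html.parser` module. The `Node` class, the `DomBuilder` class, the `parse_dom` helper, and the sample page are hypothetical names and data chosen for this example; the slides do not say which parser the authors used.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag (a small, non-exhaustive set).
VOID_TAGS = {"img", "br", "hr", "input", "meta", "link"}


class Node:
    """A DOM node: either an element tag or a '#text' leaf."""

    def __init__(self, tag, text=None):
        self.tag = tag
        self.text = text
        self.children = []

    def outline(self):
        """Return a compact bracketed outline of this subtree."""
        if self.tag == "#text":
            return "text"
        inner = " ".join(child.outline() for child in self.children)
        return f"<{self.tag}>" + (f"[{inner}]" if inner else "")


class DomBuilder(HTMLParser):
    """Builds a Node tree by tracking the stack of open elements."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the matching open element, tolerating sloppy markup.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if data.strip():
            self.stack[-1].children.append(Node("#text", data.strip()))


def parse_dom(html):
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


page = ("<html><body><h1>Beach House</h1><h4>Miami</h4>"
        "<div><img src='a.jpg'><div>Sleeps 6</div></div></body></html>")
tree = parse_dom(page)
print(tree.children[0].outline())
```

Each text leaf in such a tree is a candidate data field; aligning the trees of many pages from one site is what lets the same field be found on every page.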

Page 10:

From DOM trees to template tree

[Figure: tree alignment merges the DOM trees of the individual pages into a single template tree.]

Page 11:

Supervised setting: Labels from user annotations

Learn labels from user annotations

[Figure: a generalized template and the generalized extraction template derived from it.]

Page 12:

Bootstrapping setting: Labels from classifiers

Label data fields with classifiers

[Figure: a generalized template and the generalized extraction template derived from it. Example data fields: one with the values Boston, Las Vegas, New York, Miami, Palm Springs, New York; one with the context "Bedrooms:" and the values 2, 3, 5, 4, 2, 1.]

Page 13:

Framing the classification problem

[Figure: data fields from the training sites (Site A, Site B, Site C), each labeled City or Other for the City schema column. City-like fields hold values such as Boston, Houston, Atlanta, Topeka, Philadelphia, New Haven. Other fields hold amenities (Canoe, Grill, DVD Player, Heated Pool, Deck, Gas Grill), numeric values (3, 1.5, 2.5), dates (1/1/09, 6/9/08), phone numbers (717-0474), and prices ($78, $36), with contexts such as "Amenities:", "Bedrooms:", and "Description:".]

Page 14:

Comparing fields: Feature types

Content:

Tokens
- Split values into tokens, because many data types have some characteristic vocabulary but token order is not important.

Character 3-grams
- Useful for matching “fulltime” and “full-time”

Token types (all digits, all caps, etc.)
- Helpful for addresses, unique IDs, other fields with a mix of token types

Context:

Precontext character 3-grams
- Sites vary their wordings, but often use variants of the same words
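The four feature types above can be sketched as simple counting functions. This is a minimal illustration, not the authors' code: the function names, the token-type categories beyond "all digits" and "all caps", and the sample address field are assumptions made for the example.

```python
import re
from collections import Counter


def char_trigrams(value):
    """Overlapping character 3-grams of the lowercased string."""
    s = value.lower()
    return [s[i:i + 3] for i in range(len(s) - 2)]


def token_type(token):
    """Coarse shape of a token (category names are illustrative)."""
    if token.isdigit():
        return "ALL_DIGITS"
    if token.isalpha() and token.isupper():
        return "ALL_CAPS"
    if token.isalpha() and token[0].isupper():
        return "CAPITALIZED"
    if token.isalpha():
        return "LOWERCASE"
    return "MIXED"


def field_features(values, precontexts):
    """Return one Counter per feature type for a single data field."""
    feats = {"tokens": Counter(), "char3": Counter(),
             "types": Counter(), "pre3": Counter()}
    for value in values:
        raw_tokens = re.findall(r"\w+", value)
        feats["tokens"].update(t.lower() for t in raw_tokens)  # order is discarded
        feats["char3"].update(char_trigrams(value))
        feats["types"].update(token_type(t) for t in raw_tokens)
    for context in precontexts:
        feats["pre3"].update(char_trigrams(context))
    return feats


# "fulltime" and "full-time" share no token but do share several 3-grams.
shared = set(char_trigrams("fulltime")) & set(char_trigrams("full-time"))
print(sorted(shared))

feats = field_features(["123 Main St", "45 OAK Ave"], ["Address:", "Address:"])
print(feats["types"])
```

Keeping each feature type in its own `Counter` matters for the later slides, where every feature type is reduced to a single similarity score.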

Page 15:

Naïve classification attempt

Logistic Regression:
• Each data field from training sites is a labeled instance for each schema column
• Use features we just described

Problems:
• Tens of training instances
• Tens of thousands of features
• Serious overfitting

Page 16:

Coarser Features: Distributional similarity

Treat each field as a distribution of values

Compute distributional similarity for each feature type:

Smooth and normalize to Skew Similarity
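The formula on this slide did not survive in the transcript, so the sketch below is an assumption rather than the authors' definition. It uses the skew divergence, KL(p || αq + (1-α)p), which mixes a little of p into q so the divergence stays finite when q lacks some of p's values. Because the mixture is never smaller than (1-α)p, the divergence is bounded by -log(1-α); dividing by that bound and subtracting from 1 gives a score in [0, 1]. That normalization, the value α = 0.99, and the function names are all choices made for this example.

```python
import math
from collections import Counter


def distribution(counts):
    """Normalize a Counter of feature counts into a probability distribution."""
    total = sum(counts.values())
    return {key: count / total for key, count in counts.items()}


def skew_divergence(p, q, alpha=0.99):
    """KL(p || alpha*q + (1-alpha)*p); finite even where q has zero mass."""
    divergence = 0.0
    for key, p_k in p.items():
        mixed = alpha * q.get(key, 0.0) + (1 - alpha) * p_k
        divergence += p_k * math.log(p_k / mixed)
    return divergence


def skew_similarity(p, q, alpha=0.99):
    """Rescale to [0, 1]: about 1 for identical inputs, about 0 for disjoint ones."""
    max_divergence = -math.log(1 - alpha)
    return 1.0 - skew_divergence(p, q, alpha) / max_divergence


# Token distributions for three illustrative fields.
cities_a = distribution(Counter(["boston", "miami", "new", "york", "las", "vegas"]))
cities_b = distribution(Counter(["boston", "houston", "atlanta", "las", "vegas"]))
prices = distribution(Counter(["$78", "$36", "$14"]))

print(skew_similarity(cities_a, cities_b))  # partial overlap: strictly between 0 and 1
print(skew_similarity(cities_a, prices))    # no overlap: close to 0
```

Computed once per feature type, a score like this collapses tens of thousands of sparse features into a handful of dense ones.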


Page 17:

Smarter classification attempt

Stacked Skews model:
• Each field from each training site is a labeled instance
• Features are distributional similarity for each feature type
• Train linear regression model
• Inspired by database schema matching by [Madhavan et al. 2005]

Now:
• Tens of training instances
• One feature per feature type – just a handful
• Appropriately sized learning problem
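A minimal sketch of the stacked setup, under stated assumptions: each training row holds one similarity score per feature type (tokens, character 3-grams, token types, precontext 3-grams) and the target is 1 when the field belongs to the schema column and 0 otherwise. The numbers are invented for illustration, the function names are hypothetical, and batch gradient descent stands in for whatever least-squares solver the authors used.

```python
def predict(weights, x):
    """Linear score: bias plus weighted sum of the similarity features."""
    return weights[0] + sum(w * x_j for w, x_j in zip(weights[1:], x))


def fit_linear_regression(rows, targets, lr=0.1, epochs=2000):
    """Least-squares fit by batch gradient descent; returns [bias, w1, ..., wn]."""
    n_features = len(rows[0])
    weights = [0.0] * (n_features + 1)
    for _ in range(epochs):
        grads = [0.0] * (n_features + 1)
        for x, y in zip(rows, targets):
            error = predict(weights, x) - y
            grads[0] += error
            for j, x_j in enumerate(x):
                grads[j + 1] += error * x_j
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grads)]
    return weights


# Invented similarity scores: [tokens, char 3-grams, token types, precontext 3-grams].
train_x = [
    [0.9, 0.8, 0.9, 0.7],  # field matches the schema column
    [0.7, 0.9, 0.8, 0.9],
    [0.8, 0.7, 0.9, 0.8],
    [0.1, 0.2, 0.6, 0.1],  # field does not match
    [0.0, 0.1, 0.3, 0.2],
    [0.2, 0.1, 0.5, 0.0],
]
train_y = [1, 1, 1, 0, 0, 0]

weights = fit_linear_regression(train_x, train_y)

# Score two unseen fields from a new site against the same schema column.
score_match = predict(weights, [0.85, 0.8, 0.85, 0.8])
score_other = predict(weights, [0.1, 0.1, 0.4, 0.1])
print(score_match, score_other)
```

With only five parameters to fit, tens of labeled fields are enough, which is the contrast the slide draws with the logistic regression attempt.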

Page 18:

Related work

Unsupervised wrapper induction typically doesn’t label data fields

- e.g. [Chang & Kuo, 2004] [Zhai & Liu, 2005]

DeLa system of [Wang & Lochovsky, 2003]

- Heuristic rule-based mapping of fields to labels

- Requires explicit prompts of extracted fields

[Golgher et al, 2001]

- Finds exact matches of data values and looks for consistent context


Page 19:

Evaluation: Vacation rentals

Schema: Title, Bedrooms, Bathrooms, Sleeps, Property Type, Description, Address

Page 20:

Evaluation: Job listings

Schema: Title, Company, Location, Date Posted, Job Type, ID

Page 21:

Results

Accuracy by schema column

• Significantly outperforms logistic regression baseline.
• With a small, fixed investment of human effort, we can create wrappers for hundreds of sites in a domain.

Page 22:

Thank You

Page 23:

Results by Schema Column

Page 24:

Results by Web Site

Page 25:

Feature Type Ablation Study Results