Bootstrapping Information Extraction from Semi-Structured Web Pages

Andy Carlson (Machine Learning Department, Carnegie Mellon)

Charles Schafer (Google Pittsburgh)

ECML/PKDD 2008

Semi-Structured Web Pages: Vacation Rentals

Semi-Structured Web Pages: Nobel Prize Winners

Semi-Structured Web Pages: Museum Collections

Structured Data

Structured data enables better search interfaces

Supervised Information Extraction

Supervised IE allows a user to annotate pages and train a ‘wrapper’ for the site.

Bootstrapping IE from Semi-Structured Web Pages

Assume that we have wrappers for a number of sites in a domain and thus many records from those sites.

Can we use what we’ve learned to automatically wrap a new site in the same domain?

From unlabeled pages to DOM trees

[Figure: each unlabeled page from the new site is parsed into a DOM tree; the example trees contain <html>, <body>, <h1>, <h4>, <div>, <img>, and text nodes.]
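The step from a raw page to a DOM tree can be sketched with Python's standard `html.parser` module. This is a minimal illustration, not the parser used in the talk: the `Node` class, `DomBuilder`, `parse_dom`, `outline`, and the sample page are all hypothetical names and data introduced here.

```python
from html.parser import HTMLParser

# Tags without a closing tag; they are added as leaves and never pushed on the stack.
VOID_TAGS = {"img", "br", "hr", "input", "meta", "link"}


class Node:
    """One DOM node: an element (tag name) or a text node (tag '#text')."""

    def __init__(self, tag, text=""):
        self.tag = tag
        self.text = text
        self.children = []


class DomBuilder(HTMLParser):
    """Builds a Node tree while the parser streams through the page."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.stack = [self.root]  # path from the root to the currently open element

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest matching open element, tolerating sloppy markup.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.stack[-1].children.append(Node("#text", text))


def parse_dom(html):
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def outline(node, depth=0):
    """Return the tree as indented lines, one per node."""
    label = node.text if node.tag == "#text" else f"<{node.tag}>"
    lines = ["  " * depth + label]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


# Hypothetical page, loosely modeled on a vacation rental listing.
page = "<html><body><h1>Cabin</h1><div><img src='a.png'>Sleeps 4</div></body></html>"
print("\n".join(outline(parse_dom(page))))
```

Aligning several such trees into one template tree is a separate step and is not attempted in this sketch.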

From DOM trees to template tree

[Figure: the DOM trees of the individual pages are combined by tree alignment into a single template tree for the site.]

Supervised setting: Labels from user annotations

Learn labels from user annotations

[Figure: a generalized template, with labels learned from user annotations, becomes a generalized extraction template.]

Bootstrapping setting: Labels from classifiers

Label data fields with classifiers

[Figure: a generalized template, with data fields labeled by classifiers, becomes a generalized extraction template.]

Example data fields from the new site:

- Precontext “Bedrooms:”, values 2, 3, 5, 4, 2, 1

- Values Boston, Las Vegas, New York, Miami, Palm Springs, New York

Framing the classification problem

[Figure: a field from the new site is compared with fields from training Sites A, B, and C; for the City column, each training field is labeled City or Other.]

Field to classify: Boston, Las Vegas, New York, Miami, Palm Springs, New York

Training Sites (example field values shown):

- Boston, Houston, Atlanta, Topeka, Philadelphia, New Haven

- Baltimore, San Jose, Topeka, Seattle, Las Vegas, Yorktown

- Atlanta, Las Vegas, Billings, Great Falls, Missoula, Bozeman

- Canoe, Grill, DVD Player, Heated Pool, Deck, Gas Grill

- 3, 3, 6, 4, 4, 5

- 1/1/09, 6/9/08, 7/13/08, 7/20/08, 9/13/08, 5/15/08

- 1.5, 2.5, 3, 2.5, 3.5, 2

- 717-0474, 835-7694, 845-0923, 934-9720, 663-1111, 646-0957

- $78, $36, $14, $99, $13, $64

Precontexts shown: “Amenities:”, “Bedrooms:” (once as “Bedroom:”), “Description:”

Comparing fields: Feature types

Content:

Tokens

- Split on tokens because many data types have a recurring vocabulary, but token order is not important.

Character 3-grams

- Useful for matching “fulltime” and “full-time”

Token types (all digits, all caps, etc.)

- Helpful for addresses, unique IDs, and other fields with a mix of token types

Context:

Precontext character 3-grams

- Sites vary their wording, but often use variants of the same words
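The four feature types above can be sketched as count vectors over a field's values and precontexts. This is an illustrative sketch only: the function names, the token-type categories beyond "all digits" and "all caps", and the lowercasing are assumptions, not details taken from the talk.

```python
from collections import Counter


def tokens(value):
    """Whitespace tokens, lowercased; order is deliberately discarded later."""
    return value.lower().split()


def char_ngrams(text, n=3):
    """All overlapping character n-grams of the lowercased text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def token_type(token):
    """Coarse shape of a token (category names are hypothetical)."""
    if token.isdigit():
        return "ALL_DIGITS"
    if token.isalpha():
        if token.isupper():
            return "ALL_CAPS"
        if token[0].isupper():
            return "CAPITALIZED"
        return "ALPHA"
    if any(ch.isdigit() for ch in token):
        return "HAS_DIGIT"
    return "OTHER"


def field_features(values, precontexts):
    """One Counter per feature type for a single data field."""
    feats = {
        "tokens": Counter(),
        "char3": Counter(),
        "token_types": Counter(),
        "precontext_char3": Counter(),
    }
    for value in values:
        feats["tokens"].update(tokens(value))
        feats["char3"].update(char_ngrams(value))
        feats["token_types"].update(token_type(t) for t in value.split())
    for pre in precontexts:
        feats["precontext_char3"].update(char_ngrams(pre))
    return feats


# "fulltime" and "full-time" share several character 3-grams.
shared = set(char_ngrams("fulltime")) & set(char_ngrams("full-time"))
print(sorted(shared))
```

Keeping each feature type in its own counter lets every type be compared separately across sites.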

Naïve classification attempt

Logistic Regression:

• Each data field from training sites is a labeled instance for each schema column

• Use features we just described

Problems:

• Tens of training instances

• Tens of thousands of features

• Serious overfitting

Coarser Features: Distributional similarity

Treat each field as a distribution of values

Compute distributional similarity for each feature type:

Smooth and normalize to Skew Similarity
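As a rough illustration of comparing two fields as distributions, the sketch below uses the skew divergence as commonly defined in the distributional-similarity literature, KL(r || αq + (1−α)r), which smooths q with r so the divergence stays finite. The exact formula and normalization used in the talk are not shown here, so the value α = 0.99 and the rescaling to [0, 1] in `skew_similarity` are assumptions made for this example.

```python
import math
from collections import Counter


def normalize(counts):
    """Turn raw counts into a probability distribution."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()} if total else {}


def skew_divergence(q, r, alpha=0.99):
    """KL(r || alpha*q + (1-alpha)*r); finite even when q lacks some of r's keys."""
    div = 0.0
    for key, r_k in r.items():
        if r_k > 0:
            mix = alpha * q.get(key, 0.0) + (1 - alpha) * r_k
            div += r_k * math.log(r_k / mix)
    return div


def skew_similarity(q, r, alpha=0.99):
    """Hypothetical normalization: 1 for identical distributions, 0 for disjoint ones.

    The divergence is at most log(1 / (1 - alpha)), reached when q and r share no keys.
    """
    return 1.0 - skew_divergence(q, r, alpha) / math.log(1.0 / (1.0 - alpha))


# Token distributions built from example values on the slides.
new_field = normalize(Counter("boston las vegas new york miami palm springs new york".split()))
site_cities = normalize(Counter("boston houston atlanta topeka philadelphia new haven".split()))
site_amenities = normalize(Counter("canoe grill dvd player heated pool deck gas grill".split()))

print(skew_similarity(site_cities, new_field) > skew_similarity(site_amenities, new_field))
```

One such similarity would be computed per feature type (tokens, character 3-grams, token types, precontext 3-grams).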

Smarter classification attempt

Stacked Skews model:

• Each field from each training site is a labeled instance

• Features are distributional similarity for each feature type

• Train linear regression model

• Inspired by database schema matching by [Madhavan et al., 2005]

Now:

• Tens of training instances

• One feature per feature type – just a handful

• Appropriately sized learning problem
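The shape of this learning problem, a few dozen instances with one similarity score per feature type, can be sketched as a small linear regression. Everything below is illustrative: the training rows are made-up numbers, the feature order is an assumption, and plain gradient descent is used only to keep the example dependency-free, not because the talk specifies how the model was fit.

```python
def predict(weights, bias, x):
    """Linear score for one instance."""
    return bias + sum(w * xi for w, xi in zip(weights, x))


def fit_linear(X, y, lr=0.5, epochs=2000):
    """Least-squares linear regression fit by batch gradient descent."""
    n, n_feat = len(X), len(X[0])
    weights, bias = [0.0] * n_feat, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_feat, 0.0
        for x, target in zip(X, y):
            err = predict(weights, bias, x) - target
            grad_b += err
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
        bias -= lr * grad_b / n
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
    return weights, bias


# Hypothetical training data for one schema column (say, City).
# Columns: [token sim, char 3-gram sim, token-type sim, precontext 3-gram sim]
# Label: 1 if the two compared fields belong to the same schema column, else 0.
X = [
    [0.8, 0.9, 0.9, 0.7],
    [0.7, 0.8, 1.0, 0.9],
    [0.1, 0.2, 0.3, 0.0],
    [0.0, 0.1, 0.2, 0.1],
    [0.2, 0.3, 0.9, 0.1],
    [0.6, 0.7, 0.9, 0.8],
]
y = [1, 1, 0, 0, 0, 1]

weights, bias = fit_linear(X, y)
score_match = predict(weights, bias, [0.75, 0.85, 0.95, 0.8])
score_other = predict(weights, bias, [0.05, 0.15, 0.25, 0.05])
print(score_match > score_other)
```

With only four inputs, the model has far fewer parameters than a classifier over tens of thousands of raw token and 3-gram features.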

Related work

Unsupervised wrapper induction typically doesn’t label data fields

- e.g. [Chang & Kuo, 2004] [Zhai & Liu, 2005]

DeLa system of [Wang & Lochovsky, 2003]

- Heuristic rule-based mapping of fields to labels

- Requires explicit prompts of extracted fields

[Golgher et al., 2001]

- Finds exact matches of data values and looks for consistent context

Evaluation: Vacation rentals

Schema: Title, Bedrooms, Bathrooms, Sleeps, Property Type, Description, Address

Evaluation: Job listings

Schema: Title, Company, Location, Date Posted, Job Type, ID

Results

Accuracy by schema column

• Significantly outperforms logistic regression baseline.

• With a small, fixed investment of human effort, we can create wrappers for hundreds of sites in a domain.

Thank You

Results by Schema Column

Results by Web Site

Feature Type Ablation Study Results
