Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure

Page 1: Automatically Extracting  Ontologically Specified Data from HTML Tables with Unknown Structure

ER 2002, BYU Data Extraction Group

Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure

David W. Embley, Cui Tao, Stephen W. Liddle

Brigham Young University

Funded by NSF

Page 2: Automatically Extracting  Ontologically Specified Data from HTML Tables with Unknown Structure

Information Exchange

[Diagram: Source and Target, with Information Extraction and Schema Matching; callouts read "Leverage this …" and "… to do this"]

Page 3: Automatically Extracting  Ontologically Specified Data from HTML Tables with Unknown Structure

Information Extraction

Page 4: Automatically Extracting  Ontologically Specified Data from HTML Tables with Unknown Structure

Extracting Pertinent Information from Documents

Page 5: Automatically Extracting  Ontologically Specified Data from HTML Tables with Unknown Structure

A Conceptual-Modeling Solution

[Diagram: conceptual model centered on Car. Car has Year, Make, Model, Mileage, Price and Feature; PhoneNr is for Car; PhoneNr has Extension. Participation constraints shown on the relationships: 0..1, 0..*, 1..*]

Page 6: Automatically Extracting  Ontologically Specified Data from HTML Tables with Unknown Structure

Car-Ads Ontology

Car [->object];
Car [0..1] has Year [1..*];
Car [0..1] has Make [1..*];
Car [0..1] has Model [1..*];
Car [0..1] has Mileage [1..*];
Car [0..*] has Feature [1..*];
Car [0..1] has Price [1..*];
PhoneNr [1..*] is for Car [0..*];
PhoneNr [0..1] has Extension [1..*];
Year matches [4]
  constant { extract "\d{2}"; context "([^\$\d]|^)[4-9]\d[^\d]"; substitute "^" -> "19"; },
  …
…
End;
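The Year rule can be read as a two-step match: find a two-digit number in a plausible context, then expand it to a four-digit year. The following is a minimal Python sketch of that reading. The function name `extract_two_digit_years` is hypothetical, and treating `substitute "^" -> "19"` as "prepend 19 to the extracted digits" is an assumption about the ontology language, not something the slide states.

```python
import re

# Patterns copied from the Year rule on the slide.
CONTEXT = re.compile(r"([^\$\d]|^)[4-9]\d[^\d]")  # two digits, not part of a price or a longer number
EXTRACT = re.compile(r"\d{2}")                    # the two digits themselves


def extract_two_digit_years(text):
    """Return four-digit years for every two-digit year found in context.

    Assumption: substitute "^" -> "19" means "insert 19 at the start
    of the extracted value".
    """
    years = []
    for ctx in CONTEXT.finditer(text):
        digits = EXTRACT.search(ctx.group(0))
        if digits:
            years.append("19" + digits.group(0))
    return years
```

Under this reading, a string such as "Subaru 89 SW" would yield a year, while "$45 off" and "1994 Honda" would not, because the context pattern rejects digits preceded by "$" or by another digit. The slide elides the rule's other constants, which presumably cover cases like four-digit years.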

Page 7: Automatically Extracting  Ontologically Specified Data from HTML Tables with Unknown Structure

Recognition and Extraction

Car   Year  Make    Model      Mileage  Price  PhoneNr
0001  1989  Subaru  SW                  $1900  (336)835-8597
0002  1998          Elantra                    (336)526-5444
0003  1994  HONDA   ACCORD EX  100K            (336)526-1081

Car   Feature
0001  Auto
0001  AC
0002  Black
0002  4 door
0002  tinted windows
0002  Auto
0002  pb
0002  ps
0002  cruise
0002  am/fm
0002  cassette stereo
0002  a/c
0003  Auto
0003  jade green
0003  gold

Page 8: Automatically Extracting  Ontologically Specified Data from HTML Tables with Unknown Structure

Schema Matching for HTML Tables with Unknown Structure

Page 9: Automatically Extracting  Ontologically Specified Data from HTML Tables with Unknown Structure

Table-Schema Matching (Basic Idea)

• Many Tables on the Web
• Ontology-Based Extraction
  – Works well for unstructured or semistructured data
  – What about structured data – tables?
• Method
  – Form attribute-value pairs
  – Do extraction
  – Infer mappings from extraction patterns

Page 10: Automatically Extracting  Ontologically Specified Data from HTML Tables with Unknown Structure

Problem: Different Schemas

Target Database Schema
{Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}

Different Source Table Schemas
– {Run #, Yr, Make, Model, Tran, Color, Dr}
– {Make, Model, Year, Colour, Price, Auto, Air Cond., AM/FM, CD}
– {Vehicle, Distance, Price, Mileage}
– {Year, Make, Model, Trim, Invoice/Retail, Engine, Fuel Economy}

Page 11: Automatically Extracting  Ontologically Specified Data from HTML Tables with Unknown Structure

Problem: Attribute is Value

Page 12: Automatically Extracting  Ontologically Specified Data from HTML Tables with Unknown Structure

Problem: Attribute-Value is Value

Page 13: Automatically Extracting  Ontologically Specified Data from HTML Tables with Unknown Structure

Problem: Value is not Value

Page 14: Automatically Extracting  Ontologically Specified Data from HTML Tables with Unknown Structure

Problem: Implied Values

Page 15: Automatically Extracting  Ontologically Specified Data from HTML Tables with Unknown Structure

Problem: Missing Attributes

Page 16: Automatically Extracting  Ontologically Specified Data from HTML Tables with Unknown Structure

Problem: Compound Attributes

Page 17: Automatically Extracting  Ontologically Specified Data from HTML Tables with Unknown Structure

Problem: Factored Values

Page 18: Automatically Extracting  Ontologically Specified Data from HTML Tables with Unknown Structure

Problem: Split Values

Page 19: Automatically Extracting  Ontologically Specified Data from HTML Tables with Unknown Structure

Problem: Merged Values

Page 20: Automatically Extracting  Ontologically Specified Data from HTML Tables with Unknown Structure

Problem: Values not of Interest

Page 21: Automatically Extracting  Ontologically Specified Data from HTML Tables with Unknown Structure

Problem: Information Behind Links

[Figure callouts: "Single-Column Table (formatted as list)"; "Table extending over several pages"]

Page 22: Automatically Extracting  Ontologically Specified Data from HTML Tables with Unknown Structure

Solution

• Form attribute-value pairs (adjust if necessary)

• Do extraction

• Infer mappings from extraction patterns
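The three steps can be sketched end to end in Python: treat every cell as an attribute-value pair, run recognizers over the values, and map each target attribute to the source column whose values were recognized most often. This is only an illustration of the idea. The `RECOGNIZERS` patterns are simplistic stand-ins (not the ontology's actual rules), and `infer_mappings` and the sample column names are hypothetical.

```python
import re
from collections import Counter

# Hypothetical, deliberately simple recognizers for two target attributes.
RECOGNIZERS = {
    "Year": re.compile(r"^(19|20)\d{2}$"),
    "Price": re.compile(r"^\$\d[\d,]*$"),
}


def infer_mappings(rows):
    """Map each target attribute to the source column it was recognized in most often."""
    hits = {target: Counter() for target in RECOGNIZERS}
    for row in rows:
        # Step 1: each (column, value) cell is an attribute-value pair.
        for column, value in row.items():
            # Step 2: do extraction by testing the value against each recognizer.
            for target, pattern in RECOGNIZERS.items():
                if pattern.match(value):
                    hits[target][column] += 1
    # Step 3: infer the mapping from where the extraction succeeded.
    return {t: c.most_common(1)[0][0] for t, c in hits.items() if c}


rows = [
    {"Yr": "1995", "Cost": "$6300", "Dr": "4"},
    {"Yr": "1998", "Cost": "$1900", "Dr": "2"},
]
mapping = infer_mappings(rows)
```

Because the decision rests on the values rather than the header text, a column labelled "Yr" can still be matched to the target attribute Year.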

Page 23: Automatically Extracting  Ontologically Specified Data from HTML Tables with Unknown Structure

Solution: Remove Internal Factoring

Discover Nesting: Make, (Model, (Year, Colour, Price, Auto, Air Cond, AM/FM, CD)*)*

Unnest: μ_{(Model, Year, Colour, Price, Auto, Air Cond, AM/FM, CD)*} μ_{(Year, Colour, Price, Auto, Air Cond, AM/FM, CD)*} Table

[Figure: source table; visible cell values include ACURA and Legend]
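Unnesting a factored table amounts to copying each factored Make and Model value down onto every row it governs. A minimal Python sketch, assuming the nesting has already been discovered and is held as nested lists; the function name `unnest` and the ACURA sample values are illustrative, not taken from the slide's table.

```python
def unnest(nested):
    """Flatten Make -> Model -> rows into one flat record per row."""
    flat = []
    for make, models in nested:
        for model, rows in models:
            for row in rows:
                # Repeat the factored Make and Model on every row.
                flat.append({"Make": make, "Model": model, **row})
    return flat


nested = [
    ("Honda", [("Civic EX", [{"Year": "1995", "Colour": "White", "Price": "$6300"}])]),
    # Hypothetical rows for illustration only.
    ("ACURA", [("Legend", [{"Year": "1992"}, {"Year": "1994"}])]),
]
flat = unnest(nested)
```

Each flat record then carries its own Make and Model, so later steps can treat every row independently.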

Page 24: Automatically Extracting  Ontologically Specified Data from HTML Tables with Unknown Structure

Solution: Replace Boolean Values

β^{Yes,}_{Auto} β^{Yes,}_{Air Cond} β^{Yes,}_{AM/FM} β^{Yes,}_{CD} Table

[Figure: source table after replacement; cells in the Auto, Air Cond., AM/FM and CD columns now show the column name, alongside the ACURA and Legend cells]
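One plausible reading of β, consistent with the pairs <Auto, Auto> and <CD, > formed on the next slide, is that a "Yes" cell is replaced by its column name and any other cell by the empty string. The sketch below assumes exactly that; `replace_boolean` is a hypothetical name, and the treatment of non-"Yes" values is an assumption.

```python
def replace_boolean(rows, column, true_marker="Yes"):
    """Replace true_marker in `column` with the column name, anything else with ""."""
    out = []
    for row in rows:
        new_row = dict(row)  # leave the input rows untouched
        new_row[column] = column if row.get(column) == true_marker else ""
        out.append(new_row)
    return out


rows = [{"Model": "Civic EX", "Auto": "Yes", "CD": "No"}]
for col in ("Auto", "CD"):
    rows = replace_boolean(rows, col)
```

After this step the cell content itself ("Auto") is something a Feature recognizer can pick up, which a bare "Yes" is not.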

Page 25: Automatically Extracting  Ontologically Specified Data from HTML Tables with Unknown Structure

Solution: Form Attribute-Value Pairs

[Figure: source table; cells shown include ACURA, Legend, Auto, Air Cond., AM/FM, CD]

<Make, Honda>, <Model, Civic EX>, <Year, 1995>, <Colour, White>, <Price, $6300>, <Auto, Auto>, <Air Cond., Air Cond.>, <AM/FM, AM/FM>, <CD, >

Page 26: Automatically Extracting  Ontologically Specified Data from HTML Tables with Unknown Structure

Solution: Adjust Attribute-Value Pairs

[Figure: source table; cells shown include ACURA, Legend, Auto, Air Cond., AM/FM, CD]

<Make, Honda>, <Model, Civic EX>, <Year, 1995>, <Colour, White>, <Price, $6300>, <Auto>, <Air Cond>, <AM/FM>
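Forming and adjusting pairs can be sketched as two small functions over one table row. The adjustment rules used here (drop a pair whose value is empty, collapse a pair whose value equals its attribute) are inferred from the before and after examples on these two slides; the names `form_pairs` and `adjust_pairs` are hypothetical.

```python
def form_pairs(row):
    """Turn one table row into a list of (attribute, value) pairs."""
    return [(attr, value) for attr, value in row.items()]


def adjust_pairs(pairs):
    """Drop empty pairs and collapse <A, A> to <A>."""
    adjusted = []
    for attr, value in pairs:
        if value == "":
            continue  # e.g. <CD, > carries no information
        if value == attr:
            adjusted.append((attr,))  # e.g. <Auto, Auto> becomes <Auto>
        else:
            adjusted.append((attr, value))
    return adjusted


row = {
    "Make": "Honda", "Model": "Civic EX", "Year": "1995", "Colour": "White",
    "Price": "$6300", "Auto": "Auto", "Air Cond.": "Air Cond.", "AM/FM": "AM/FM", "CD": "",
}
pairs = adjust_pairs(form_pairs(row))
```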

Page 27: Automatically Extracting  Ontologically Specified Data from HTML Tables with Unknown Structure

Solution: Do Extraction

[Figure: source table; cells shown include ACURA, Legend, Auto, Air Cond., AM/FM, CD]

Page 28: Automatically Extracting  Ontologically Specified Data from HTML Tables with Unknown Structure

Solution: Infer Mappings

[Figure: source table; cells shown include ACURA, Legend, Auto, Air Cond., AM/FM, CD]

{Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}

Each row is a car.
π_{Model} μ_{(Year, Colour, Price, Auto, Air Cond, AM/FM, CD)*} Table
π_{Make} μ_{(Model, Year, Colour, Price, Auto, Air Cond, AM/FM, CD)*} μ_{(Year, Colour, Price, Auto, Air Cond, AM/FM, CD)*} Table
π_{Year} Table

Note: Mappings produce sets for attributes. Joining to form records is trivial because we have OIDs for table rows (e.g. for each Car).
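The note about OIDs can be illustrated with a short Python sketch: each mapping yields a set of (OID, value) pairs for one target attribute, and records are rebuilt by grouping on the OID. The dictionary-of-rows representation and the names `project` and `join_on_oid` are assumptions made for the example.

```python
def project(rows, column):
    """Projection that keeps the row OID: a set of (oid, value) pairs."""
    return {(oid, row[column]) for oid, row in rows.items() if row.get(column)}


def join_on_oid(mapped):
    """Rebuild one record per OID from per-attribute sets of (oid, value)."""
    records = {}
    for target_attr, pairs in mapped.items():
        for oid, value in pairs:
            records.setdefault(oid, {})[target_attr] = value
    return records


# Rows keyed by a per-row OID; the second row is hypothetical sample data.
rows = {
    1: {"Make": "Honda", "Model": "Civic EX", "Year": "1995"},
    2: {"Make": "ACURA", "Model": "Legend", "Year": ""},
}
mapped = {attr: project(rows, attr) for attr in ("Make", "Model", "Year")}
records = join_on_oid(mapped)
```

A row with a missing value simply ends up with a record that lacks that attribute, which matches the optional participation constraints in the target ontology.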

Page 29: Automatically Extracting  Ontologically Specified Data from HTML Tables with Unknown Structure

Solution: Do Extraction

[Figure: source table; cells shown include ACURA, Legend, Auto, Air Cond., AM/FM, CD]

{Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}

π_{Model} μ_{(Year, Colour, Price, Auto, Air Cond, AM/FM, CD)*} Table

Page 30: Automatically Extracting  Ontologically Specified Data from HTML Tables with Unknown Structure

Solution: Do Extraction

[Figure: source table; cells shown include ACURA, Legend, Auto, Air Cond., AM/FM, CD]

{Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}

π_{Price} Table

Page 31: Automatically Extracting  Ontologically Specified Data from HTML Tables with Unknown Structure

Solution: Do Extraction

[Figure: source table; cells shown include ACURA, Legend, Auto, Air Cond., AM/FM, CD]

{Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}

ρ_{Colour←Feature} π_{Colour} Table
∪ ρ_{Auto←Feature} π_{Auto} β^{Yes,}_{Auto} Table
∪ ρ_{Air Cond.←Feature} π_{Air Cond.} β^{Yes,}_{Air Cond.} Table
∪ ρ_{AM/FM←Feature} π_{AM/FM} β^{Yes,}_{AM/FM} Table
∪ ρ_{CD←Feature} π_{CD} β^{Yes,}_{CD} Table
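The Feature mapping is a union of several projected and renamed columns. A rough Python equivalent, assuming the Boolean columns have already been rewritten so that a true cell holds its column name and a false cell is empty; `feature_union` and the sample row are hypothetical.

```python
def feature_union(rows, feature_columns):
    """Union of the projections of each feature column, renamed to Feature.

    Returns a set of (oid, feature_value) pairs; empty cells contribute nothing.
    """
    features = set()
    for oid, row in rows.items():
        for column in feature_columns:
            value = row.get(column, "")
            if value:
                features.add((oid, value))
    return features


rows = {
    1: {"Colour": "White", "Auto": "Auto", "Air Cond.": "Air Cond.", "AM/FM": "AM/FM", "CD": ""},
}
features = feature_union(rows, ["Colour", "Auto", "Air Cond.", "AM/FM", "CD"])
```

Using a set mirrors the relational union: the renaming to Feature is implicit in dropping the source column name from each pair.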

Page 32: Automatically Extracting  Ontologically Specified Data from HTML Tables with Unknown Structure

Experiment

• Tables from 60 sites
• 10 “training” tables
• 50 test tables
• 357 mappings (from all 60 sites)
  – 172 direct mappings (same attribute and meaning)
  – 185 indirect mappings (29 attribute synonyms, 5 “Yes/No” columns, 68 unions over columns for Feature, 19 factored values, and 89 columns of merged values that needed to be split)

Page 33: Automatically Extracting  Ontologically Specified Data from HTML Tables with Unknown Structure

Results

• 10 “training” tables
  – 100% of the 57 mappings (no false mappings)
  – 94.6% of the values in linked pages (5.4% false declarations)
• 50 test tables
  – 94.7% of the 300 mappings (no false mappings)
  – On the basis of sampling 3,000 values in linked pages, we obtained 97% recall and 86% precision
• 16 missed mappings
  – 4 partial (not all unions included)
  – 6 non-U.S. car-ads (unrecognized makes and models)
  – 2 U.S. unrecognized makes and models
  – 3 prices (missing $ or found MSRP instead)
  – 1 mileage (mileages less than 1,000)
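The reported figures are internally consistent, which a few lines of Python arithmetic can confirm. The variable names are illustrative; all numbers come from the Experiment and Results slides.

```python
training_mappings = 57
test_mappings = 300
total_mappings = training_mappings + test_mappings  # should equal the 357 reported

# Breakdown of missed mappings on the test tables.
missed = 4 + 6 + 2 + 3 + 1
found = test_mappings - missed

# No false mappings were reported, so every found mapping is correct.
recall_pct = round(100 * found / test_mappings, 1)
```

With 16 of 300 mappings missed, 284 were found, which gives the 94.7% stated for the test tables.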

Page 34: Automatically Extracting  Ontologically Specified Data from HTML Tables with Unknown Structure

Conclusions

• Summary
  – Transformed schema-matching problem to extraction
  – Inferred semantic mappings
  – Discovered source-to-target mapping rules
• Evidence of Success
  – Tables (mappings): 95% (Recall); 100% (Precision)
  – Linked Text (value extraction): ~97% (Recall); ~86% (Precision)
• Future Work
  – Discover and exploit structure in linked text
  – Broaden table understanding
  – Integrate with current extraction tools

www.deg.byu.edu