Scheme Matching and Data Extraction over HTML Tables

Cui Tao, June 2002. Supported by NSF.


Page 1: Scheme Matching and Data Extraction over HTML Tables

Scheme Matching and Data Extraction over HTML Tables

Cui Tao

June 2002

supported by NSF

Page 2: Scheme Matching and Data Extraction over HTML Tables

Introduction

Many tables on the Web

Ontology-based extraction:

Works for unstructured or semi-structured data

Does not work well for structured data such as tables

Consider only tables that present information, not tables used for layout

Page 3: Scheme Matching and Data Extraction over HTML Tables

Problems: different schemas

Different source table schemas:

{Run #, Yr, Make, Model, Tran, Color, Dr}

{Make, Model, Year, Colour, Price, Auto, Air Cond., AM/FM, CD}

{Vehicle, Distance, Price, Mileage}

{Year, Make, Model, Trim, Invoice/Retail, Engine, Fuel Economy}

Target database schema:

{Car, Year, Make, Model, Mileage, Price, PhoneNr}, {Car, Feature}
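To make the mismatch concrete, the following minimal sketch compares each source schema from the slide with the target attributes using exact name equality only. The function name `exact_name_matches` is hypothetical and the comparison is deliberately naive; it is not the method the presentation proposes, only an illustration of why name matching alone falls short.

```python
# Source and target schemas copied from the slide.
SOURCE_SCHEMAS = [
    ["Run #", "Yr", "Make", "Model", "Tran", "Color", "Dr"],
    ["Make", "Model", "Year", "Colour", "Price", "Auto", "Air Cond.", "AM/FM", "CD"],
    ["Vehicle", "Distance", "Price", "Mileage"],
    ["Year", "Make", "Model", "Trim", "Invoice/Retail", "Engine", "Fuel Economy"],
]
TARGET_ATTRIBUTES = {
    "Car", "Year", "Make", "Model", "Mileage", "Price", "PhoneNr", "Feature",
}


def exact_name_matches(schema, target):
    """Return the source attributes whose names also appear in the target.

    Hypothetical helper: exact string equality, no synonyms or abbreviations.
    """
    return [name for name in schema if name in target]


matches = [exact_name_matches(schema, TARGET_ATTRIBUTES) for schema in SOURCE_SCHEMAS]
for schema, found in zip(SOURCE_SCHEMAS, matches):
    print(f"{len(found)} of {len(schema)} attributes match by name: {found}")
```

In the first schema, for instance, `Yr` is not recognized as `Year`, so only `Make` and `Model` match.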

Page 4: Scheme Matching and Data Extraction over HTML Tables

Problems: attribute-value pairs

Page 5: Scheme Matching and Data Extraction over HTML Tables

Problems: attribute-value switch

Page 6: Scheme Matching and Data Extraction over HTML Tables

Problems: attribute/value combinations

Year/sty Cyl. # Dr Tran Color

Page 7: Scheme Matching and Data Extraction over HTML Tables

Problems: attribute/value split

Model

Page 8: Scheme Matching and Data Extraction over HTML Tables

Problems

Information in linked pages: tables, lists, unstructured data, …

Header information

Page 9: Scheme Matching and Data Extraction over HTML Tables

Thesis Statement

Diagram labels: Extraction Ontology; HTML Table with Unknown Structure; Mapping Rules; Extracted Data

Page 10: Scheme Matching and Data Extraction over HTML Tables

Methods

• Understand Table
• Recognize Attributes and Values
• Form Attribute-Value Pairs
• Adjust Attribute-Value Pairs
• Add Information Hidden Behind Links
• Infer Mapping
• Extract Data

Understand Table: recognize the table and its elements

<TABLE>, </TABLE>: table boundaries; <TR>: row; <TD>: data entry; <TH>: header.
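The tag roles listed above can be sketched with Python's standard `html.parser` module. This is only an illustration under simple assumptions (one table, no nesting, every cell explicitly closed); the class name `TableParser`, the cell dictionary layout, and the sample markup are hypothetical and not taken from the presentation.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect table rows as lists of cells, noting which cells are headers."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # row currently being read, if any
        self._cell = None   # cell currently being read, if any

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lowercase, so <TR> arrives as "tr".
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = {"header": tag == "th", "text": ""}

    def handle_data(self, data):
        if self._cell is not None:
            self._cell["text"] += data

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._cell["text"] = self._cell["text"].strip()
            self._row.append(self._cell)
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def parse_table(html):
    parser = TableParser()
    parser.feed(html)
    return parser.rows


SAMPLE = (
    "<TABLE>"
    "<TR><TH>Make</TH><TH>Model</TH></TR>"
    "<TR><TD>Honda</TD><TD>Civic EX</TD></TR>"
    "</TABLE>"
)
rows = parse_table(SAMPLE)
for row in rows:
    print([(cell["header"], cell["text"]) for cell in row])
```

Real Web tables often omit closing tags or nest tables for layout, which this sketch does not handle.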

Page 11: Scheme Matching and Data Extraction over HTML Tables

Methods

Form Attribute-Value Pairs

Regular table

Nrcom = most common number of columns in the table

Table with factors
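For the regular-table case, Nrcom can be computed as the mode of the per-row cell counts. The sketch below then pairs header cells with data cells. The transcript does not say how Nrcom is used, so two points here are assumptions: rows whose length differs from Nrcom are skipped, and the first remaining row is treated as the header. The function names and the sample rows are made up for illustration.

```python
from collections import Counter


def most_common_column_count(rows):
    """Return the column count shared by the largest number of rows (Nrcom)."""
    if not rows:
        raise ValueError("table has no rows")
    counts = Counter(len(row) for row in rows)
    return counts.most_common(1)[0][0]


def form_pairs(rows):
    """Pair each header cell with the data cell in the same column.

    Assumptions: rows that do not have Nrcom cells are ignored, and the
    first row that does is the header row.
    """
    nrcom = most_common_column_count(rows)
    regular = [row for row in rows if len(row) == nrcom]
    header, records = regular[0], regular[1:]
    return [list(zip(header, record)) for record in records]


# Made-up sample: a one-cell caption row followed by a regular table.
rows = [
    ["Used cars"],
    ["Make", "Model", "Year"],
    ["Honda", "Civic EX", "1995"],
    ["Ford", "Taurus", "1998"],
]
nrcom = most_common_column_count(rows)
pairs = form_pairs(rows)
print(nrcom)
print(pairs[0])
```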

Page 12: Scheme Matching and Data Extraction over HTML Tables

Methods

Form Attribute-Value Pairs

Regular table

Table with factors

Table has Boolean values

Replace Boolean values:

Page 13: Scheme Matching and Data Extraction over HTML Tables

Methods

Form Attribute-Value Pairs

Regular table

Table with factors

Table has Boolean values

<Make, Honda>, <Model, Civic EX>, <Year, 1995>, <Colour, White>, <Price, $6300>, <Auto, Auto>, <Air Cond., Air Cond.>, <AM/FM, AM/FM>
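One way to read the pairs above is that a Boolean cell is replaced by the name of its attribute, so a checked Auto column yields <Auto, Auto>. The sketch below implements that reading. The cell strings treated as true or false, the choice to drop false-valued pairs (which would explain why CD does not appear above), and the function name are all assumptions rather than details given in the slides.

```python
# Assumed markers; the slides do not list which cell strings count as Boolean.
TRUE_MARKS = {"yes", "y", "x", "true"}
FALSE_MARKS = {"no", "n", "", "false"}


def replace_boolean_values(pairs):
    """Replace true-valued cells with the attribute name; drop false-valued ones."""
    result = []
    for attribute, value in pairs:
        mark = value.strip().lower()
        if mark in TRUE_MARKS:
            result.append((attribute, attribute))
        elif mark in FALSE_MARKS:
            continue  # assumption: a feature the car lacks yields no pair
        else:
            result.append((attribute, value))
    return result


raw_pairs = [
    ("Make", "Honda"), ("Model", "Civic EX"), ("Year", "1995"),
    ("Colour", "White"), ("Price", "$6300"),
    ("Auto", "Yes"), ("Air Cond.", "Yes"), ("AM/FM", "Yes"), ("CD", "No"),
]
replaced = replace_boolean_values(raw_pairs)
print(replaced)
```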

Page 14: Scheme Matching and Data Extraction over HTML Tables

Methods

Adjust Attribute-Value Pairs

<Make, Honda>, <Model, Civic EX>, <Year, 1995>, <Colour, White>, <Price, $6300>, <Auto>, <Air Cond>, <AM/FM>

<Make, Honda>, <Model, Civic EX>, <Year, 1995>, <Colour, White>, <Price, $6300>, <Auto, Auto>, <Air Cond., Air Cond.>, <AM/FM, AM/FM>

Page 15: Scheme Matching and Data Extraction over HTML Tables

Methods

Add Information Hidden Behind Links

Unstructured and semi-structured: concatenate

Table: attribute-value pairs

<Manufacturer, Honda>, <Model, Civic EX>, <Door, 4>, <Year, 1995>, <Color, White>, <Engine, 2.0L 4 Cylinders>, <Transmission, Auto>, <Mileage, 82,628>, <Price, $6300>

Page 16: Scheme Matching and Data Extraction over HTML Tables

Methods

Add Information Hidden Behind Links

Unstructured and semi-structured: concatenate

Table: attribute-value pairs

Page 17: Scheme Matching and Data Extraction over HTML Tables

Methods

Add Information Hidden Behind Links

Unstructured and semi-structured: concatenate

Table: attribute-value pairs

List:

<Features, AIR CONDITIONING, CD, AM/FM, CLOTH UPHOLSTERY, CONSOLE, CRUISE CONTROL, DUAL AIR BAGS, INSIDE HOOD RELEASE, POWER DOOR LOCKS, POWER STEERING, POWER SUNROOF, POWER WINDOWS, RADIAL TIRES, REAR DEFROSTER, REAR SPOILER, RECLINING SEATS>
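The three cases for linked pages can be sketched as a single dispatch on the kind of content found behind the link. Everything structural here is an assumption made for illustration: the `linked` dictionary layout, the function name, and the hypothetical `LinkedText` attribute used to carry concatenated text. The slides only state that unstructured text is concatenated, tables contribute attribute-value pairs, and lists contribute a tuple such as the Features example above. The sample text is made up.

```python
def add_linked_information(pairs, linked):
    """Return a new pair list extended with content from a linked page."""
    kind = linked["kind"]
    if kind == "text":
        # Unstructured or semi-structured: concatenate the page text
        # (whitespace normalized) onto the record.
        text = " ".join(linked["text"].split())
        return pairs + [("LinkedText", text)]
    if kind == "table":
        # Table: its attribute-value pairs join the row's pairs.
        return pairs + list(linked["pairs"])
    if kind == "list":
        # List: one tuple holding the heading followed by every item.
        return pairs + [(linked["heading"], *linked["items"])]
    raise ValueError(f"unknown linked content kind: {kind!r}")


row = [("Make", "Honda"), ("Price", "$6300")]
with_table = add_linked_information(
    row, {"kind": "table", "pairs": [("Door", "4"), ("Mileage", "82,628")]}
)
with_list = add_linked_information(
    row, {"kind": "list", "heading": "Features", "items": ["AIR CONDITIONING", "CD"]}
)
with_text = add_linked_information(
    row, {"kind": "text", "text": "One owner,\n  well maintained."}
)
print(with_table)
print(with_list)
print(with_text)
```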

Page 18: Scheme Matching and Data Extraction over HTML Tables

Method

Inferred Mapping Creation:

{Car, Year, Make, Model, Mileage, Price, PhoneNr}, {Car, Feature}

Each row is a car.
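Since each row is a car, applying a mapping amounts to creating one Car tuple per row and one {Car, Feature} tuple per feature. The transcript does not describe how the mapping itself is inferred, so the sketch below only applies a hand-written, hypothetical mapping to the pairs from the earlier example; the function name, the mapping dictionary, and the use of a row number as the Car identifier are assumptions.

```python
# Attributes of the first target relation, from the slide.
CAR_ATTRIBUTES = ("Year", "Make", "Model", "Mileage", "Price", "PhoneNr")

# Hypothetical mapping from source attributes to target attributes.
# Source attributes not listed here (such as Colour) are left unmapped.
MAPPING = {
    "Make": "Make", "Model": "Model", "Year": "Year", "Price": "Price",
    "Auto": "Feature", "Air Cond.": "Feature", "AM/FM": "Feature",
}


def apply_mapping(car_id, pairs, mapping):
    """Turn one row's attribute-value pairs into target-schema tuples."""
    car = {"Car": car_id, **{name: None for name in CAR_ATTRIBUTES}}
    features = []
    for attribute, value in pairs:
        target = mapping.get(attribute)
        if target == "Feature":
            features.append({"Car": car_id, "Feature": value})
        elif target in CAR_ATTRIBUTES:
            car[target] = value
    return car, features


row_pairs = [
    ("Make", "Honda"), ("Model", "Civic EX"), ("Year", "1995"),
    ("Colour", "White"), ("Price", "$6300"),
    ("Auto", "Auto"), ("Air Cond.", "Air Cond."), ("AM/FM", "AM/FM"),
]
car, features = apply_mapping(1, row_pairs, MAPPING)
print(car)
print(features)
```

Attributes the source table does not supply, such as Mileage and PhoneNr here, stay empty.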

Page 19: Scheme Matching and Data Extraction over HTML Tables

Method

{Car, Year, Make, Model, Mileage, Price, PhoneNr}, {Car, Feature}

Page 20: Scheme Matching and Data Extraction over HTML Tables

Method

{Car, Year, Make, Model, Mileage, Price, PhoneNr}, {Car, Feature}

Page 21: Scheme Matching and Data Extraction over HTML Tables

Method

{Car, Year, Make, Model, Mileage, Price, PhoneNr}, {Car, Feature}

Page 22: Scheme Matching and Data Extraction over HTML Tables

Evaluation

Measure percentage of correct mappings:

Correct mapping

Partially correct mapping

Incorrect mapping

Measure precision and recall:

Data in the table

Data in linked pages

Compare the results for extracted data before mapping and after mapping
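Precision and recall follow their standard definitions: precision is the fraction of extracted items that are correct, and recall is the fraction of expected items that were extracted. The sketch below computes both over sets of attribute-value pairs. The sample sets are made up for illustration and are not results from the presentation.

```python
def precision_recall(extracted, expected):
    """Compute precision and recall of extracted items against expected ones."""
    extracted, expected = set(extracted), set(expected)
    correct = extracted & expected
    precision = len(correct) / len(extracted) if extracted else 0.0
    recall = len(correct) / len(expected) if expected else 0.0
    return precision, recall


# Made-up sample: one wrong Model value and one missed Mileage value.
extracted = {
    ("Make", "Honda"), ("Year", "1995"), ("Price", "$6300"), ("Model", "Civic"),
}
expected = {
    ("Make", "Honda"), ("Year", "1995"), ("Price", "$6300"),
    ("Model", "Civic EX"), ("Mileage", "82,628"),
}
precision, recall = precision_recall(extracted, expected)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Running the same function on the data extracted before and after mapping gives the comparison the slide calls for.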

Page 23: Scheme Matching and Data Extraction over HTML Tables

Contribution

Provides an approach to extract information automatically from HTML tables

Suggests a different way to solve the problem of schema matching