table recognition

Post on 22-Nov-2014

The DIADEM Ontology

DIADEM 1.0

Yiyang Bao2, Xiaonan Guo2, Giorgio Orsi1,2, Christian Schallhart2, Cheng Wang2

1 Institute for the Future of Computing, University of Oxford

2 Department of Computer Science, University of Oxford

The languages of the web

HTML objects provide the data model of a web-page.

CSS boxes and properties provide the layout.

JavaScript provides web dynamics.

<html> <head> <title> </title> </head> <body> <div> … </div> </body> </html>

[Figure labels: ox:Property, xsd:string, ox:address, RealWorld, Web]

this.value.toLowerCase();

… ?

RDF annotations provide the conceptualization of the domain.

Why ontology?

Ontologies provide a conceptualization of a domain of interest (Gruber ’93).

[Figure labels: ox:Property, xsd:string, ox:address, ox:minPrice, ox:partOf, ox:priceSegment]

But… we do not only want to model the application domain.

We must model the domain of its web representations, i.e., its phenomenology.

In the end, it is also an ontology

Why ontology?

Can be used to complete an incomplete model.

Can be used to verify a model.

Must tolerate uncertainty and inconsistency.

A logical model for web extraction

Logical model for web entities

input and refinement forms.

result pages

page blocks (e.g., ads)

Phenomenological model

How logical entities are concretely represented

The building blocks

HTML entities

labels

fields (including links)

text-nodes and text attributes

<form>
  <label for="male">Male</label>
  <input type="radio" name="sex" id="male" />
  <label for="female">Female</label>
  <input type="radio" name="sex" id="female" />
</form>

<div>
  <span> Price: </span>
  <span> £ 250 </span>
</div>

Price: £ 250
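
As a rough illustration of how the three kinds of HTML entities listed above could be collected from markup like the two fragments shown, here is a minimal Python sketch built on the standard-library `html.parser`. The class `BlockCollector`, the helper `collect_blocks`, and the choice of which tags count as fields are assumptions made for this example; they are not part of DIADEM.

```python
from html.parser import HTMLParser


class BlockCollector(HTMLParser):
    """Collects labels, fields, and remaining text nodes from an HTML fragment."""

    # Assumption for this sketch: these tags are treated as fields (links included).
    FIELD_TAGS = {"input", "select", "textarea", "a"}

    def __init__(self):
        super().__init__()
        self.labels = []      # (value of the "for" attribute, label text)
        self.fields = []      # (tag name, attribute dict)
        self.text_nodes = []  # non-empty text found outside <label>
        self._in_label = False
        self._label_for = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "label":
            self._in_label = True
            self._label_for = attrs.get("for")
        elif tag in self.FIELD_TAGS:
            self.fields.append((tag, attrs))

    def handle_endtag(self, tag):
        if tag == "label":
            self._in_label = False
            self._label_for = None

    def handle_data(self, data):
        text = " ".join(data.split())  # normalise whitespace
        if not text:
            return
        if self._in_label:
            self.labels.append((self._label_for, text))
        else:
            self.text_nodes.append(text)


def collect_blocks(html):
    """Parses an HTML string and returns the populated collector."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    return collector
```

Feeding the form fragment yields two labels tied to the ids of two radio fields, while the price fragment yields two plain text nodes; pairing such text nodes into an attribute and its value is the job of the rules described later.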

Logical entities

constructs of our data model

Rules

describe the phenomenology

The form model

Goal: model web form phenomenology

The form model

Areas:

button

location

price

room

type

buy/rent

order-by

display

Root entity:

RealEstateForm

Properties:

partOf (hierarchical structures)

The form model: elements

price

type = {min, max}

purpose = {buy, rent}

currency

room

category = {bathroom, bedroom, …}

type = {min, max}

The form model: elements

display

per page

add-in-time

property type

button

submit

reset

map search

advance submit

link button

order-by

buy

rent

buy/rent

new/resale

SSTC

other

The form model: phenomenology

Based on linguistic annotations and (visual) heuristics.

buyElement(X,F) :-
    visibleField(X),
    hasAnnotationFeature(X,"majorType","reform.label"),
    hasAnnotationFeature(X,"minorType","buy"),
    not hasAnnotationFeature(X,"minorType","rent"),
    not hasAnnotationFeature(X,"minorType","includeSSTC"),
    group(Ns,_,_,F),
    #member(X,Ns).

radiusElement(X,F) :-
    visibleField(X),
    hasAnnotationFeature(X,"majorType","reform.label"),
    hasAnnotationFeature(X,"minorType","radius"),
    group(Ns,_,_,F),
    #member(X,Ns).
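
To make the reading of the buyElement rule concrete, the following Python sketch evaluates the same conditions over small in-memory fact sets. It is only an illustration: the function name `buy_elements`, the fact encodings, and the sample identifiers are assumptions, and the two anonymous arguments of `group` are simply dropped.

```python
def buy_elements(visible_fields, annotations, groups):
    """Returns the (X, F) pairs that satisfy the buyElement rule.

    visible_fields: set of node ids X for which visibleField(X) holds.
    annotations:    set of (X, feature, value) triples, i.e. hasAnnotationFeature.
    groups:         list of (members, form) pairs standing in for group(Ns,_,_,F).
    """
    def has(x, feature, value):
        return (x, feature, value) in annotations

    result = set()
    for members, form in groups:
        for x in members:  # corresponds to #member(X, Ns)
            if (
                x in visible_fields
                and has(x, "majorType", "reform.label")
                and has(x, "minorType", "buy")
                # Negated literals: X must not also be annotated as rent or SSTC.
                and not has(x, "minorType", "rent")
                and not has(x, "minorType", "includeSSTC")
            ):
                result.add((x, form))
    return result
```

With this encoding, a field annotated with both "buy" and "rent" is rejected by the negated literals, which is how the rule separates a pure buy element from a combined buy/rent element.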

The form model: segments

A segment is:

• a single element

• a group of elements

• a group of segments

• a pair <segment, label>
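
The recursive definition above can be sketched as a small Python data type. This is an assumption-laden simplification, not the DIADEM representation: the classes `Element` and `Segment` and the helper `elements_of` are hypothetical names, and the pair <segment, label> is folded into an optional `label` field on the segment.

```python
from dataclasses import dataclass
from typing import Optional, Tuple, Union


@dataclass(frozen=True)
class Element:
    """A single form element, identified here only by a name."""
    name: str


@dataclass(frozen=True)
class Segment:
    """A segment wraps one element, several elements, or several segments.

    An optional label models the pair <segment, label>.
    """
    children: Tuple[Union["Element", "Segment"], ...]
    label: Optional[str] = None


def elements_of(node):
    """Returns the names of all elements below a node, in document order."""
    if isinstance(node, Element):
        return [node.name]
    names = []
    for child in node.children:
        names.extend(elements_of(child))
    return names
```

For instance, a labelled price segment holding a minimum and a maximum element can be nested inside a form-level segment next to a button segment, and `elements_of` flattens the hierarchy back into its elements.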

Segments: buttons, geographic, price, room, property type, buy/rent, order-by, display, per page, add in time, new/resale, SSTC

Form

real-estate

The result-page model

Goal: model result-pages phenomenology

The result-page model

Attributes and values

e.g., < price, £ 250,000 >

Record

groups of pairs < attribute, value >

Data area

groups of records

Mandatory attribute(s)

must be present in a record

sanity check purposes
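
A minimal Python sketch of the result-page model described above, with mandatory attributes used as a sanity check. The names `Record`, `DataArea`, `missing_mandatory`, and `valid_records` are assumptions for illustration only.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Record:
    """A group of <attribute, value> pairs, e.g. {"price": "£ 250,000"}."""
    attributes: Dict[str, str]


@dataclass
class DataArea:
    """A group of records found on a result page."""
    records: List[Record] = field(default_factory=list)


def missing_mandatory(record, mandatory):
    """Lists the mandatory attributes that the record does not contain."""
    return [name for name in mandatory if name not in record.attributes]


def valid_records(area, mandatory):
    """Keeps only the records that contain every mandatory attribute."""
    return [r for r in area.records if not missing_mandatory(r, mandatory)]
```

A candidate record without a price would be filtered out when "price" is declared mandatory, which is one simple way to discard page blocks that merely look like records.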

A Conceptual Model for Data Extraction

Conceptual Modelling on the Web

Software modelling, e.g., UML and stereotypes

Ad hoc languages, e.g., WebML

Linking the domain ontology: OntoX

DIADEM Ontology: discussion

Expressive power

safe non-recursive Datalog (nr-datalog) with stratified negation and aggregation

pros: easy to compute

cons: not robust to uncertainty and inconsistencies
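
Why this fragment is easy to compute can be illustrated with a small Python sketch of stratified evaluation: relations in a lower stratum are fully computed before any rule negates or aggregates over them, so a single pass per stratum suffices. The function `evaluate` and the simplified (node, minorType) fact encoding are assumptions made for this example.

```python
def evaluate(annotations):
    """Evaluates a two-stratum program over (node, minor_type) facts.

    Returns the set of buy-only nodes and how many there are.
    """
    # Stratum 0: positive rules only.
    buy = {x for (x, minor) in annotations if minor == "buy"}
    rent = {x for (x, minor) in annotations if minor == "rent"}

    # Stratum 1: negation refers only to the already completed relation `rent`.
    buy_only = {x for x in buy if x not in rent}

    # Aggregation is likewise applied to a completed relation.
    return buy_only, len(buy_only)
```

The same structure also shows the weakness noted above: a single noisy "rent" annotation removes a node from `buy_only` outright, with no notion of how likely the annotation is to be wrong.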

Adaptability

result-page model is substantially domain independent

Form model is domain dependent (entity types)

• The number of entities is limited

Uncertainty, Vagueness and Inconsistencies

Origin

annotations are noisy

entity types are uncertain

Multiple models

probabilistic models

• Markov Logic Networks (Lukasiewicz and Simari)

• C-tables, Bayesian Networks (Olteanu)

ASP

• disjunctive models

• weak constraints

Thank you!
