table recognition
Post on 22-Nov-2014
The DIADEM Ontology
DIADEM 1.0
Yiyang Bao2, Xiaonan Guo2, Giorgio Orsi1,2, Christian Schallhart2, Cheng Wang2
1 Institute for the Future of Computing, University of Oxford
2 Department of Computer Science, University of Oxford
The languages of the web
HTML objects provide the data model of a web-page.
CSS boxes and properties provide the layout.
Javascript provides web dynamics.
<html> <head> <title> </title> </head> <body> <div> … </div> </body> </html>
ox:Property
xsd:string
ox:address
RealWorld
Web
this.value.toLowerCase();
… ?
RDF annotations provide the conceptualization of the domain.
Why ontology?
Ontologies provide a conceptualization of a domain of interest (Gruber '93).
ox:Property
xsd:string
ox:address
ox:minPrice
ox:partOf
ox:priceSegment
But… we do not only want to model the application domain.
We must model the domain of its web representations, i.e., its phenomenology.
In the end, it is also an ontology
Why ontology?
Can be used to complete an incomplete model.
Can be used to verify a model.
Must tolerate uncertainty and inconsistency.
A logical model for web extraction
Logical model for web entities
input and refinement forms.
result pages
page blocks (e.g., ads)
…
Phenomenological model
How logical entities are concretely represented
The building blocks
HTML entities
labels
fields (including links)
text-nodes and text attributes
<form> <label for="male">Male</label> <input type="radio" name="sex" id="male" /> <label for="female">Female</label> <input type="radio" name="sex" id="female" /></form>
<div> <span> Price: </span> <span> £ 250 </span></div>
Price: £ 250
Logical entities
constructs of our data model
Rules
describe the phenomenology
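The "Price: £ 250" fragment above can be turned into a label/value pair with a few lines of Python. This is only a minimal sketch, not DIADEM's implementation: the class `TextNodeCollector`, the function `label_value_pairs`, and the heuristic "a text node ending in a colon is a label for the next text node" are all assumptions made for illustration.

```python
from html.parser import HTMLParser


class TextNodeCollector(HTMLParser):
    """Collects the non-empty text nodes of a fragment, in document order."""

    def __init__(self):
        super().__init__()
        self.text_nodes = []

    def handle_data(self, data):
        # Normalise whitespace and skip nodes that contain only whitespace.
        text = " ".join(data.split())
        if text:
            self.text_nodes.append(text)


def label_value_pairs(fragment):
    """Pair each text node ending in ':' with the text node that follows it.

    The colon heuristic is a hypothetical stand-in for a real phenomenology rule.
    """
    collector = TextNodeCollector()
    collector.feed(fragment)
    collector.close()
    nodes = collector.text_nodes
    pairs = []
    i = 0
    while i + 1 < len(nodes):
        if nodes[i].endswith(":"):
            pairs.append((nodes[i].rstrip(":").strip(), nodes[i + 1]))
            i += 2
        else:
            i += 1
    return pairs


fragment = "<div> <span> Price: </span> <span> £ 250 </span></div>"
pairs = label_value_pairs(fragment)
print(pairs)
```

The HTML entities (two spans with text nodes) are the raw material; the pair is the logical entity; the colon heuristic plays the role of a rule.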
The form model
Goal: model web form phenomenology
Areas:
button
location
price
room
type
buy/rent
order-by
display
Root entity:
RealEstateForm
Properties:
partOf (hierarchical structures)
The form model: elements
price
type = {min, max}
purpose = {buy, rent}
currency
room
category = {bathroom, bedroom, …}
type = {min, max}
The form model: elements
display
per page
add-in-time
property type
button
submit
reset
map search
advance submit
link button
order-by
buy
rent
buy/rent
new/resale
SSTC
other
The form model: phenomenology
Based on linguistic annotations and (visual) heuristics.
buyElement(X,F) :-
    visibleField(X),
    hasAnnotationFeature(X,"majorType","reform.label"),
    hasAnnotationFeature(X,"minorType","buy"),
    not hasAnnotationFeature(X,"minorType","rent"),
    not hasAnnotationFeature(X,"minorType","includeSSTC"),
    group(Ns,_,_,F),
    #member(X,Ns).

radiusElement(X,F) :-
    visibleField(X),
    hasAnnotationFeature(X,"majorType","reform.label"),
    hasAnnotationFeature(X,"minorType","radius"),
    group(Ns,_,_,F),
    #member(X,Ns).
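To make the reading of the two rules concrete, the following Python sketch evaluates them over a tiny, made-up fact base. The field ids (`f1`, `f2`, `f3`), the form id `form1`, and the simplification of `group/4` to a (members, form) pair are assumptions for illustration; the two middle arguments of `group`, which the rules ignore, are dropped.

```python
# Hypothetical facts; none of these ids come from the slides.
visible_fields = {"f1", "f2", "f3"}
annotation_features = {
    ("f1", "majorType", "reform.label"),
    ("f1", "minorType", "buy"),
    ("f2", "majorType", "reform.label"),
    ("f2", "minorType", "buy"),
    ("f2", "minorType", "rent"),      # buy AND rent: excluded by the negation
    ("f3", "majorType", "reform.label"),
    ("f3", "minorType", "radius"),
}
# group(Ns, _, _, F) simplified to (Ns, F).
groups = [(("f1", "f2", "f3"), "form1")]


def has_feature(x, name, value):
    return (x, name, value) in annotation_features


def is_reform_label(x):
    return x in visible_fields and has_feature(x, "majorType", "reform.label")


def buy_elements():
    """Mirror of buyElement(X,F): a 'buy' label that is neither 'rent' nor 'includeSSTC'."""
    return {
        (x, form)
        for members, form in groups
        for x in members
        if is_reform_label(x)
        and has_feature(x, "minorType", "buy")
        and not has_feature(x, "minorType", "rent")
        and not has_feature(x, "minorType", "includeSSTC")
    }


def radius_elements():
    """Mirror of radiusElement(X,F)."""
    return {
        (x, form)
        for members, form in groups
        for x in members
        if is_reform_label(x) and has_feature(x, "minorType", "radius")
    }


print(buy_elements())
print(radius_elements())
```

Because the negated atoms only refer to base facts, a single pass suffices here; this is the simplest case of stratified negation.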
The form model: segments
A segment is:
o a single element
o a group of elements
o a group of segments
o a pair <segment, label>
Segments: buttons, geographic, price, room, property type, buy/rent, order-by, display, per page, add in time, new/resale, SSTC
Form
real-estate
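The recursive definition of a segment maps naturally onto a small algebraic data type. The sketch below is one possible encoding, assuming the names `Element`, `Group`, and `Labelled` (none of which appear in the slides); since a single element is itself a segment, one `Group` type covers both "group of elements" and "group of segments".

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Element:
    """A single form element."""
    name: str


@dataclass(frozen=True)
class Group:
    """A group of elements or of segments."""
    members: Tuple["Segment", ...]


@dataclass(frozen=True)
class Labelled:
    """A pair <segment, label>."""
    segment: "Segment"
    label: str


Segment = Union[Element, Group, Labelled]


def elements_of(segment):
    """Return the names of all elements in a segment, in order."""
    if isinstance(segment, Element):
        return [segment.name]
    if isinstance(segment, Group):
        return [name for member in segment.members for name in elements_of(member)]
    if isinstance(segment, Labelled):
        return elements_of(segment.segment)
    raise TypeError(f"not a segment: {segment!r}")


# Hypothetical example: a labelled price range next to a submit button.
price = Labelled(Group((Element("min_price"), Element("max_price"))), "Price")
form = Group((price, Element("submit")))
print(elements_of(form))
```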
The result-page model
Goal: model result-pages phenomenology
Attributes and values
e.g., < price, £ 250,000 >
Record
groups of pairs < attribute, value >
Data area
groups of records
Mandatory attribute(s)
must be present in a record
used for sanity checks
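A minimal sketch of the result-page model, assuming that `price` is the mandatory attribute and that records are plain attribute-to-value mappings. The names `Record`, `DataArea`, and `valid_records`, as well as the sample values other than "£ 250,000", are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List

# Assumption: 'price' is the mandatory attribute for this domain.
MANDATORY_ATTRIBUTES = frozenset({"price"})


@dataclass
class Record:
    """A group of <attribute, value> pairs."""
    pairs: Dict[str, str]


@dataclass
class DataArea:
    """A group of records."""
    records: List[Record]


def valid_records(area, mandatory=MANDATORY_ATTRIBUTES):
    """Sanity check: keep only the records that carry every mandatory attribute."""
    return [r for r in area.records if mandatory.issubset(r.pairs)]


area = DataArea([
    Record({"price": "£ 250,000", "bedrooms": "3"}),
    Record({"bedrooms": "2"}),  # no price: probably not a real record
])
kept = valid_records(area)
print(len(kept), kept[0].pairs["price"])
```

A candidate record that lacks a mandatory attribute is likely a mis-segmented block (for example, an ad), which is why the check is useful as a filter.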
A Conceptual Model for Data Extraction
Conceptual Modelling on the Web
Software modelling, e.g., UML and stereotypes
Ad hoc languages, e.g., WebML
Linking the domain ontology: OntoX
DIADEM Ontology: discussion
Expressive power
safe nr-datalog with stratified negation and aggregation
pros: easy to compute
cons: not robust to uncertainty and inconsistencies
Adaptability
The result-page model is substantially domain independent
The form model is domain dependent (entity types)
• The number of entities is limited
Uncertainty, Vagueness and Inconsistencies
Origin
annotations are noisy
entity types are uncertain
Multiple models
probabilistic models
• Markov Logic Networks (Lukasiewicz and Simari)
• C-tables, Bayesian Networks (Olteanu)
ASP
• disjunctive models
• weak constraints
Thank you!