annotating documents for the semantic web using data-extraction ontologies dissertation proposal...
Post on 21-Dec-2015
Annotating Documents for the Semantic Web Using Data-Extraction Ontologies
Dissertation Proposal
Yihong Ding
Motivation
• The representation of web content limits its usability
• A machine-understandable web
– Shared, explicit, formal conceptualizations (ontologies)
– The semantic web
A Problem
• How can the current web be transformed into the semantic web?
A Solution: Semantic Annotation
• Add explicit, formal, and unambiguous metadata to web documents
• Explicit: publicly accessible
• Formal: publicly agreeable
• Unambiguous: publicly identifiable
Annotation Representation
Explicit Annotation
Implicit Annotation
Semantic Annotation: Current Research Status
• Manual annotation through friendly interfaces [Annotea, etc.]
• Automatic annotation with ontology generation [SCORE]
• Automatic annotation using automated IE tools based on predefined ontologies [SemTag, MnM, etc.]
Current Automatic Annotator: A Typical Paradigm
[Diagram labels: Domain Ontology; Non-ontology-based IE Wrapper; Rules and extracting categories; Document]
(1) Extraction
(2) Alignment
(3) Annotation
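The three steps above can be sketched roughly as follows. This is an illustrative sketch only, not the code of any cited system: the rule names, the regular expressions, the `ALIGNMENT` table, and the `Car.*` concept names are all assumptions made up for the example. The point it shows is that the wrapper's categories are independent of the ontology, so a separate alignment step is needed before annotation.

```python
import re

# Hypothetical wrapper rules: each maps a wrapper-specific category to a regex.
WRAPPER_RULES = {
    "price_field": re.compile(r"\$\d{1,3}(?:,\d{3})*"),
    "year_field": re.compile(r"\b(?:19|20)\d{2}\b"),
}

# Hypothetical hand-written alignment from wrapper categories to ontology concepts.
ALIGNMENT = {
    "price_field": "Car.Price",
    "year_field": "Car.Year",
}


def extract(text):
    """Step 1: run the wrapper rules; the output knows nothing about the ontology."""
    hits = []
    for category, pattern in WRAPPER_RULES.items():
        for match in pattern.finditer(text):
            hits.append((category, match.group(), match.start()))
    return sorted(hits, key=lambda hit: hit[2])


def align(hits):
    """Step 2: map each wrapper category onto an ontology concept after the fact."""
    return [(ALIGNMENT[cat], value, pos) for cat, value, pos in hits if cat in ALIGNMENT]


def annotate(aligned):
    """Step 3: emit one annotation record per aligned value."""
    return [{"concept": c, "value": v, "offset": p} for c, v, p in aligned]


annotations = annotate(align(extract("2003 Honda Accord, $8,500")))
```

Any category missing from the alignment table is silently dropped in step 2, which is one place where post-hoc mapping can lose information.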
Current Automatic Annotator: Problems
[Diagram labels: Domain Ontology; Non-ontology-based IE Wrapper; Rules and extracting categories; Document]
(1) Problem of data recognition
(2) Problem of concept disambiguation
(3) Problem of annotation formatting, storing, indexing, sharing
(4) Problem of assembling ontologies
“Main Drawback of Using Automated IE”
[Kiryakov04]
• “none of these approaches expects an input or produces output with respect to ontologies”
• “a set of heuristics for post-processing and mapping of the IE results to an ontology … not sufficient for large-scale, domain-independent semantic annotation.”
• “IE and wrapper induction techniques need to use the ontology more directly during the process of extraction.”
Ontology-driven Paradigm (Data-Extraction Ontology) for Semantic Annotation
[Diagram labels: Document; Non-ontology-based IE Wrapper; Ontology-based IE Wrapper; Document]
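As a rough illustration of the ontology-driven idea, the sketch below attaches a recognizer directly to each ontology concept, so extracted values are already expressed in ontology terms and no separate alignment step is needed. It is a simplified stand-in, not the proposal's actual data-extraction ontology format: the `ObjectSet` class, the toy car-ad ontology, the regular expressions, and the keyword window are all assumptions.

```python
import re
from dataclasses import dataclass, field


@dataclass
class ObjectSet:
    """One ontology concept plus its recognizer (a much-simplified data frame)."""
    name: str
    value_pattern: str
    keywords: list = field(default_factory=list)


# Hypothetical toy ontology for car ads.
CAR_ONTOLOGY = [
    ObjectSet("Price", r"\$\d{1,3}(?:,\d{3})*"),
    ObjectSet("Mileage", r"\d{1,3}(?:,\d{3})*", keywords=["miles", "mi."]),
]


def annotate_with_ontology(text, ontology, window=12):
    """Return (concept, value) pairs; the output is already in ontology terms."""
    results = []
    for object_set in ontology:
        for match in re.finditer(object_set.value_pattern, text):
            context = text[match.end():match.end() + window].lower()
            if object_set.keywords and not any(k in context for k in object_set.keywords):
                continue  # the value looks right but the expected keyword is missing
            results.append((object_set.name, match.group()))
    return results


pairs = annotate_with_ontology("Asking $8,500, only 62,000 miles", CAR_ONTOLOGY)
```

In this toy version the context keywords do the disambiguation: the number inside the price is rejected as a mileage because no mileage keyword follows it.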
Ontology-driven Paradigm for Semantic Annotation: Some Arguments
• Resiliency w.r.t. web page layouts (helps scale to large sets of web pages)
• Adaptiveness w.r.t. domain specifications (helps scale to large domains)
• Creation of ontologies: still a problem but no longer a drawback
• Speed of execution: still a drawback (but we propose a solution next)
Two-Layer Annotation Model
[Diagram labels: Conceptual Annotator using an ontology-based IE tool; Document; Structural Annotator; Sample Annotation Process; Similar Documents; Massive Annotation Process]
Structural Annotator
• Major components
– HTML hierarchical path that leads to concept locations
– Local context around locations
– Dependencies among multiple semantic categories
• Significance
– Identify both categories and their semantic meanings
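The first component, the HTML hierarchical path, can be sketched as below. This is a minimal sketch under several assumptions: pages are well-nested HTML without void tags, a path is just the chain of tag names (with `class` attributes), and the sample annotations come from some conceptual annotator as (concept, value) pairs. Local context and category dependencies are not modeled, and all function names are hypothetical.

```python
from html.parser import HTMLParser


class PathCollector(HTMLParser):
    """Record (tag path, text) for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.nodes = []

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        self.stack.append(f"{tag}.{cls}" if cls else tag)

    def handle_endtag(self, tag):
        if self.stack:  # assumes well-nested markup without void tags
            self.stack.pop()

    def handle_data(self, data):
        if data.strip():
            self.nodes.append(("/".join(self.stack), data.strip()))


def text_nodes(html):
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    return collector.nodes


def learn_paths(sample_html, sample_annotations):
    """Map each concept to the path where the conceptual annotator found its value."""
    value_to_concept = {value: concept for concept, value in sample_annotations}
    return {value_to_concept[text]: path
            for path, text in text_nodes(sample_html) if text in value_to_concept}


def annotate_by_structure(html, concept_paths):
    """Annotate a similar page by path lookup alone, with no ontology matching."""
    path_to_concept = {path: concept for concept, path in concept_paths.items()}
    return [(path_to_concept[path], text)
            for path, text in text_nodes(html) if path in path_to_concept]
```

Once the paths are learned from one sample page, similarly generated pages can be annotated by lookup alone, which is where a speed gain over running the conceptual annotator on every page would come from.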
Ontology Factors in Semantic Annotation Tasks
• Knowledge specification
– Semantic web community
– Web Ontology Language (OWL)
• Knowledge instantiation
– IE and database community
– Object-oriented System Model in XML (OSMX)
Ontology Conversion
• Similarities (OWL vs. OSMX)
– Class vs. object set
– ObjectProperty vs. relationship set
– Cardinality restriction vs. participation constraint
– subClassOf vs. is-a relationship
• Unique features
– OWL
  • subPropertyOf
  • symmetric and transitive property
  • namespace declaration
  • ontology importing
– OSMX
  • arbitrary n-ary relationship sets
  • data frames
  • general constraints
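The correspondences above suggest a table-driven converter. The sketch below only illustrates that idea: representing an ontology as a list of (construct kind, name) pairs is an assumption, as are the function and variable names. Constructs with no counterpart on the other side are returned separately rather than dropped, since they are the cases a real converter would have to handle specially.

```python
# Correspondences listed above (OWL construct -> OSMX construct).
OWL_TO_OSMX = {
    "Class": "object set",
    "ObjectProperty": "relationship set",
    "cardinality restriction": "participation constraint",
    "subClassOf": "is-a relationship",
}
OSMX_TO_OWL = {osmx: owl for owl, osmx in OWL_TO_OSMX.items()}


def convert_constructs(constructs, mapping):
    """Split (kind, name) pairs into (converted, unmapped) according to the mapping."""
    converted, unmapped = [], []
    for kind, name in constructs:
        if kind in mapping:
            converted.append((mapping[kind], name))
        else:
            unmapped.append((kind, name))
    return converted, unmapped


converted, unmapped = convert_constructs(
    [("object set", "Car"), ("data frame", "Price")], OSMX_TO_OWL
)
```

Here the object set converts to a class, while the data frame lands in `unmapped` because it has no direct OWL counterpart in the table.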
Ontology Construction: An Unavoidable Problem
• Semantic annotation tasks require ontologies.
• The ontology for a specific semantic annotation task is not guaranteed to be available.
Ontology Construction: General and Special
• Generally speaking
– Until now, the mainstream approach has been manual construction
– Automatic and semi-automatic ontology generation: many research papers, few or no practical systems; a very hard problem
• Specific to semantic annotation
– Very dynamic and variable domains
– Much overlapping information
– Limited scope for any one web page
– Flat structure
Ontology Construction: Knowledge Reuse
• “What has been will be again, what has been done will be done again; there is nothing new under the sun.” (The Holy Bible, Ecclesiastes, 1:9, NIV translation)
• A “new” ontology is a new assembly, built from unions and projections of several pre-existing ontologies.
Architecture on Dynamically Assembling
[Diagram labels: Domain of Interest; Web Page; Collection of Knowledge; Selected Knowledge Components; Assembled Ontology]
(1) Knowledge-component selection
(2) Ontology assembly
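The two steps, selection followed by assembly through projection and union, might look like the sketch below. Everything here is an assumption made for illustration: a knowledge component is modeled as a dict of concept names and (subject, relation, object) triples, and a component is considered relevant when one of its concept names appears in the page text, which is a deliberately naive substring test.

```python
def select_components(knowledge_base, page_text):
    """Step 1: keep components with at least one concept named in the page text."""
    text = page_text.lower()
    return [component for component in knowledge_base
            if any(concept.lower() in text for concept in component["concepts"])]


def project(component, page_text):
    """Keep only the concepts the page mentions and the relationships among them."""
    text = page_text.lower()
    kept = {c for c in component["concepts"] if c.lower() in text}
    rels = {(a, r, b) for a, r, b in component["relationships"]
            if a in kept and b in kept}
    return {"concepts": kept, "relationships": rels}


def assemble(components, page_text):
    """Step 2: union the projections of the selected components."""
    ontology = {"concepts": set(), "relationships": set()}
    for component in components:
        part = project(component, page_text)
        ontology["concepts"] |= part["concepts"]
        ontology["relationships"] |= part["relationships"]
    return ontology


# Hypothetical knowledge collection with three small components.
KNOWLEDGE_BASE = [
    {"concepts": {"Car", "Price", "Engine"},
     "relationships": {("Car", "has", "Price"), ("Car", "has", "Engine")}},
    {"concepts": {"Dealer", "Phone"},
     "relationships": {("Dealer", "has", "Phone")}},
    {"concepts": {"Recipe", "Ingredient"},
     "relationships": {("Recipe", "uses", "Ingredient")}},
]
page = "Car for sale, price $8,500. Call the dealer, phone 555-0100."
assembled = assemble(select_components(KNOWLEDGE_BASE, page), page)
```

The assembled result draws on two components and drops the unmentioned `Engine` concept, giving the small, flat ontology that a single web page typically needs.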
Thesis Statement
Propose a new solution for semantic annotation of ordinary HTML web pages, specifically:
1. apply ontology-based automatic IE techniques
2. augment OWL with a knowledge-recognition extension
3. combine a conceptual annotator and a layout-based annotator
4. dynamically assemble a new domain ontology for an annotation task
Standard Evaluation
• Annotation performance
– Precision
– Recall
– Speed of execution
• Test bed
– 5 to 10 different domains, with over 10 lexical concepts in each domain ontology
– 20 to 50 web pages in each domain
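For reference, precision and recall over annotations can be computed as below. The sketch assumes each annotation is a hashable item such as a (concept, value) pair and that a gold-standard set exists for each page; the function name and the convention of returning 0.0 for an empty set are choices made for this example.

```python
def precision_recall(predicted, gold):
    """Precision = correct / predicted; recall = correct / gold."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


gold = {("Price", "$8,500"), ("Year", "2003"), ("Make", "Honda"), ("Model", "Accord")}
predicted = {("Price", "$8,500"), ("Year", "2003"), ("Make", "Accord")}
precision, recall = precision_recall(predicted, gold)
```

In this toy case two of the three predicted annotations are correct and two of the four gold annotations are found.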
Ontology Converter Test
• A complete and sound check is costly and difficult to implement.
• Our simple test
– Start with an OSMX ontology A
– Convert it to OWL and then transform it back into an OSMX ontology B
– Use both A and B to annotate the same set of web pages (say 30 to 50 web pages)
– Annotation results should be identical
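The round-trip test can be expressed as a small harness. This is a sketch only: the converters and the annotator are passed in as functions, and the `toy_*` stand-ins below (an "ontology" as a dict of regular expressions, lossless wrap and unwrap converters) are assumptions that exist purely to make the example runnable. They say nothing about the real OSMX or OWL formats.

```python
import re


def round_trip_mismatches(ontology_a, pages, to_owl, to_osmx, annotate):
    """Return the pages on which ontology A and its round-tripped copy B disagree."""
    ontology_b = to_osmx(to_owl(ontology_a))
    return [page for page in pages
            if annotate(ontology_a, page) != annotate(ontology_b, page)]


# Stand-ins (assumptions): an "ontology" is just {concept: regex}, and the
# converters wrap and unwrap it without loss.
def toy_to_owl(osmx):
    return {"classes": dict(osmx)}


def toy_to_osmx(owl):
    return dict(owl["classes"])


def toy_annotate(ontology, page):
    return sorted((concept, value)
                  for concept, pattern in ontology.items()
                  for value in re.findall(pattern, page))


pages = ["Asking $8500 for a 2003 model", "Reduced to $7900"]
ontology_a = {"Price": r"\$\d+", "Year": r"\b(?:19|20)\d{2}\b"}
mismatches = round_trip_mismatches(ontology_a, pages, toy_to_owl, toy_to_osmx, toy_annotate)
```

An empty result means the test passed on these pages. It does not prove the converter is sound, only that any information lost in the round trip did not affect annotation of this page set.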
Two-Layer Annotation Model Evaluation
• Standard evaluation
• In addition
– About five large web sites with machine-generated web pages, each of which contains at least dozens of web pages
Dynamic Ontology Assembler Evaluation
• Regular precision and recall study according to selected knowledge components
• A pilot study on when the ontology assembler works better than manual ontology construction
– Record the time to use a tool to create an ontology from scratch
– Record the time to assemble the same ontology
– Compare their differences and the special conditions for each case
– Make empirical suggestions about how to build a knowledge base that favors ontology assembly
Delimitations
• Automatic ontology creation from scratch
• Annotation storing, indexing, and sharing mechanisms
• Semantic annotation for multimedia content
• Parallel or distributed computing to further scale the semantic annotation system to a large number of web pages
Contributions
• Converting current web pages into machine-understandable semantic web pages
• Producing a purely ontology-driven semantic annotator using an ontology-based IE wrapper
• Proposing a novel two-layer annotation model for fast, accurate, and resilient annotation
• Studying a dynamic ontology assembler that helps maximize the reuse of existing knowledge and minimize the load of manual ontology creation
• Implementing an ontology converter so that this work is useful to the rest of the semantic web community