CORPORUM-OntoExtract Ontology Extraction Tool Author: Robert Engels Company: CognIT a.s

Post on 13-Jan-2016

CORPORUM-OntoExtract

Ontology Extraction Tool

Author: Robert Engels

Company: CognIT a.s

Overview

1. On-To-Knowledge project

2. CORPORUM

3. CORPORUM-OntoExtract

4. Discussion

5. Conclusion

What is the On-To-Knowledge (OTK) project?

Goals: develop tools and methods that support knowledge management based on sharable and reusable knowledge ontologies. The technical backbone of On-To-Knowledge is the use of ontologies for the various tasks of information integration and mediation.

What is the On-To-Knowledge (OTK) project?

• European project in the EU Information Society Technologies (IST) Program: EU-IST-10132

• Duration: 2.5 years, January 2000 - June 2002

• Total effort and cost: 26 person-years, 2.5+ M EUR

• Partners:

1. CognIT a.s

2. AIdministrator

3. AIFB (University of Karlsruhe)

4. BT Research

5. Enersearch

6. Swiss Life Information Systems Research Group

CognIT a.s

• Established in Halden, Norway in 1996.

• 20 employees - 3 with PhD

• CORPORUM™

• Develops technology for:

1. intelligent search by means of agents

2. text analysis and extraction

3. structuring and fusing data to build knowledge

4. knowledge bases and feedback of experience

5. data mining and text mining

On-To-Knowledge workbench

• CORPORUM-OntoExtract: extracts ontologies from unstructured documents and represents them in XML/RDF/OWL

• CORPORUM-OntoWrapper: extracts ontologies from structured documents and represents them in XML/RDF/OWL

• RDF-DB (Sesame)

• RDF-Ferret: interface between users and RQL

• OntoEdit (Ontology Editor)

• RQL engine: query RDF-DB

• DAML+OIL: representation language

The On-To-Knowledge system architecture

Introduction of CORPORUM

CORPORUM is a tool for information retrieval and extraction developed by CognIT a.s.

Features:

• crawls the internet and intranets

• analyzes relevance and content

• maintains a knowledge base (RDF-DB)

• focuses on the content

• performs searches, cataloguing, summaries and extractions according to user interests

• is founded on CognIT's Mimir technology

The overall CORPORUM architecture

Introduction of CORPORUM

Core technology: Mimir includes

• Linguistic analysis at all levels, which generates an ontology of interest to the user in RDF.

• Similarity analysis: retrieves the documents that are most pertinent to a specific analyzed text (information retrieval and extraction).
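The slides do not say how Mimir measures pertinence. As a rough illustration only, the sketch below ranks documents against an analyzed text using cosine similarity over word counts. This is a stand-in technique, not Mimir's method, and the function names and sample documents are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine between two word-count vectors (0.0 if either is empty)."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_pertinence(analyzed_text, documents):
    """Return (name, score) pairs sorted from most to least similar."""
    query = bag_of_words(analyzed_text)
    scores = [(name, cosine_similarity(query, bag_of_words(body)))
              for name, body in documents.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical documents, for illustration only.
documents = {
    "ontology.txt": "ontology extraction from unstructured text documents",
    "energy.txt": "energy markets and electricity pricing in Sweden",
}
ranking = rank_by_pertinence("extracting an ontology from text", documents)
```

Here `ranking` lists the ontology document first, because it shares words with the analyzed text and the other document shares none.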

“Classical” Natural Language processing decomposed.

Mimir architecture

Introduction of CORPORUM

Information distribution

Histogram showing where the desired content can be found in the document and to what degree it is pertinent.

CORPORUM-OntoExtract:

• A web-based version of CORPORUM

• Uses the same architecture as CORPORUM

• Extracts ontologies from unstructured web pages

• Represents the extracted ontologies in XML/RDF/OIL

CORPORUM-OntoExtract:

• CMOntoBuild: takes care of the overall control of the system and co-ordinates all information flows

• CMWebHandler: responsible for collecting all (text) documents from a specific site

• CMCogLib: analyses texts, extracts information and exports the results in a variety of formats

• CMLexEn: language-dependent support module for CMCogLib

• CMWebInteract: communication component that handles all interaction between CORPORUM-OntoExtract and the RDF database. It is responsible for querying the RDF-DB and for submitting the final analysis results.

• DOMhandler: integrated in CMWebInteract, the OpenXML DOM handler interprets the results returned by the RDF server

CORPORUM-OntoExtract performs the following tasks:

• CMOntoBuild is invoked by the user

• CMOntoBuild invokes CMWebHandler

• CMWebHandler retrieves the specified domain from the intranet/internet and returns it to CMOntoBuild

• CMOntoBuild passes the texts to CMCogLib, which analyses and interprets them, extracts information, and returns a basic RDF representation to CMOntoBuild

• CMOntoBuild then analyses the generated RDF and queries the RDF ontology repository for knowledge that can augment it

• When all possible queries have been performed and the RDF has been augmented, the final RDF ontology for a specific document is sent to the RDF server together with a reference to the original text.
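The flow of tasks above can be sketched in a few lines of Python. This is a toy model under loud assumptions: the functions are hypothetical stand-ins named after the components, plain dictionaries replace real RDF, "extraction" is reduced to picking capitalised words, and the repository is an in-memory dictionary rather than an RDF server.

```python
def web_handler(site, pages):
    """Stand-in for CMWebHandler: collect the text documents of one site."""
    return {url: text for url, text in pages.items() if url.startswith(site)}


def cog_lib(text):
    """Stand-in for CMCogLib: 'extract' concepts (here: capitalised words)."""
    concepts = sorted({word.strip(".,") for word in text.split()
                       if word[:1].isupper()})
    return {"concepts": concepts, "related": {}}


def web_interact_query(repository, concept):
    """Stand-in for CMWebInteract querying the RDF-DB for known relations."""
    return repository.get(concept, [])


def onto_build(site, pages, repository):
    """Stand-in for CMOntoBuild: co-ordinate the flow and submit results."""
    submitted = []
    for url, text in web_handler(site, pages).items():
        rdf = cog_lib(text)  # basic representation of one document
        for concept in rdf["concepts"]:  # augmentation from the repository
            known = web_interact_query(repository, concept)
            if known:
                rdf["related"][concept] = known
        # The final result is submitted with a reference to the original text.
        submitted.append({"source": url, "ontology": rdf})
    return submitted


# Hypothetical inputs, for illustration only.
pages = {
    "http://example.org/a": "Sesame stores RDF.",
    "http://other.org/b": "Ignored page.",
}
repository = {"Sesame": ["RDF-DB"]}
result = onto_build("http://example.org", pages, repository)
```

Only the page on the requested site is processed, and the concept already known to the repository is augmented with its stored relation.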

Client/Server based System Architecture of CORPORUM-OntoExtract

The overall CORPORUM architecture

CORPORUM-OntoExtract output:

• Namespace definitions

• Dublin Core based metadata

• Property definitions

• Ontology

• Facts/instances

• Cross-taxonomic relations
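The slides do not show actual output, so the following is only a guess at how the six parts listed above could fit together in one RDF/XML document. It uses Python's standard `xml.etree.ElementTree`; the `ex` namespace, the class names, the `relatedTo` property and the page URL are all hypothetical.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
DC = "http://purl.org/dc/elements/1.1/"
EX = "http://example.org/onto#"  # hypothetical namespace for extracted terms

# 1. Namespace definitions
for prefix, uri in (("rdf", RDF), ("rdfs", RDFS), ("dc", DC), ("ex", EX)):
    ET.register_namespace(prefix, uri)

root = ET.Element(f"{{{RDF}}}RDF")

# 2. Dublin Core based metadata about the analysed document
doc = ET.SubElement(root, f"{{{RDF}}}Description",
                    {f"{{{RDF}}}about": "http://example.org/page.html"})
ET.SubElement(doc, f"{{{DC}}}title").text = "Example page"

# 3. Property definitions
ET.SubElement(root, f"{{{RDF}}}Property", {f"{{{RDF}}}ID": "relatedTo"})

# 4. Ontology: a class and its place in the taxonomy
tool = ET.SubElement(root, f"{{{RDFS}}}Class",
                     {f"{{{RDF}}}ID": "ExtractionTool"})
ET.SubElement(tool, f"{{{RDFS}}}subClassOf", {f"{{{RDF}}}resource": "#Tool"})

# 5. Facts/instances
ET.SubElement(root, f"{{{EX}}}ExtractionTool",
              {f"{{{RDF}}}ID": "OntoExtract"})

# 6. Cross-taxonomic relation: a link that is not part of the class hierarchy
ET.SubElement(tool, f"{{{EX}}}relatedTo",
              {f"{{{RDF}}}resource": "#Repository"})

rdf_xml = ET.tostring(root, encoding="unicode")
```

Printing `rdf_xml` gives a single `rdf:RDF` element whose children correspond to the parts in the list.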

Discussion on use of CORPORUM technology in OntoExtract

Content in natural language vs. content in structure

• CORPORUM-OntoExtract can capture content without considering the layout and structure of the texts.

• In some cases the structure of a text has to be considered, for example in contracts and licenses.

• CORPORUM-OntoWrapper handles such structured documents.

Discussion on use of CORPORUM technology in OntoExtract

Diversity of web pages (unknown intention)

• Documents on the web are very diverse

• It is difficult to analyze a text according to the intention of its writer

• Combining CORPORUM-OntoExtract with CORPORUM-OntoWrapper might address some of these issues

Discussion on use of CORPORUM technology in OntoExtract

Representational issues (A-box vs. T-box reasoning)

• TBox: consists of concept (class) inclusion and/or equivalence axioms, e.g. "C subsumes D".

• ABox: consists of individual/tuple membership axioms, e.g. "x is an instance of C" or "<x,y> is an instance of R".

• Most of the knowledge generated by CORPORUM-OntoExtract is TBox knowledge.
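To make the TBox/ABox distinction concrete, here is a minimal sketch with made-up concepts and individuals. It is not how CORPORUM-OntoExtract or any description-logic reasoner stores knowledge; it only shows inclusion axioms on one side and membership axioms on the other.

```python
# TBox: concept inclusion axioms, stored as child -> parent,
# i.e. the parent concept subsumes the child concept.
tbox = {"ExtractionTool": "Tool", "Tool": "Software"}

# ABox: membership axioms for individuals ("x is an instance of C") ...
abox_instances = {"OntoExtract": "ExtractionTool"}
# ... and for pairs ("<x,y> is an instance of R").
abox_relations = {("OntoExtract", "Sesame"): "storesResultsIn"}


def subsumes(general, specific):
    """True if `general` subsumes `specific` via the TBox inclusions."""
    current = specific
    while current is not None:
        if current == general:
            return True
        current = tbox.get(current)  # climb one level; None at the top
    return False


def is_instance_of(individual, concept):
    """ABox membership, lifted through the TBox hierarchy."""
    asserted = abox_instances.get(individual)
    return asserted is not None and subsumes(concept, asserted)
```

With these axioms, `subsumes("Software", "ExtractionTool")` holds, and `is_instance_of("OntoExtract", "Tool")` follows from one ABox fact plus one TBox inclusion.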

Discussion on use of CORPORUM technology in OntoExtract

Domain specificity of extracted knowledge

• Since the ontologies are extracted from specified domains, the extracted information is expected to be restricted to these domains.

• Positive: many searches will also be rather domain-specific, so knowledge about cross-taxonomic relations may come in very handy.

• Negative: one may want to build domain-independent knowledge bases.

Conclusion

• CORPORUM helps the web become more semantic.

• Semantic-based technology.

• Enhances the usability of formal knowledge representations for end users

• Decreases the initial effort of defining an ontology in a new domain

Conclusion (continued)

• Dynamicity of the analysis, i.e. ease of use in dynamic environments

• Offers new ways of navigating knowledge bases and document sets by visualizing contents and by means of semantic-based graphic structures

• Extraction of content-based metadata from documents, such as important concepts, semantic structures, etc.

• Ability to offer domain-specific information as related keywords

Comments

• The description is too general, with no examples or details.

• Weak sentences and complicated sentence structures.
