2009 Fourth International Conference on Internet and Web Applications and Services
978-0-7695-3613-2/09 $25.00 © 2009 IEEE, DOI 10.1109/ICIW.2009.50
A Practical System of Domain Ontology Learning Using the Web for Chinese
Fang Tian
Faculty of Engineering, The University of Tokushima
2-1 Minamijosanjima, Tokushima 770-8506, Japan
[email protected]
Peilin Jiang, Fuji Ren
Faculty of Engineering, The University of Tokushima
2-1 Minamijosanjima, Tokushima 770-8506, Japan
{jiang, ren}@is.tokushima-u.ac.jp
Abstract
This paper proposes an ontology learning system model based on a Web search engine and the Protege-OWL API. The model emphasizes an iterative learning approach driven by the extracted instances. We discuss taxonomic and non-taxonomic relationship learning separately, and investigate the importance of learning verb-plus-noun phrases for extracting activity concepts in Chinese. We also propose a relevance measurement algorithm, based on co-occurrence statistics, that extracts relation instances using binary keywords (two keywords in combination). Finally, we build a practical ontology learning system by learning relation instances for a Chinese festival ontology, and test the effectiveness of our method.
1. Introduction
An ontology is a form of concept representation that emphasizes the organization of relations among data. In computer science, the term was defined by T.R. Gruber [1]. Because an ontology describes the relations among data, it supports further information merging and reasoning. The organization of concept relations is therefore important in ontology construction.
In contrast to manual ontology construction, automatic methods for ontology learning and population have been proposed in the recent literature [2][3]. Ontology learning is defined as the set of methods and techniques used for building an ontology from scratch, or for enriching or adapting an existing ontology, in a semi-automatic fashion using several sources [3], such as text, databases, or the Web.
Ontology learning is driven by information extraction (IE) and data mining technology for domain knowledge acquisition. In the last few years, IE has used the whole Web as an extraction corpus [4], taking advantage of the available Web search engines and of access to massive amounts of up-to-date information.
Relation extraction has been promoted by the ACE (Automatic Content Extraction) program; its task is to find semantic relations between two entities. One of the goals of Semantic Web research is to represent knowledge in an ontology that can be shared and used by many applications. Relation extraction and other IE technologies provide good support for automatic ontology learning.
In this paper, we present the architecture of an ontology learning system for Chinese, discuss a method that extracts relation instances, and introduce a procedural representation for loading the results in OWL (Web Ontology Language). As an experiment, we extract custom (tradition) instances for existing festival instances.
The article is organized as follows. Section 2 discusses the extraction of different relational concepts and the organization of relations between extracted concepts in ontology learning, and proposes an ontology learning system model based on a Web search engine and the Protege-OWL API. Section 3 introduces statistical algorithms in ontology learning and proposes a relevance algorithm that uses binary keywords to learn relation instances. Section 4 presents an experiment on learning relation instances for a Chinese festival ontology. Section 5 evaluates and discusses the results of the experiment. Finally, Section 6 gives the conclusions and outlines directions for further work.
2. Relations Learning
2.1. Ontology Relations and Ontology Learning
Entity relationship extraction may be divided into two types according to the order in which relations are determined. One discovers named entity pairs after the relationship has been determined; the other discovers the relations for given named entity pairs. Currently, relation extraction in ontology learning mainly determines the relations first: an ontology first organizes concepts into classes and then organizes the concept relations between classes. According to the analyses in [5][7], there are three core subtasks in ontology learning: lexical entry extraction (also called concept extraction), taxonomy extraction, and non-taxonomic relation extraction. Consequently, ontology relationship learning includes taxonomic relationship learning, which discovers hierarchies of lexical entries, and non-taxonomic relationship learning, which organizes the relations among lexical entries.
Taxonomic relationship learning focuses on the automatic acquisition of instances through hierarchical relationships [4]. Adding these relationships creates a hierarchical taxonomy with relations such as is-a and is-a-subclass-of, which mainly hold between an individual and a class. Non-taxonomic relationship learning focuses on the links between instances of two classes [4]. In an ontology, non-taxonomic relationships are labeled as object properties between two classes, such as “eat of”, “is eaten of”, “cause of”, and “part of”. In domain ontologies, the primary way of organizing a non-taxonomic relationship is to use a verb to express the relation between two classes [4][5].
Normally, taxonomic relationship learning precedes non-taxonomic relationship learning. In this paper, we propose a method for learning relation instances [6] that extracts the relation instances of existing instances for a domain ontology. The aim of learning relation instances is to extract instances of a non-taxonomic relation that is predefined in the ontology (for example, extracting the capital of a country), and to complete the ontology learning process iteratively, driven by the extracted instances.
2.2. System Architecture
We propose an ontology learning system that acquires relational entities from the Web and represents them in OWL using Protege. Protege is an open-source ontology editor and knowledge-base framework developed by Stanford Medical Informatics. Fig. 1 shows the architecture of our system, which uses keyword-based Web search engines and the Protege-OWL API.
The system is composed of four parts. The first part defines and acquires domain keywords based on the domain ontology, and uses the keywords for retrieval through the Web search engine API. The second part performs the basic ontology learning, which includes discovering concepts, organizing relationships, or extracting concepts of predefined relations using IE techniques. The third part refines the extracted instances and relations, for example by restriction representation and synonymy treatment, through post-processing of the ontology by domain knowledge experts. The last part loads and represents the obtained results in the standard ontology language OWL using Protege (see footnote 1), in order to ease reuse and interoperability.
A database serves as temporary storage for the IE process in the ontology learning system. The system uses the acquired ontology instances as new retrieval keywords in an iterative way, bootstrapping the ontology learning to find more relational instances.
[Figure 1: block diagram with the components Web, Web Search Engine API, Database, Predefinitions of domain keywords, Extraction of Concepts and Relations, Refining of domain knowledge, Ontology loading, Jena and Protégé-OWL API, and Domain Ontology]
Figure 1. Architecture of Ontology Learning System
3. Ontology Relation Learning Method
Ontology learning mainly discovers instances, and then learns and organizes the relationships among them automatically. Currently, the primary methods of taxonomic learning for English are the lexico-syntactic pattern-based approach presented by Hearst (1992) [8] and statistical measures based on co-occurrence frequency [9]. For non-taxonomic relationship learning, one family of methods consists of pattern-based and pattern-learning approaches such as DIPRE (1999) [13] and [14]; another finds the degree of relationship between a pair of concepts and the given domain relation words based on co-occurrence statistics [5]. This paper presents a novel method that extracts relation instances for a domain ontology based on syntactic patterns and calculates the relevance between each relation instance and the relation keywords. Instead of one relation keyword, we use binary keywords together with the advanced search features of a Web search engine, which makes our relevance measurement for relation instances efficient.
3.1. Syntactic Patterns of Filtering
Taxonomic relation instances are basically found by scanning text with lexico-syntactic patterns [8]. These patterns summarize the most common ways of expressing specializations in English, such as “NP {,} including {NP,}* {or|and} NP” (NP: noun phrase). Their advantage is that they yield the taxonomic hierarchy of terms. However, the quality of lexical-pattern-based extraction can be compromised by the problems of decontextualisation and ellipsis. Another pattern-based approach for detecting specialisations uses only noun phrases (NP) [4].
Footnote 1: http://protege.stanford.edu/
We discovered that, in Chinese, part of the activity information is mainly expressed by a verb plus noun phrase (V+NP), a verb (V), or a verb-noun (VN), that is, a verb with the properties of a noun. Examples are “FangBianPao (light fireworks)” and “BaiNian (pay a New Year call)”. These expressions are organized in a relatively fixed way and rarely lead to ambiguity.
In this paper, we extract activity entities using the pattern-based approach that detects specializations through V+NP, V, or VN. We choose this latter kind of pattern-based approach because it has no special lexical restrictions, so the recalled information is relatively complete.
3.2. Statistics Based Relevance
A typical co-occurrence score between an initial word (problem) and a related candidate concept (choice), presented by Turney [9], is shown in equation 1. The relevance score is derived from PMI (pointwise mutual information) analysis. Here, hits(problem AND choice_i) is the number of documents that contain both problem and choice_i, and hits(choice_i) is the number of documents that contain choice_i when it is queried on its own. The ratio of these two numbers is the score for choice_i.
$$\mathrm{Score}(choice_i) = \frac{\mathrm{hits}(problem \ \mathrm{AND} \ choice_i)}{\mathrm{hits}(choice_i)} \qquad (1)$$
Our relevance algorithm builds on equation 1, but measures each choice against binary problems, that is, against two keywords. Formula 2 gives the degree of relationship between a candidate relation instance (Candidate) and each domain keyword (Keyword), which can be measured through a combination of queries to a Web search engine.
$$\mathrm{Score}_i(Candidate_j) = \frac{|Keyword_i \cap Candidate_j|}{|Candidate_j|} \qquad (2)$$
Keyword_i (i = 1, 2) denotes the proposed domain keywords. |Keyword_i ∩ Candidate_j| denotes the value of totalResultsAvailable for the query containing both Keyword_i and Candidate_j (j = 1, 2, ..., n), and |Candidate_j| denotes the value of totalResultsAvailable for the query containing only the candidate relation instance. The ratio of these two values is the score of the candidate relation instance. totalResultsAvailable is the number of query matches in the database of the Web search service.
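As a minimal sketch of formula 2, the ratio can be computed from two hit counts as shown below. The class and method names are hypothetical, the hit counts in the example are invented for illustration, and returning 0 when the candidate has no hits is an assumed convention that the paper does not specify.

```java
/** Sketch of formula 2: Score_i(Candidate_j) = |Keyword_i AND Candidate_j| / |Candidate_j|. */
final class RelevanceScore {
    private RelevanceScore() {}

    /**
     * @param jointHits     hit count of the query containing both keyword and candidate
     * @param candidateHits hit count of the query containing the candidate alone
     * @return the ratio of the two counts, or 0.0 when the candidate has no hits
     */
    static double score(long jointHits, long candidateHits) {
        if (jointHits < 0 || candidateHits < 0) {
            throw new IllegalArgumentException("hit counts must be non-negative");
        }
        if (candidateHits == 0) {
            return 0.0; // assumed convention to avoid division by zero
        }
        return (double) jointHits / candidateHits;
    }

    public static void main(String[] args) {
        // Hypothetical hit counts, for illustration only.
        System.out.println(score(550, 1000));
    }
}
```

In a full system the two counts would come from the totalResultsAvailable field returned by the search service for the two quoted queries.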
The domain keywords in our work are composed of a determined class instance and determined relations. In our proposed ontology system, domain keywords can be selected automatically from the domain ontology under construction using SPARQL, a W3C Recommendation [11].
The experiment shows that the search results are more refined and accurate when the advanced options of the Web search API are used, such as delimiting an exact search with double quotation marks (“...”). Our approach therefore creates queries that include Keyword_i and Candidate_j, or Candidate_j alone, as a simple sentence or phrase enclosed in double quotation marks, and obtains totalResultsAvailable for each query from the Web.
For all candidate relation instances scored by formula 2, we compute an average value AVG(Score_i) by formula 3 for each Keyword_i. We use k_i · AVG(Score_i) as the threshold value for relation instances, where k_i (0 ≤ k_i ≤ 1) denotes a constraint factor related to the specific domain. In this article, we obtain two thresholds, one per keyword, from the same candidates. If a candidate satisfies both thresholds at the same time, we extract it as a relation instance.
$$\mathrm{AVG}(\mathrm{Score}_i) = \frac{1}{n} \sum_{j=1}^{n} \mathrm{Score}_i(Candidate_j) \qquad (3)$$
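The two-threshold test can be sketched as follows, assuming that AVG is the arithmetic mean over the n candidates and that a candidate must score strictly above both thresholds. The class name, method names, and the sample scores and factors are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of the threshold test: keep candidate j if Score_i(j) > k_i * AVG(Score_i) for i = 1, 2. */
final class ThresholdExtraction {
    private ThresholdExtraction() {}

    /** Arithmetic mean of the scores; 0.0 for an empty array (assumed convention). */
    static double average(double[] scores) {
        if (scores.length == 0) {
            return 0.0;
        }
        double sum = 0.0;
        for (double s : scores) {
            sum += s;
        }
        return sum / scores.length;
    }

    /** Returns the indices of the candidates that exceed both thresholds. */
    static List<Integer> extract(double[] score1, double[] score2, double k1, double k2) {
        if (score1.length != score2.length) {
            throw new IllegalArgumentException("both score arrays must cover the same candidates");
        }
        double threshold1 = k1 * average(score1);
        double threshold2 = k2 * average(score2);
        List<Integer> accepted = new ArrayList<>();
        for (int j = 0; j < score1.length; j++) {
            if (score1[j] > threshold1 && score2[j] > threshold2) {
                accepted.add(j);
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        // Hypothetical scores for three candidates and hypothetical factors k1 = k2 = 1.0.
        double[] score1 = {0.5, 0.01, 0.4};
        double[] score2 = {0.6, 0.9, 0.02};
        System.out.println(extract(score1, score2, 1.0, 1.0));
    }
}
```

Requiring both conditions means that a candidate strongly associated with only one keyword, such as a frequent phrase unrelated to the festival, is rejected.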
4. Experiment and Consideration
In the previous section, we introduced our method for extracting relation instances for domain ontology building. In the experiment, we choose Chinese traditional festivals as the domain ontology. As an example, we extract the related instances of the class Festival_Custom for instances of the class Festival.
4.1. The steps of extraction
The extraction of relation instances is composed of five steps in our system: collection of the extraction resources; linguistic pre-processing; filtering by property rules; relevance calculation; and determination of the threshold and extraction.
Step 1. Collection of the extraction resources. First, we retrieve relevant co-occurrence documents for the two domain keywords from the Web search engine, and save the documents in XML files. In our experiment, an average of 500 web pages is retrieved for each festival through the Yahoo search API (see footnote 2).
Footnote 2: http://developer.yahoo.com/search/web/
The domain keywords for extracting relation instances are composed of a determined class instance and determined relations. In this experiment, Keyword_1 is an instance of the class Festival, for example “DuanWuJie (Dragon Boat Festival)” or “ChunJie (Chinese New Year)”; Keyword_2 is the name “FengSu (Custom)”, which is the superclass of Festival_Custom in the Chinese festival ontology. Each festival is paired with “Custom” to retrieve the relevant information from the Web, giving keyword pairs such as “Dragon Boat Festival” and “Custom”, or “Chinese New Year” and “Custom”.
Step 2. Linguistic pre-processing. Parse the XML files, remove the HTML tags, and obtain the relevant text; split the text into sentences; and segment each sentence into words using ICTCLAS4J (see footnote 3), which is developed by the Chinese Academy of Sciences.
Step 3. Filtering by property rules. From the collected relevant data, extract the co-occurring terms as candidate relation instances using property filtering rules based on the syntactic patterns.
Festival customs mostly involve activities and specific food and drink. We extract the different customs as activity entities by detecting specializations with the V+NP, V, and VN patterns in Chinese. We also find that activity entities expressed by V or VN are generally not monosyllabic words in Chinese; for example, the disyllabic verb “JiZu (Worshipping the Ancestors)” is used instead of the monosyllabic verb “Ji (Worshipping)”.
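The filtering rule of Step 3 can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the input is assumed to be one sentence already segmented and tagged, the tag names "v", "vn", and "n" are assumed, the noun phrase is simplified to a run of consecutive nouns, and one Chinese character is counted as one syllable. All class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of pattern filtering: extract V+NP phrases and multi-syllable V or VN words. */
final class ActivityPatternFilter {
    private ActivityPatternFilter() {}

    /** One segmented word with its part-of-speech tag (assumed tags: "v", "vn", "n"). */
    static final class Token {
        final String word;
        final String tag;

        Token(String word, String tag) {
            this.word = word;
            this.tag = tag;
        }
    }

    /** Returns the candidate activity entities found in one tagged sentence. */
    static List<String> candidates(List<Token> sentence) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < sentence.size()) {
            Token t = sentence.get(i);
            if (t.tag.equals("v")) {
                // Collect the nouns that directly follow the verb as a simplified NP.
                StringBuilder np = new StringBuilder();
                int j = i + 1;
                while (j < sentence.size() && sentence.get(j).tag.equals("n")) {
                    np.append(sentence.get(j).word);
                    j++;
                }
                if (np.length() > 0) {
                    out.add(t.word + np); // V+NP pattern
                    i = j;
                    continue;
                }
                if (isMultiSyllable(t.word)) {
                    out.add(t.word); // bare V, kept only if not monosyllabic
                }
            } else if (t.tag.equals("vn") && isMultiSyllable(t.word)) {
                out.add(t.word); // VN pattern
            }
            i++;
        }
        return out;
    }

    /** Treats each character (code point) as one syllable, which holds for Chinese text. */
    static boolean isMultiSyllable(String word) {
        return word.codePointCount(0, word.length()) > 1;
    }
}
```

For a sentence tagged as Fang/v BianPao/n BaiNian/v Ji/v (written in Chinese characters), the sketch keeps “FangBianPao” and “BaiNian” and drops the monosyllabic “Ji”.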
Step 4. Relevance calculation. Measure the relevance of each candidate relation instance to the two keywords using formula 2.
In our system, the queries for |Keyword_i ∩ Candidate_j| and |Candidate_j| are enclosed in quotation marks, for example “Keyword_i Candidate_j”. By the syntactic analysis of a simple sentence, S ⇒ NP + VP (VP ⇒ V + NP). We can treat Keyword_1, an instance of the Festival class, as [NP Subject] and join it with Candidate_j (V + [NP Object]) into a sentence or phrase that is submitted to the Web search engine. Because there is a hierarchical taxonomy between Candidate_j and Keyword_2 “FengSu (Custom)”, we can use “De” (similar to “of” in English) as the link between Candidate_j and Keyword_2.
As an example of a query composed of Candidate_j and Keyword_i, the festival keyword “Chinese New Year” is joined with the candidate custom instance “Light Firework”, forming “Chinese New Year Light Firework” in Chinese syntax. The query for the keyword “Custom” joined with “Light Firework” is “Light Firework De Custom” in Chinese.
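The three quoted queries used in Step 4 can be sketched as below. The class and method names are hypothetical, and joining the parts without spaces is an assumption based on Chinese writing conventions; the constant holds the Chinese particle that the paper transliterates as “De”.

```java
/** Sketch of the exact-phrase queries used for the relevance calculation. */
final class RelevanceQueries {
    /** The linking particle transliterated as "De" in the paper. */
    static final String DE = "\u7684";

    private RelevanceQueries() {}

    /** Query for |Keyword_1 AND Candidate|: the festival as subject followed by the activity. */
    static String festivalQuery(String keyword1, String candidate) {
        return quote(keyword1 + candidate);
    }

    /** Query for |Keyword_2 AND Candidate|: the candidate linked to the custom keyword by "De". */
    static String customQuery(String candidate, String keyword2) {
        return quote(candidate + DE + keyword2);
    }

    /** Query for |Candidate|: the candidate on its own. */
    static String candidateQuery(String candidate) {
        return quote(candidate);
    }

    /** Wraps a phrase in double quotation marks to request an exact match. */
    private static String quote(String phrase) {
        return "\"" + phrase + "\"";
    }
}
```

Each returned string would be sent to the search API, and the totalResultsAvailable values of the first two queries would each be divided by that of the third.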
Step 5. Determination of threshold and extraction. In the experiment, we calculate the scores for both Keyword_1 and Keyword_2, denoted by Score_1 and Score_2. The corresponding constraint factors of the threshold values are k_1 and k_2. A candidate relation instance whose relation scores are greater than the corresponding thresholds k_i · AVG(Score_i) is extracted as a relation instance.
Footnote 3: http://code.google.com/p/ictclas4j/
4.2. Relation Instances Instantiation
In the Chinese festival ontology, we extract customs for the relation “has_custom” of instances of Festival. Table 1 presents some examples of extracted relation instances for the Festival instance “Dragon Boat Festival”. We show the names of the extracted relation instances, the relevance scores measured with the two keywords, and the evaluation (1 = ‘correct’, 0 = ‘incorrect’).
There are some noise terms related to the origin or purpose of a custom, such as “BiXie (Drive off Evil Spirits)”, which refers to driving evil spirits and pestilence out of our lives.
Table 1. Samples of extracted relation instances for “Dragon Boat Festival”
Relation Instance                      Score1      Score2    Evaluation
Eat Rice Dumplings (Zongzi)            0.00021     0.00192   1
Commemoration of Qu Yuan               0.01282     0.75991   1
Dragon Boat Race                       0.05692     0.55000   1
Make Rice Dumplings                    0.00022     0.00192   1
Wear Fragrant Sachets                  4.048E-05   0.01724   1
Drink XiongHuang Wine                  0.00508     0.91094   1
Hang Branches of Moxa                  0.00214     0.15071   1
Drive Away the Five Poisonous Pests    0.00196     0.01274   1
Hang Calamus                           0.01098     0.97802   1
Drive off Evil Spirits                 0.00018     0.08076   0
Worshipping the Ancestors              0.00701     0.04008   1
...                                    ...         ...       ...
4.3. Ontology Building by SPARQL
The Protege-OWL API is an open-source Java library for OWL and RDF(S). The API provides classes and methods to load and save OWL files and to query and manipulate OWL data models [12]. We integrate a Jena-based SPARQL query engine into Protege through the Protege-OWL API, and automatically load the extracted instances, in Chinese, into OWL.
As initialization, we defined the classes Custom, Festival_Custom, and Festival, the object property “has_custom” of the Festival class, and some festival instances in the Chinese festival ontology using Protege. Each Festival instance, serving as a domain Keyword_1, is selected from the domain ontology by the “SELECT” command of SPARQL. At the same time, Keyword_2 “FengSu (Custom)” is acquired from the predefined owlModel of the ontology.
When the relation instances are inserted into the domain ontology in OWL, each relation instance belongs to its upper class and is also an object property value for the related class. The festival customs in our experiments are therefore not only instances of the Festival_Custom class, but also values of the object property “has_custom” of Festival instances. Fig. 2 shows the process of ontology loading by SPARQL. Fig. 3 shows part of the resulting representation in OWL for “Dragon Boat Festival”.
// owlModel is assumed to be a Jena OntModel; the namespace path is elided in the original
String ns = "http://www.owl-ontologies.com/.../ChineseF.owl";
// Get the relation class Festival_Custom
Resource Festival_Custom = owlModel.getResource(ns + "#Festival_Custom");
// Get the relation (object property) has_custom
ObjectProperty has_custom = owlModel.getObjectProperty(ns + "#has_custom");
// Visit each instance of the Festival class; n is the number of instances
for (int i = 0; i < n; i++) {
    // Get the i-th festival instance; festivalName[i] is assumed to hold its local name
    Individual festival = owlModel.getIndividual(ns + "#" + festivalName[i]);
    // Load each relation instance from the festival's DB;
    // m is the number of instances extracted for this festival
    for (int j = 0; j < m; j++) {
        // Insert the instance into the Festival_Custom class
        Individual custom = owlModel.createIndividual(ns + "#" + instance[j], Festival_Custom);
        // Add it as a value of the object property has_custom of the festival
        festival.addProperty(has_custom, custom);
    }
}
Figure 2. Loading the relation instances by SPARQL
5. Evaluation and Discussion
In this section, we evaluate the experimental results for four festivals. We extracted 71 relation instances from 4858 candidate instances that had been filtered with the syntactic patterns. The precision, recall, and F-measure obtained by our method are shown in Table 2. Precision is the ratio of correct extracted relation instances to all extracted relation instances, and recall is the ratio of correct extracted relation instances to the correct relation instances collected manually.
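The paper does not state its F-measure formula. Assuming the usual balanced F-measure, the harmonic mean of precision P and recall R, the Chongyang Festival row of Table 2 (P = 69.2%, R = 100%) can be checked as a worked example:

$$F = \frac{2PR}{P + R}, \qquad F_{\mathrm{Chongyang}} = \frac{2 \times 0.692 \times 1.000}{0.692 + 1.000} = \frac{1.384}{1.692} \approx 0.818$$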
The average F-measure over the four festivals is 0.854, so the extracted instances represent the common Festival_Custom instances well. The results are calculated against festival customs extracted manually from the Web.
In the experiment, we obtained effective extraction results using a general Web search engine and simple syntax analysis. However, if we had not used a simple sentence or a phrase enclosed in double quotation marks (“...”) as the Web query, the average extraction precision of our algorithm would have been no more than 53%.
<?xml version="1.0"?>
<rdf:RDF
…
xmlns="http://www.owl-ontologies.com/2008/2/10/ChineseF.owl#"
…
<owl:Class rdf:ID="Festival_Custom"/>
<owl:Class rdf:ID="Festival"/>
<owl:ObjectProperty rdf:ID="has_custom">
<rdfs:domain rdf:resource="#Festival"/>
<rdfs:range rdf:resource="#Festival_Custom"/>
</owl:ObjectProperty>
<Festival_Custom rdf:ID=" "/> // " "Wear Fragrant Sachets
<Festival_Custom rdf:ID=" "/> // " "Eat Rice Dumplings(Zongzi)
…
<Festival rdf:ID="Dragon Boat Festival">
<has_custom rdf:resource="# "/>
<has_custom rdf:resource="# "/>
…
</Festival>
</rdf:RDF>
…
Figure 3. An example representation of the extracted relation instances for “Dragon Boat Festival” in OWL
Table 2. Precision, recall, and F-measure of extraction for relation instances of four festivals
Festival Instance       Precision(%)   Recall(%)   F-measure
Dragon Boat Festival    91.3           80.7        0.856
Lantern Festival        84.2           76.1        0.799
Qingming Festival       93.7           88.2        0.908
Chongyang Festival      69.2           100         0.818
Table 3 shows the average precision, recall, and F-measure of the results extracted using the keyword combination or each keyword separately, for the festivals shown in Table 2. In our approach to extracting relation instances, an extracted instance is not only an instance of one class but also has a relationship with another class; the results are therefore relevant to the information of both classes. Using binary keywords gives better performance than using only one keyword, and the experimental results demonstrate the effectiveness of the proposed method.
Table 3. The average precision, recall, and F-measure of extraction for relation instances by different keyword number
Keyword         Precision(%)   Recall(%)   F-measure
Two Keywords    85.6           86.2        0.854
Keyword1        86.1           72.3        0.785
Keyword2        79.3           36.7        0.501
6. Conclusions and Further Research
We proposed an ontology learning system based on a keyword-based Internet search engine and the Protege-OWL API, and presented a practical ontology learning system that learns relation instances for a Chinese festival ontology. The system acquires information from the Web and builds the ontology in the standard ontology language OWL using Protege. We also described the procedure for automatically loading the extracted instances into OWL by SPARQL.
We described two stages of learning relation instances: filtering concepts from the Web based on syntactic patterns, and calculating the relevance of the filtered candidate instances. Our method has three distinguishing features. First, the extraction rules for activity entities follow a pattern-based approach that detects specializations using V+NP, V, or VN alone; because there are no special lexical restrictions, the recalled results are relatively complete. Second, each candidate relation instance is measured against the binary domain keywords. Finally, the queries for computing relevance use linking words such as “De” and are enclosed in double quotation marks, which improves the accuracy of the extraction.
In the future, we will first introduce deeper syntactic properties to obtain new knowledge. For example, we can divide the extracted customs in the Custom class into eating and drinking customs and other activity customs, and derive the names of the foods related to each festival to enrich the domain ontology. We will then extract relation instances for independent domain ontologies, and extract the relations for existing instances automatically.
Acknowledgments.
This research has been partially supported by the Ministry
of Education, Science, Sports and Culture, Grant-in-Aid for
Scientific Research, 19300029.
References
[1] T.R. Gruber, Towards Principles for the Design of Ontologies Used for Knowledge Sharing, In Proceedings of the International Workshop on Formal Ontology, pp. 907-928, Padova, Italy, August 1993.
[2] R. Navigli, P. Velardi, Learning Domain Ontologies from Document Warehouses and Dedicated Web Sites, Computational Linguistics, MIT Press, 30(2):151-179, June 2004.
[3] A. Gomez-Perez, D. Manzano-Macho, A survey of ontology learning methods and techniques, OntoWeb project, Deliverable D.1.5, June 2003.
[4] R.D. Sanchez, Domain Ontology Learning from Web, http://www.tdx.cbuc.es/TESIS_UPC/AVAILABLE/TDX-0122108-103125, 2007.
[5] M. Kavalec, A. Maedche, and V. Svatek, Discovery of Lexical Entries for Non-taxonomic Relations in Ontology Learning, In Proceedings of SOFSEM 2004, Springer Berlin / Heidelberg, LNCS 2932, pp. 249-256, 2004.
[6] V.D. Boer, M.V. Someren, and B.J. Wielinga, Relation Instantiation for Ontology Population using the Web, In Proceedings of the 29th Annual German Conference on Artificial Intelligence, KI 2006, Springer Berlin / Heidelberg, pp. 202-213, Bremen, Germany, June 2006.
[7] A. Maedche, S. Staab, Ontology Learning for the Semantic Web, IEEE Intelligent Systems, IEEE Educational Activities Department, 16(2):72-79, March 2001.
[8] M.A. Hearst, Automatic acquisition of hyponyms from large text corpora, In Proceedings of the 14th International Conference on Computational Linguistics (COLING-92), Nantes, France, pp. 539-545, 1992.
[9] P.D. Turney, Mining the Web for synonyms: PMI-IR versus LSA on TOEFL, In Proceedings of the Twelfth European Conference on Machine Learning, Freiburg, Germany, pp. 491-499, 2001.
[10] M.K. Smith, C. Welty, and D.L. McGuinness, OWL Web Ontology Language Guide, W3C Recommendation, 10 February 2004, http://www.w3.org/TR/2004/REC-owl-guide-20040210/#FeatureList.
[11] E. Prud'hommeaux, A. Seaborne, SPARQL Query Language for RDF, W3C Recommendation, 15 January 2008, http://www.w3.org/TR/rdf-sparql-query/.
[12] H. Knublauch, Protege staff members, Protege-OWL API Programmer's Guide, http://protege.stanford.edu/plugins/owl/api/guide.html, last updated September 21, 2006.
[13] S. Brin, Extracting patterns and relations from the world wide web, Lecture Notes in Computer Science, 1590:172-183, 1999.
[14] S. Sahay, S. Mukherjea, E. Agichtein, E.V. Garcia, S.B. Navathe, and A. Ram, Discovering semantic biomedical relations utilizing wide web, ACM Trans. Knowl. Discov. Data, ACM 1556-4681, 2(1):1-15, March 2008.