IEEE 2009 Fourth International Conference on Internet and Web Applications and Services

A Practical System of Domain Ontology Learning Using the Web for Chinese

Fang Tian
Faculty of Engineering, The University of Tokushima
2-1 Minamijosanjima, Tokushima 770-8506, Japan
[email protected]

Peilin Jiang, Fuji Ren
Faculty of Engineering, The University of Tokushima
2-1 Minamijosanjima, Tokushima 770-8506, Japan
{jiang, ren}@is.tokushima-u.ac.jp

Abstract

This paper proposes an ontology learning system model based on a Web search engine and the Protege-OWL API, which emphasizes an iterative learning approach driven by the extracted instances. We discuss taxonomic and non-taxonomic relationship learning separately within the ontology learning system, and investigate the importance of verb plus noun phrase learning for extracting activity concepts in Chinese. We also propose a relevance measurement algorithm that extracts relation instances using binary keywords (a pair of keywords) based on co-occurrence statistics. Finally, we build a practical ontology learning system by learning relation instances of a Chinese festival ontology, and test the effectiveness of our method.

1. Introduction

An ontology is a form of concept representation that emphasizes the organization of relations among data. In computer science, the term was defined by T.R. Gruber [1]. An ontology describes the relations among data, which supports further information merging and reasoning. Therefore, organizing concept relations is important in ontology construction.

In contrast to manual ontology construction, automatic methods for ontology learning and population have been proposed in the recent literature [2][3]. Ontology learning is defined as the set of methods and techniques used for building an ontology from scratch, or for enriching or adapting an existing ontology, in a semi-automatic fashion using several sources [3], such as text, databases, or the Web.

Ontology learning is driven by information extraction (IE) and data mining technology for acquiring domain knowledge. In the last few years, IE has been using the whole Web as an extraction corpus [4]. It takes advantage of the available Web search engines and the possibility of accessing massive amounts of up-to-date information.

Relation extraction has been promoted by the ACE (Automatic Content Extraction) program. Its task is to find semantic relations between two entities. One of the goals of Semantic Web research is to represent knowledge in an ontology that can be shared and used by many applications. Relation extraction and other IE technologies provide good support for automatic ontology learning.

In this paper, we present an architecture of an ontology learning system for Chinese, discuss a method that extracts relation instances, and introduce a procedural representation for loading the results in OWL (Web Ontology Language). As an experiment, we extract custom (tradition) instances for existing festival instances.

The article is organized as follows. Section 2 discusses the extraction of different relational concepts and the organization of relations between extracted concepts in ontology learning, and proposes an ontology learning system model based on an Internet search engine and the Protege-OWL API. Section 3 introduces statistical algorithms in ontology learning, and proposes a relevance algorithm that uses binary keywords for learning relation instances. Section 4 presents an experiment on learning relation instances for a Chinese festival ontology. Section 5 evaluates and discusses the results of the experiment. Finally, Section 6 gives the conclusions and outlines directions for further work.

2. Relations Learning

2.1. Ontology Relations and Ontology Learning

Entity relationship extraction may be divided into two types according to the order in which relations are determined. One discovers named entity pairs after the relationship has been determined; the other discovers the relations for given named entity pairs. Currently, relation extraction in ontology learning mainly determines the relations first: an ontology first organizes concepts into classes and then organizes the concept relations between classes. According to the analyses in [5][7], there are three core subtasks in ontology learning: lexical entry extraction (also called concept extraction), taxonomy extraction, and non-taxonomic relation extraction. Consequently, ontology relationship learning includes taxonomic relationship learning, which discovers hierarchical lexical entries, and non-taxonomic relationship learning, which organizes the relations among lexical entries.

978-0-7695-3613-2/09 $25.00 © 2009 IEEE. DOI 10.1109/ICIW.2009.50

Taxonomic relationship learning focuses on the automatic acquisition of instances through hierarchical relationships [4]. Adding these relationships creates a hierarchical taxonomy with relations such as is-a and is-a-subclass-of, which are mainly relations between an individual and a class. Non-taxonomic relationship learning focuses on the links between instances of two classes [4]. Non-taxonomic relationships are labeled as object properties between two classes in an ontology, such as “eat of”, “is eaten of”, “cause of”, “part of”, and so on. In a domain ontology, the primary way of organizing a non-taxonomic relationship is to use a verb to express the relation between two classes [4][5].

Normally, taxonomic relationship learning precedes non-taxonomic relationship learning. In this paper, we propose a method for learning relation instances [6] that extracts relation instances for instances already present in a domain ontology. The aim of learning relation instances is to extract instances of a non-taxonomic relation that is predefined in the ontology (for example, extracting the capital of a country), and to complete the ontology learning process iteratively, driven by the extracted instances.

2.2. System Architecture

We propose an ontology learning system that acquires relational entities from the Web and represents them in OWL using the Protege system. Protege is an open-source ontology editor and knowledge-base framework developed by Stanford Medical Informatics. Fig. 1 shows the architecture of our system, which uses keyword-based Web search engines and the Protege-OWL API.

The system is composed of four parts. The first part defines and acquires domain keywords based on the domain ontology, and uses the keywords for retrieval through the Web search engine API. The second part performs the basic ontology learning, which includes discovering concepts, organizing relationships, or extracting concepts of predefined relations based on IE techniques. The third part refines the extracted instances and relations, for example through restriction representation, synonymy treatment, and other post-processing of the ontology by domain knowledge experts. The last part loads and represents the obtained results in the standard ontology language OWL using Protege (footnote 1), in order to ease reuse and interoperability.

A database serves as temporary data storage for the IE process in the ontology learning system. The system uses the acquired ontology instances as new retrieval keywords in an iterative way, bootstrapping the ontology learning to find more relational instances.

[Figure 1 components: Web; Web Search Engine API; Database; Extraction of Concepts and Relations; Refining of domain knowledge; Domain Ontology; Predefinitions of domain keywords; Jena and Protégé-OWL API; Ontology loading]

Figure 1. Architecture of Ontology Learning System

3. Ontology Relation Learning Method

Ontology learning mainly discovers instances and automatically learns and organizes the relationships among them. Currently, the primary methods of taxonomic learning for English are the lexico-syntactic pattern-based approach presented by Hearst (1992) [8] and statistical measures based on co-occurrence frequency [9]. In non-taxonomic relationship learning, one family consists of pattern-based and pattern-learning approaches such as DIPRE (1999) [13] and [14]; another finds the degree of relationship between a pair of concepts and given domain relation words based on co-occurrence statistics [5]. This paper presents a novel method that extracts relation instances for a domain ontology based on syntactic patterns and calculates the relevance between a relation instance and the relation keywords. Instead of one relation keyword, we use binary keywords (two keywords) together with the advanced search features of a Web search engine, which makes our relevance measurement for relation instances efficient.

3.1. Syntactic Patterns of Filtering

Taxonomic relation instances are typically found by scanning text with lexico-syntactic patterns [8]. These patterns summarize the most common ways of expressing specializations in English, such as “NP{,} including {NP,}* {or|and} NP” (NP: noun phrase). Their advantage is that they yield the taxonomic hierarchy of terms. However, the quality of lexical pattern-based extraction can be compromised by the problems of decontextualisation and ellipsis. Another pattern-based approach for detecting specialisations uses noun phrases only [4].

Footnote 1: http://protege.stanford.edu/

We found that, in Chinese, activity information is mainly expressed by a verb plus noun phrase (V+NP), a verb (V), or a verbal noun (VN, a verb that has the properties of a noun), such as “Fang BianPao (light fireworks)” and “BaiNian (pay a New Year call)”. These expressions are formed in a relatively fixed way and rarely lead to ambiguity.

In this paper, we extract activity entities using the pattern-based approach that detects specializations through V+NP, V, or VN. We choose this latter kind of pattern-based approach because it has no special lexical restrictions, so recall is relatively complete.

3.2. Statistics Based Relevance

A typical co-occurrence score between an initial word (problem) and a related candidate concept (choice), presented by Turney [9], is shown in equation 1. The relevance score is derived from PMI (pointwise mutual information) analysis. Here, hits(problem AND choice_i) is the number of documents that contain both problem and choice_i, and hits(choice_i) denotes the number of documents returned for choice_i alone. The ratio of these two numbers is the score for choice_i.

$$\mathrm{Score}(choice_i) = \frac{\mathrm{hits}(problem\ \mathrm{AND}\ choice_i)}{\mathrm{hits}(choice_i)} \tag{1}$$

Our relevance algorithm builds on equation 1 and measures one choice against a pair of associated problems (the binary keywords). Formula 2 gives the degree of relationship between a candidate relation instance (Candidate) and each domain keyword (Keyword), which can be measured through a combination of queries to a Web search engine.

$$\mathrm{Score}_i(Candidate_j) = \frac{|Keyword_i \cap Candidate_j|}{|Candidate_j|} \tag{2}$$

Keyword_i (i = 1, 2) denotes the proposed domain keywords. |Keyword_i ∩ Candidate_j| denotes the totalResultsAvailable value for a query containing both Keyword_i and Candidate_j (j = 1, 2, ..., n), and |Candidate_j| denotes the totalResultsAvailable value for the candidate relation instance alone. The ratio of these two values is the score of the candidate relation instance. totalResultsAvailable is the number of query matches in the database of the Web search service.

The domain keywords in our work are composed of a determined class instance and a determined relation. In our proposed ontology system, domain keywords can be selected automatically from the domain ontology under construction using SPARQL, the query language standardized by W3C [11].

Our experiment shows that search results are more refined and accurate when the advanced options of Web search APIs are used, such as exact-phrase search delimited by double quotation marks (“...”). Our approach therefore creates queries that include Keyword_i and Candidate_j, or Candidate_j alone, and obtains totalResultsAvailable from the Web by querying a simple sentence or phrase enclosed in double quotation marks.

For all candidate relation instances scored by formula 2, we compute an average value AVG(Score_i) by formula 3 for each Keyword_i. We use k_i · AVG(Score_i) as the threshold for relation instances, where k_i (0 ≤ k_i ≤ 1) denotes a constraint factor related to the specific domain. In this article, we obtain two thresholds from the same candidates, one for each keyword. If a candidate satisfies both thresholds at the same time, we extract it as a relation instance.

$$\mathrm{AVG}(\mathrm{Score}_i) = \frac{1}{n}\sum_{j=1}^{n} \mathrm{Score}_i(Candidate_j) \tag{3}$$
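As a rough illustration of how formulas 2 and 3 and the two thresholds fit together, the following Java sketch scores candidates from hit counts and keeps those that exceed both thresholds. It is a sketch under stated assumptions: the class `RelationInstanceSelector`, the `Hits` holder, and the toy hit counts are hypothetical, formula 3 is read as an arithmetic mean over the n candidates, and the comparison is strict ("greater than").

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class RelationInstanceSelector {
    /** Hit counts for one candidate (hypothetical holder; values would come from search queries). */
    static final class Hits {
        final String candidate;
        final long withKeyword1; // |Keyword_1 ∩ Candidate_j|
        final long withKeyword2; // |Keyword_2 ∩ Candidate_j|
        final long alone;        // |Candidate_j|

        Hits(String candidate, long withKeyword1, long withKeyword2, long alone) {
            this.candidate = candidate;
            this.withKeyword1 = withKeyword1;
            this.withKeyword2 = withKeyword2;
            this.alone = alone;
        }
    }

    /** Formula 2: joint hits divided by the candidate's own hits (0 if it has none). */
    static double score(long joint, long alone) {
        return alone > 0 ? (double) joint / alone : 0.0;
    }

    /** Formula 3, read here as the arithmetic mean of the scores. */
    static double average(double[] scores) {
        double sum = 0.0;
        for (double s : scores) {
            sum += s;
        }
        return scores.length == 0 ? 0.0 : sum / scores.length;
    }

    /** Keeps candidates whose two scores both exceed k_i * AVG(Score_i). */
    static List<String> select(List<Hits> hits, double k1, double k2) {
        int n = hits.size();
        double[] s1 = new double[n];
        double[] s2 = new double[n];
        for (int j = 0; j < n; j++) {
            s1[j] = score(hits.get(j).withKeyword1, hits.get(j).alone);
            s2[j] = score(hits.get(j).withKeyword2, hits.get(j).alone);
        }
        double t1 = k1 * average(s1);
        double t2 = k2 * average(s2);
        List<String> accepted = new ArrayList<>();
        for (int j = 0; j < n; j++) {
            if (s1[j] > t1 && s2[j] > t2) {
                accepted.add(hits.get(j).candidate);
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        // Toy hit counts, not taken from the experiment.
        List<Hits> hits = Arrays.asList(
                new Hits("c1", 50, 80, 100),
                new Hits("c2", 1, 1, 100),
                new Hits("c3", 30, 2, 100));
        System.out.println(select(hits, 1.0, 1.0)); // prints [c1]
    }
}
```

In the toy data, c3 passes the first threshold but fails the second, which shows why requiring both thresholds is stricter than using either keyword alone.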

4. Experiment and Consideration

In the previous section, we introduced our method for extracting relation instances for domain ontology building. In the experiment, we choose traditional Chinese festivals as the domain of the ontology. As an example, we extract the related instances of the class Festival_Custom for instances of the class Festival.

4.1. The steps of extraction

In our system, the extraction of relation instances is composed of five steps: collection of the extraction resources; linguistic pre-processing; filtering by property rules; relevance calculation; and determination of the threshold and extraction.

Step 1. Collection of the extraction resources. First, we retrieve relevant co-occurrence documents for two domain keywords from the Web search engine, and save the documents in XML files. In our experiment, an average of 500 web pages is retrieved for each festival using the Yahoo search API (footnote 2).

Footnote 2: http://developer.yahoo.com/search/web/

The domain keywords for extracting relation instances are composed of a determined class instance and a determined relation. In this experiment, Keyword_1 is an instance of the class Festival, for example “DuanWuJie (Dragon Boat Festival)” or “ChunJie (Chinese New Year)”; Keyword_2 is the name “FengSu (Custom)”, which is the upper class of Festival_Custom in the Chinese festival ontology. Each festival is paired with “Custom” to retrieve the relevant information from the Web, for example the keyword pairs “Dragon Boat Festival” and “Custom”, or “Chinese New Year” and “Custom”.

Step 2. Linguistic pre-processing. Parse the XML files, remove HTML tags, and obtain the relevant text; divide the text into sentence units; segment each sentence using ICTCLAS4J (footnote 3), which was developed by the Chinese Academy of Sciences.
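The tag removal and sentence splitting in Step 2 can be approximated with a few lines of Java. This is a minimal sketch, not the authors' pre-processing: the class `TextPreprocessor` is hypothetical, the regular expression is a crude substitute for a real HTML parser, and word segmentation with ICTCLAS4J is deliberately left out.

```java
import java.util.ArrayList;
import java.util.List;

final class TextPreprocessor {
    /** Crudely removes tags and collapses whitespace; real pages need an HTML parser. */
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }

    /** Splits after Chinese or Western sentence-final punctuation, keeping the mark. */
    static List<String> splitSentences(String text) {
        List<String> sentences = new ArrayList<>();
        // \u3002 = ideographic full stop, \uff01 / \uff1f = full-width ! and ?
        for (String part : text.split("(?<=[\u3002\uff01\uff1f.!?])")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                sentences.add(trimmed);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        String text = stripTags("<p>a.</p><p>b!</p>");
        System.out.println(splitSentences(text)); // prints [a., b!]
    }
}
```

Replacing tags with a space keeps block boundaries apart but can also separate characters that were adjacent across inline tags, so a production pipeline would probably treat inline and block tags differently.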

Step 3. Filtering by property rules. Extract co-occurring terms from the collected data as candidate relation instances, using property filtering rules based on the syntactic patterns.

Festival customs mostly consist of activities and specific food and drink. We extract the different customs as activity entities by detecting specializations with the V+NP, V, and VN patterns in Chinese. We also find that activity entities expressed by V or VN alone are generally not one-syllable words in Chinese; for example, the two-syllable verb “JiZu (Worshipping the Ancestors)” is used instead of the one-syllable verb “Ji (Worshipping)”.
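The filtering rule of Step 3 can be sketched as a scan over part-of-speech-tagged tokens. The sketch below makes several assumptions that are not in the paper: the class `ActivityPatternFilter` and its `Token` holder are hypothetical, the tag names "v", "vn" and "n" follow a common convention for Chinese taggers, and a noun phrase is approximated by a single noun token.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ActivityPatternFilter {
    /** A segmented word with its part-of-speech tag (tag names are assumed). */
    static final class Token {
        final String word;
        final String tag;

        Token(String word, String tag) {
            this.word = word;
            this.tag = tag;
        }
    }

    /** Returns candidate activity entities matching V+NP, or V / VN of two or more characters. */
    static List<String> candidates(List<Token> sentence) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < sentence.size(); i++) {
            Token t = sentence.get(i);
            boolean nextIsNoun = i + 1 < sentence.size()
                    && sentence.get(i + 1).tag.equals("n");
            boolean verbLike = t.tag.equals("v") || t.tag.equals("vn");
            if (t.tag.equals("v") && nextIsNoun) {
                out.add(t.word + sentence.get(i + 1).word); // V+NP
                i++; // the noun has been consumed
            } else if (verbLike && t.word.codePointCount(0, t.word.length()) >= 2) {
                out.add(t.word); // V or VN alone; one-character words are skipped
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Token> sentence = Arrays.asList(
                new Token("\u653e", "v"),        // Fang (light)
                new Token("\u97ad\u70ae", "n"),  // BianPao (firework)
                new Token("\u796d", "v"),        // Ji (one-syllable verb, skipped)
                new Token("\u62dc\u5e74", "v")); // BaiNian (pay a New Year call)
        // Yields the two candidates Fang BianPao and BaiNian.
        System.out.println(candidates(sentence));
    }
}
```

Counting code points rather than `char` values keeps the one-syllable check correct for characters outside the Basic Multilingual Plane.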

Step 4. Relevance calculation. Measure the relevance of each candidate relation instance to the two keywords using formula 2.

In our system, the queries for |Keyword_i ∩ Candidate_j| and |Candidate_j| are enclosed in quotation marks, for example “Keyword_i Candidate_j”. By the syntactic analysis of a simple sentence, S ⇒ NP + VP (VP ⇒ V + NP). We can treat Keyword_1, an instance of the Festival class, as the subject [NP Subject], and join it with Candidate_j (of the form V + [NP Object]) into a sentence or phrase to be retrieved by the Web search engine. Because the relation between Candidate_j and Keyword_2 “FengSu (Custom)” is a hierarchical (taxonomic) one, we can use “De” (similar to “of” in English) as the link between Candidate_j and Keyword_2.

As an example of a query composed of Candidate_j and Keyword_i, the festival keyword “Chinese New Year” is joined with the candidate custom instance “Light Firework”, forming “Chinese New Year Light Firework” in Chinese syntax. The query for the keyword “Custom” is joined with “Light Firework” as “Light Firework De Custom” in Chinese.
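The three exact-phrase queries used in Step 4 can be assembled as plain strings. In the sketch below, the class `RelevanceQueries` and its method names are hypothetical, the pinyin arguments stand in for the Chinese strings that would actually be sent, and the parts are concatenated without spaces on the assumption that the Chinese phrase is written without word separators.

```java
final class RelevanceQueries {
    /** Wraps a phrase in double quotation marks for exact-phrase search. */
    static String quote(String phrase) {
        return "\"" + phrase + "\"";
    }

    /** Keyword_1 (festival, subject NP) directly followed by the candidate (V+NP). */
    static String festivalQuery(String festival, String candidate) {
        return quote(festival + candidate);
    }

    /** Candidate linked to Keyword_2 through a linking word such as "De". */
    static String customQuery(String candidate, String linker, String custom) {
        return quote(candidate + linker + custom);
    }

    /** The candidate alone, used for the denominator |Candidate_j|. */
    static String candidateQuery(String candidate) {
        return quote(candidate);
    }

    public static void main(String[] args) {
        // Pinyin placeholders; real queries would use the Chinese characters.
        System.out.println(festivalQuery("ChunJie", "FangBianPao"));       // prints "ChunJieFangBianPao"
        System.out.println(customQuery("FangBianPao", "De", "FengSu"));   // prints "FangBianPaoDeFengSu"
        System.out.println(candidateQuery("FangBianPao"));                // prints "FangBianPao"
    }
}
```

Each printed line includes the surrounding double quotation marks, which is what turns the query into an exact-phrase search.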

Step 5. Determination of threshold and extraction. In the experiment we calculate the scores for both Keyword_1 and Keyword_2, denoted Score_1 and Score_2. The corresponding constraint factors of the thresholds are k_1 and k_2. A candidate relation instance whose scores are both greater than the corresponding thresholds k_i · AVG(Score_i) is extracted as a relation instance.

Footnote 3: http://code.google.com/p/ictclas4j/

4.2. Relation Instances Instantiation

In the Chinese festival ontology, we extract customs for the relation “has_custom” of Festival instances. Table 1 presents some examples of extracted relation instances for the Festival instance “Dragon Boat Festival”. We show the names of the extracted relation instances, the relevance scores measured with the two keywords, and the evaluation (1 = ‘correct’, 0 = ‘incorrect’).

There are some noise terms related to the origin or purpose of a custom, such as the term “BiXie (Drive off Evil Spirits)”, which refers to driving off evil spirits and pestilence from our lives.

Table 1. Samples of extracted relation instances for “Dragon Boat Festival”

Relation Instances                     Score1      Score2    Evaluation
Eat Rice Dumplings (Zongzi)            0.00021     0.00192   1
Commemoration of Qu Yuan               0.01282     0.75991   1
Dragon Boat Race                       0.05692     0.55000   1
Make Rice Dumplings                    0.00022     0.00192   1
Wear Fragrant Sachets                  4.048E-05   0.01724   1
Drink XiongHuang Wine                  0.00508     0.91094   1
Hang Branches of Moxa                  0.00214     0.15071   1
Drive Away the Five Poisonous Pests    0.00196     0.01274   1
Hang Calamus                           0.01098     0.97802   1
Drive off Evil Spirits                 0.00018     0.08076   0
Worshipping the Ancestors              0.00701     0.04008   1
...                                    ...         ...       ...

4.3. Ontology Building by SPARQL

The Protege-OWL API is an open-source Java library for OWL and RDF(S). The API provides classes and methods to load and save OWL files and to query and manipulate OWL data models [12]. We integrate a Jena-based SPARQL query engine into Protege through the Protege-OWL API, and automatically load the extracted instances, in Chinese, into OWL.

As initialization, we defined the classes Custom, Festival_Custom and Festival, the object property “has_custom” of the Festival class, and some festival instances in the Chinese festival ontology using Protege. Each Festival instance, serving as domain Keyword_1, is selected from the domain ontology by the “SELECT” command of SPARQL. At the same time, Keyword_2 “FengSu (Custom)” is acquired from the predefined owlModel of the ontology.
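A SPARQL “SELECT” query that lists the instances of a class might look like the string built below. This is only a sketch: the class `FestivalQuery`, the variable name `?instance`, and the namespace `http://example.org/ChineseF.owl` are placeholders chosen for illustration, and the query would still have to be handed to whichever SPARQL engine the system uses.

```java
final class FestivalQuery {
    /** Builds a SPARQL SELECT query returning all instances of ns#className. */
    static String selectInstances(String ns, String className) {
        return "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n"
             + "SELECT ?instance\n"
             + "WHERE { ?instance rdf:type <" + ns + "#" + className + "> }";
    }

    public static void main(String[] args) {
        // The namespace below is a placeholder, not the ontology's real URI.
        System.out.println(selectInstances("http://example.org/ChineseF.owl", "Festival"));
    }
}
```

Each binding of `?instance` returned by such a query would then serve as one Keyword_1 value for the retrieval step.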

When relation instances are inserted into the domain ontology in OWL, each relation instance belongs to the upper class and is also an object property value for the relation class. The festival customs in our experiments are not only instances of the Festival_Custom class, but also values of the object property “has_custom” of Festival instances. Fig. 2 shows the process of ontology loading by SPARQL. Fig. 3 shows part of the resulting representation in OWL for “Dragon Boat Festival”.

    // Assumption: owlModel is a Jena OntModel holding the Chinese festival ontology;
    // Resource, ObjectProperty and Individual are the corresponding Jena types.

    // Get the relation class Festival_Custom
    String ns = "http://www.owl-ontologies.com/.../ChineseF.owl";
    Resource festivalCustom = owlModel.getResource(ns + "#Festival_Custom");

    // Get the relation "has_custom"
    ObjectProperty hasCustom = owlModel.getObjectProperty(ns + "#has_custom");

    // n is the number of instances of the Festival class
    for (int i = 0; i < n; i++) {
        // Get one festival instance; festivalName[i] is an assumed array of instance names
        Individual festival = owlModel.getIndividual(ns + "#" + festivalName[i]);

        // Load each relation instance from the festival's DB;
        // m is the number of extracted instances for this festival
        for (int j = 0; j < m; j++) {
            // Insert the instance into the Festival_Custom class
            Individual custom = owlModel.createIndividual(ns + "#" + instance[j], festivalCustom);

            // Insert the value of the object property has_custom of the festival
            festival.addProperty(hasCustom, custom);
        }
    }

Figure 2. Loading the relation instances by SPARQL

5. Evaluation and Discussion

In this section, we evaluate the experimental results for four festivals. We extracted 71 relation instances from 4858 candidate instances that had been filtered by the syntactic patterns. The precision, recall and F-measure obtained by our method are shown in Table 2, where precision is the ratio of correctly extracted relation instances to all extracted relation instances, and recall is the ratio of correctly extracted relation instances to the correct relation instances collected manually.
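For reference, the three measures can be computed as in the Java sketch below. It assumes the F-measure is the usual harmonic mean of precision and recall (the paper does not state the formula explicitly), and the class `ExtractionMetrics` and its parameter names are hypothetical.

```java
import java.util.Locale;

final class ExtractionMetrics {
    /** Correctly extracted instances divided by all extracted instances. */
    static double precision(int correctExtracted, int extracted) {
        return extracted == 0 ? 0.0 : (double) correctExtracted / extracted;
    }

    /** Correctly extracted instances divided by the manually collected correct instances. */
    static double recall(int correctExtracted, int manuallyCollected) {
        return manuallyCollected == 0 ? 0.0 : (double) correctExtracted / manuallyCollected;
    }

    /** Harmonic mean of precision and recall (assumed definition of the F-measure). */
    static double fMeasure(double p, double r) {
        return p + r == 0.0 ? 0.0 : 2.0 * p * r / (p + r);
    }

    public static void main(String[] args) {
        // Precision 84.2% and recall 76.1% are the Lantern Festival values in Table 2.
        System.out.println(String.format(Locale.ROOT, "%.3f", fMeasure(0.842, 0.761))); // prints 0.799
    }
}
```

With the rounded percentages from Table 2 as input, this definition reproduces the reported F-measures to within about 0.001; small differences are expected because the table's inputs are themselves rounded.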

The average F-measure over the four festivals is 0.854, and the relation instances extracted for the Festival instances cover the common Festival_Custom instances. The results are calculated against festival customs extracted manually from the Web.

In the experiment, we obtained effective extraction results using a general Web search engine and simple syntax analysis. However, if we had not queried the Web with a simple sentence or phrase enclosed in double quotation marks (“...”), the average extraction precision of our algorithm would have been no more than 53%.

    <?xml version="1.0"?>
    <rdf:RDF
        …
        xmlns="http://www.owl-ontologies.com/2008/2/10/ChineseF.owl#"
        …
      <owl:Class rdf:ID="Festival_Custom"/>
      <owl:Class rdf:ID="Festival"/>
      <owl:ObjectProperty rdf:ID="has_custom">
        <rdfs:domain rdf:resource="#Festival"/>
        <rdfs:range rdf:resource="#Festival_Custom"/>
      </owl:ObjectProperty>
      <Festival_Custom rdf:ID=" "/> // " " Wear Fragrant Sachets
      <Festival_Custom rdf:ID=" "/> // " " Eat Rice Dumplings(Zongzi)
      …
      <Festival rdf:ID="Dragon Boat Festival">
        <has_custom rdf:resource="# "/>
        <has_custom rdf:resource="# "/>
        …
      </Festival>
    </rdf:RDF>
    …

Figure 3. An example representation of the extracted relation instances for “Dragon Boat Festival” in OWL

Table 2. Precision, recall, and F-measure of extraction for relation instances of four festivals

Festival Instances      Precision(%)   Recall(%)   F-measure
Dragon Boat Festival    91.3           80.7        0.856
Lantern Festival        84.2           76.1        0.799
Qingming Festival       93.7           88.2        0.908
Chongyang Festival      69.2           100         0.818

Table 3 shows the average precision, recall, and F-measure of the results extracted using the combined keywords or each keyword separately, for the festivals shown in Table 2. As proposed for the extraction of relation instances, an extracted instance is not only an instance of a class but also has a relationship with another class; the results are therefore relevant to the information of two classes. Using binary keywords yields a higher F-measure (0.854) than using either keyword alone (0.785 and 0.501). The experimental results demonstrate the effectiveness of the proposed method.

Table 3. The average precision, recall, and F-measure of extraction for relation instances by different keyword number

Keyword         Precision(%)   Recall(%)   F-measure
Two Keywords    85.6           86.2        0.854
Keyword1        86.1           72.3        0.785
Keyword2        79.3           36.7        0.501

6. Conclusions and Further Research

We proposed an ontology learning system based on keyword-based Internet search engines and the Protege-OWL API, and presented a practical ontology learning system by learning relation instances for a Chinese festival ontology. The system acquires information from the Web and builds the ontology in the standard ontology language OWL using Protege. We also described the procedure for automatically loading the extracted instances into OWL by SPARQL.

We described two stages of learning relation instances: filtering concepts from the Web based on syntactic patterns, and calculating the relevance of the filtered candidate instances. The distinguishing features of our method lie in three facets. First, the extraction rules for activity entities follow a pattern-based approach that detects specializations using V+NP, V, or VN alone; recall is relatively complete because there are no special lexical restrictions. Second, each candidate relation instance is measured against the binary domain keywords. Finally, the queries for computing the relevance use linking words such as “De” and are enclosed in double quotation marks, which improves the accuracy of the extraction.

In the future, we will first introduce deeper syntactic properties to obtain new knowledge. For example, we can divide the extracted customs in the Custom class into eating and drinking customs and other activity customs, and derive the names of foods related to each festival to enrich the domain ontology. We will then extract relation instances for independent domain ontologies, and automatically extract the relations for existing instances.

Acknowledgments.

This research has been partially supported by the Ministry of Education, Science, Sports and Culture, Grant-in-Aid for Scientific Research, 19300029.

References

[1] T.R. Gruber, Towards Principles for the Design of Ontologies Used for Knowledge Sharing, In Proceedings of the International Workshop on Formal Ontology, pp. 907-928, Padova, Italy, August 1993.

[2] R. Navigli, P. Velardi, Learning Domain Ontologies from Document Warehouses and Dedicated Web Sites, Computational Linguistics, MIT Press, 30(2): 151-179, June 2004.

[3] A. Gomez-Perez, D. Manzano-Macho, A Survey of Ontology Learning Methods and Techniques, OntoWeb project, Deliverable D.1.5, June 2003.

[4] R.D. Sanchez, Domain Ontology Learning from Web, http://www.tdx.cbuc.es/TESIS_UPC/AVAILABLE/TDX-0122108-103125, 2007.

[5] M. Kavalec, A. Maedche, and V. Svatek, Discovery of Lexical Entries for Non-taxonomic Relations in Ontology Learning, In Proceedings of SOFSEM 2004, Springer Berlin / Heidelberg, LNCS 2932, pp. 249-256, 2004.

[6] V.D. Boer, M.V. Someren, and B.J. Wielinga, Relation Instantiation for Ontology Population using the Web, In Proceedings of the 29th Annual German Conference on Artificial Intelligence, KI 2006, Springer Berlin / Heidelberg, pp. 202-213, Bremen, Germany, June 2006.

[7] A. Maedche, S. Staab, Ontology Learning for the Semantic Web, IEEE Intelligent Systems, IEEE Educational Activities Department, 16(2): 72-79, March 2001.

[8] M.A. Hearst, Automatic acquisition of hyponyms from large text corpora, In Proceedings of the 14th International Conference on Computational Linguistics (COLING-92), Nantes, France, pp. 539-545, 1992.

[9] P.D. Turney, Mining the Web for synonyms: PMI-IR versus LSA on TOEFL, In Proceedings of the Twelfth European Conference on Machine Learning, Freiburg, Germany, pp. 491-499, 2001.

[10] M.K. Smith, C. Welty, and D.L. McGuinness, OWL Web Ontology Language Guide, W3C Recommendation, 10 February 2004, http://www.w3.org/TR/2004/REC-owl-guide-20040210/#FeatureList.

[11] E. Prud’hommeaux, A. Seaborne, SPARQL Query Language for RDF, W3C Recommendation, 15 January 2008, http://www.w3.org/TR/rdf-sparql-query/.

[12] H. Knublauch, Protege staff members, Protege-OWL API Programmer’s Guide, http://protege.stanford.edu/plugins/owl/api/guide.html, last updated September 21, 2006.

[13] S. Brin, Extracting patterns and relations from the world wide web, Lecture Notes in Computer Science, 1590: 172-183, 1999.

[14] S. Sahay, S. Mukherjea, E. Agichtein, E. V. Garcia, S. B. Navathe, and A. Ram, Discovering semantic biomedical relations utilizing wide web, ACM Trans. Knowl. Discov. Data, ACM 1556-4681, 2(1): 1-15, March 2008.
