

OntoQA: Metric-Based Ontology Quality Analysis

Samir Tartir, I. Budak Arpinar, Michael Moore, Amit P. Sheth, Boanerges Aleman-Meza

IEEE Workshop on Knowledge Acquisition from Distributed, Autonomous, Semantically Heterogeneous Data and Knowledge Sources

Houston, Texas, November 27, 2005

The Semantic Web

• The current web is intended for human use.

• The Semantic Web is for humans and computers.

• The Semantic Web uses ontologies as a knowledge-sharing vehicle.

• Many ontologies currently exist: GO, OBO, SWETO, TAP, GlycO, PropreO, etc.

Motivation

• With several ontologies to choose from, users often face the problem of selecting the one best suited to their needs.

OntoQA

• Metric-Based Ontology Quality Analysis

• Describes ontology schemas and instance bases (IBs) through different sets of metrics

• OntoQA is implemented as part of the SemDis project.

[Figure: diagram labeled with open/proprietary heterogeneous data sources (documents, databases, HTML, XML feeds, emails), an ontology schema, and a populated ontology.]

Contributions

• Defining the quality of ontologies in terms of:
  – Schema
  – Instances
    · IB metrics
    · Class-extent metrics

• Providing metrics to quantitatively describe each group

I. Schema Metrics

• Schema metrics address the design of the ontology schema.

• Schema quality can be hard to measure: it involves domain expert consensus, subjectivity, etc.

• Three metrics:
  – Relationship richness
  – Attribute richness
  – Inheritance richness

I.1 Relationship Richness

• How close is the schema structure to a taxonomy?

• Diversity of relations is a good indication of schema richness.

$$RR = \frac{|P|}{|IsA| + |P|}$$

|P|: Number of non-IsA relationships

|IsA|: Number of IsA relationships
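As a minimal sketch, relationship richness can be computed by counting edge types, assuming the slide's formula is RR = |P| / (|IsA| + |P|). The edge-list representation, the `"isA"` label, and the toy schema are hypothetical and chosen only for illustration.

```python
def relationship_richness(schema_edges):
    """Return |P| / (|IsA| + |P|) for (source, relation, target) schema edges."""
    if not schema_edges:
        return 0.0
    # Count IsA edges; everything else is treated as a non-IsA relationship.
    isa = sum(1 for _, rel, _ in schema_edges if rel == "isA")
    non_isa = len(schema_edges) - isa
    return non_isa / len(schema_edges)


# Hypothetical toy schema: two IsA edges and two non-IsA relationships.
edges = [
    ("Student", "isA", "Person"),
    ("Professor", "isA", "Person"),
    ("Professor", "advises", "Student"),
    ("Student", "enrolledIn", "Course"),
]
print(relationship_richness(edges))  # 0.5
```

A value near 0 means the schema is close to a pure taxonomy; a value near 1 means most relationships are non-IsA.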

I.2 Attribute Richness

• How much information do classes contain?

$$AR = \frac{|A|}{|C|}$$

|A|: Number of literal attributes

|C|: Number of classes
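The following sketch illustrates attribute richness, assuming the formula AR = |A| / |C| (literal attributes per class). The dictionary layout and the class and attribute names are hypothetical.

```python
def attribute_richness(class_attributes):
    """Return |A| / |C|: the average number of literal attributes per class."""
    if not class_attributes:
        return 0.0
    total_attributes = sum(len(attrs) for attrs in class_attributes.values())
    return total_attributes / len(class_attributes)


# Hypothetical schema: each class mapped to its literal attributes.
schema = {
    "Person": ["name", "birthDate"],
    "Course": ["title"],
    "Department": [],
}
print(attribute_richness(schema))  # 1.0
```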

I.3 Inheritance Richness (Fan-out)

• General (e.g. spanning various domains) vs. specific

$$IR_S = \frac{\sum_{C_i \in C} |H^C(C_j, C_i)|}{|C|}$$

|Hc(cj, ci)|: Number of subclasses of Class Ci

|C|: Number of classes
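A small sketch of schema-level inheritance richness, read here as the average number of subclasses per class. It assumes that "subclasses" means direct subclasses; the hierarchy below is a hypothetical example.

```python
def inheritance_richness(subclasses):
    """Average number of direct subclasses per class.

    `subclasses` maps every class in the schema to its list of direct subclasses.
    """
    if not subclasses:
        return 0.0
    total = sum(len(children) for children in subclasses.values())
    return total / len(subclasses)


# Hypothetical hierarchy with four classes and three subclass links.
hierarchy = {
    "Person": ["Student", "Professor"],
    "Student": ["GraduateStudent"],
    "Professor": [],
    "GraduateStudent": [],
}
print(inheritance_richness(hierarchy))  # 0.75
```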

II. Instance Metrics

• Deal with the size and distribution of the instance data.

• Instance metrics are grouped into two subcategories:
  1. IB metrics: describe the IB as a whole
  2. Class metrics: describe the way each class that is defined in the schema is being utilized in the IB

II.1.a Class Richness

• How much does the IB utilize the classes defined in the schema?

• How many classes (in the schema) are actually populated?

$$CR = \frac{|C'|}{|C|}$$

|C’|: Number of used classes

|C|: Number of defined classes
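Class richness can be sketched as the fraction of defined classes that have at least one instance, assuming CR = |C'| / |C|. The instance-to-class mapping and all names below are hypothetical.

```python
def class_richness(defined_classes, instance_types):
    """Return |C'| / |C|: the share of defined classes that are populated."""
    if not defined_classes:
        return 0.0
    # Only count classes that are actually defined in the schema.
    used = {cls for cls in instance_types.values() if cls in defined_classes}
    return len(used) / len(defined_classes)


# Hypothetical IB: three instances populating two of four defined classes.
defined = {"Person", "Course", "Department", "Building"}
instances = {"alice": "Person", "bob": "Person", "cs101": "Course"}
print(class_richness(defined, instances))  # 0.5
```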

II.1.b Average Population

• How well is the IB “filled”?

$$P = \frac{|I|}{|C|}$$

|I|: Number of instances

|C|: Number of defined classes
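As a quick check of the definition, assuming P = |I| / |C|, the sketch below recomputes average population from the class and instance counts that this deck reports for SWETO and TAP in its results table. The function name is hypothetical.

```python
def average_population(num_instances, num_classes):
    """Return |I| / |C|: the average number of instances per defined class."""
    return num_instances / num_classes if num_classes else 0.0


# Counts taken from the deck's results table.
print(round(average_population(1_003_021, 44), 1))   # SWETO: 22795.9
print(round(average_population(71_487, 3_230), 1))   # TAP: 22.1
```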

II.1.c Cohesion

• Is the IB graph connected or disconnected?

$$Coh = |CC|$$

|CC|: Number of connected components
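One way to count connected components is a union-find pass over the instance graph. This is a sketch only: it assumes relationships are treated as undirected edges and that both endpoints of every relationship appear in the instance list. The instance names are hypothetical.

```python
def cohesion(instances, relationships):
    """Return |CC|: the number of connected components of the IB graph."""
    parent = {inst: inst for inst in instances}

    def find(x):
        # Walk to the root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in relationships:
        parent[find(a)] = find(b)  # merge the two components

    return len({find(inst) for inst in instances})


# Hypothetical IB: one component around cs101, one around the two papers.
instances = ["alice", "bob", "cs101", "paper1", "paper2"]
relationships = [("alice", "cs101"), ("bob", "cs101"), ("paper1", "paper2")]
print(cohesion(instances, relationships))  # 2
```

A result of 1 means the IB graph is fully connected; larger values indicate more isolated groups of instances.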

II.2.a Importance

• How much attention was paid to each class during instance population?

$$Imp_{C_i} = \frac{|C_i(I)|}{|I|}$$

|Ci(I)|: Number of instances defined for class Ci

|I|: Number of instances
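A brief sketch of class importance, assuming it is the share of all instances that belong to a given class (|Ci(I)| / |I|). The instance-to-class mapping is a hypothetical stand-in for a real IB.

```python
from collections import Counter


def class_importance(instance_types):
    """Map each populated class to its share of all instances."""
    counts = Counter(instance_types.values())
    total = len(instance_types)
    return {cls: n / total for cls, n in counts.items()}


# Hypothetical IB: three Person instances and one Course instance.
instances = {
    "alice": "Person",
    "bob": "Person",
    "carol": "Person",
    "cs101": "Course",
}
print(class_importance(instances))  # {'Person': 0.75, 'Course': 0.25}
```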

II.2.b Connectivity

• Which classes are central and which are on the boundary?

$$Conn_{C_i} = |\{I_j : P(I_i, I_j),\ I_i \in C_i(I),\ I_j \in C_j(I),\ C_j \in C\}|$$

P(Ii,Ij): Relationships between instances Ii and Ij.

Ci(I): Instances of class Ci.

C: Defined classes.
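The sketch below shows one possible reading of connectivity: the number of distinct instances that instances of a class are related to. It assumes relationships are directed from the first instance to the second and that targets may belong to any defined class; the data is hypothetical.

```python
def connectivity(cls, instance_types, relationships):
    """Count distinct instances reached by relationships from instances of `cls`."""
    targets = {
        dst
        for src, _rel, dst in relationships
        # Source must be an instance of cls; target must be a known instance.
        if instance_types.get(src) == cls and dst in instance_types
    }
    return len(targets)


# Hypothetical IB with two people and two courses.
instance_types = {
    "alice": "Person",
    "bob": "Person",
    "cs101": "Course",
    "cs102": "Course",
}
relationships = [
    ("alice", "enrolledIn", "cs101"),
    ("bob", "enrolledIn", "cs101"),
    ("bob", "enrolledIn", "cs102"),
]
print(connectivity("Person", instance_types, relationships))  # 2
print(connectivity("Course", instance_types, relationships))  # 0
```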

II.2.c Fullness

• Is the number of instances close to the expected?

$$F = \frac{|C_i(I)|}{|C_i'(I)|}$$

|Ci(I)|: Number of instances of class Ci.

|Ci’(I)|: Number of expected instances of class Ci.

II.2.d Relationship Richness

• How well does the IB utilize relationships defined in the schema?

$$RR_{C_i} = \frac{|\{\mathrm{Distinct}(P(I_i, I_j)) : I_i \in C_i(I),\ I_j \in C_j(I),\ C_j \in C\}|}{|P(C_i, C_j)|}$$

P(Ii,Ij): Relationships between instances Ii and Ij.

Ci(I): Instances of class Ci.

Cj(I): Instances of class Cj.

C: Defined classes

P(Ci,Cj): Relationships between classes Ci and Cj.
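A rough sketch of class-level relationship richness: the distinct relationships actually used by instances of a class, divided by the relationships the schema defines for that class. It assumes the denominator counts schema relationships whose source is the class in question; all names are hypothetical.

```python
def class_relationship_richness(cls, instance_types, instance_rels, schema_rels):
    """Share of the schema relationships of `cls` that its instances actually use."""
    # Relationships the schema defines with `cls` as the source class.
    defined = {rel for src_cls, rel, _dst_cls in schema_rels if src_cls == cls}
    if not defined:
        return 0.0
    # Distinct relationships used by instances of `cls` in the IB.
    used = {
        rel
        for src, rel, dst in instance_rels
        if instance_types.get(src) == cls and dst in instance_types and rel in defined
    }
    return len(used) / len(defined)


# Hypothetical schema and IB: Person defines two relationships, one is used.
schema_rels = [
    ("Person", "enrolledIn", "Course"),
    ("Person", "memberOf", "Department"),
]
instance_types = {"alice": "Person", "bob": "Person", "cs101": "Course"}
instance_rels = [
    ("alice", "enrolledIn", "cs101"),
    ("bob", "enrolledIn", "cs101"),
]
print(class_relationship_richness("Person", instance_types, instance_rels, schema_rels))  # 0.5
```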

II.2.e Inheritance Richness

• Is the class general or specific?

$$IR_{C_i} = \frac{\sum_{C_j \in C'} |H^C(C_k, C_j)|}{|C'|}$$

C’: Classes belonging to the subtree rooted at Ci

|Hc(ck, cj)|: Number of subclasses of class Cj
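Class-level inheritance richness can be sketched as the average number of subclasses per class within the subtree rooted at a class. The sketch assumes direct subclasses are counted and that the root itself is part of the subtree; the hierarchy is hypothetical.

```python
def class_inheritance_richness(root, subclasses):
    """Average number of direct subclasses over the subtree rooted at `root`."""
    subtree = set()
    stack = [root]
    while stack:
        cls = stack.pop()
        if cls in subtree:
            continue  # guard against classes reachable through several parents
        subtree.add(cls)
        stack.extend(subclasses.get(cls, []))
    total = sum(len(subclasses.get(cls, [])) for cls in subtree)
    return total / len(subtree)


# Hypothetical hierarchy; classes without an entry have no subclasses.
hierarchy = {
    "Thing": ["Person", "Course"],
    "Person": ["Student", "Professor"],
    "Student": ["GraduateStudent"],
}
print(class_inheritance_richness("Person", hierarchy))  # 0.75
```

Here the subtree of Person holds four classes and three subclass links, so the value is 3 / 4.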

Implementation

• Written in Java

• Processes ontology schema and IB files written in OWL, RDF, or RDFS.

• Uses Sesame to process the ontology schema and IB files.

Testing

• SWETO: LSDIS’ general-purpose ontology that covers domains including publications, affiliations, geography and terrorism.

• TAP: Stanford’s general-purpose ontology. It is divided into 43 domains. Some of these domains are publications, sports and geography.

• GlycO: LSDIS’ ontology for the Glycan Expression

• OBO: Open Biomedical Ontologies

Results – Class Metrics

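| Ontology | # of Classes | # of Instances | Inheritance Richness | Class Richness | Average Population |
|----------|--------------|----------------|----------------------|----------------|--------------------|
| SWETO    | 44           | 1,003,021      | 0.9                  | 56.8%          | 22,795.9           |
| TAP      | 3,230        | 71,487         | 1.2                  | 9.4%           | 22.1               |
| GlycO    | 356          | 387            | 1.3                  | 18.0%          | 1.1                |
| PropreO  | 244          | 0              | 1.0                  | 0.0%           | 0.0                |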

Results – Class Importance

[Figure: three bar charts titled "Class Importance", one each for SWETO, TAP, and GlycO; x-axis: Class.]

SWETO (axis ticks 0 to 70): Publication, Scientific_Publication, Computer_Science_Researcher, Organization, Company, Conference, Place, City, Bank, Airport, Terrorist_Attack, Event, ACM_Subject_Descriptors

TAP (axis ticks 0 to 35): Musician, Athlete, Author, Actor, Movie, PersonalComputerGame, Book, ProductType, UnitedStatesCity, University, City, Fortune1000Company, Astronaut, ComicStrip

GlycO (axis ticks 0 to 70): N-glycan, glycan_moiety, N-glycan_residue, carbohydrate_residue_property, N-glycan_alpha-D-Manp, alpha-D-mannopyranosyl_residue, N-glycan_beta-D-GlcpNAc, N-acetyl-beta-D-glucopyranosaminyl_residue, molecular_fragment, sugar_configuration, beta-D-galactopyranosyl_residue, N-glycan_beta-D-Galp, N-glycan_alpha-Neu5Ac, sugar_structural_variant

Results – Class Connectivity

[Figure: three bar charts titled "Class Connectivity", one each for SWETO, TAP, and GlycO; x-axis: Class.]

SWETO (axis ticks 0 to 9): Terrorist_Attack, Bank, Airport, ACM_Second_level_Classification, ACM_Third_level_Classification, City, State, ACM_Subject_Descriptors, ACM_Top_level_Classification, Computer_Science_Researcher, Scientific_Publication, Company, Terrorist_Organization

TAP (axis ticks 0 to 7): CMUFaculty, Person, ResearchProject, MailingList, CMUGraduateStudent, CMUPublication, CMU_RAD, W3CSpecification, W3CPerson, W3CWorkingDraft, ComputerScientist, CMUCourse, BaseballTeam, W3CNote

GlycO (axis ticks 0 to 12): N-glycan_beta-D-GalpNAc, N-glycan_beta-D-GlcpNAc, N-glycan_alpha-Neu5Ac, N-glycan_alpha-D-Galp, N-glycan_alpha-L-Fucp, N-glycan_alpha-Neu5Gc, N-glycan_beta-D-Xylp, N-glycan_beta-D-Galp, N-glycan_alpha-D-Glcp, N-glycan_alpha-D-Manp, N-glycan_beta-D-Manp, N-acetyl-glucosaminyl_transferase_V, N-glycan_alpha-D-GlcpNAc, N-glycan_D-GlcNAc-ol

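BioMedical Ontologies

| Ontology                    | No. of Terms (Instances) | Average No. of Subterms | Connectivity |
|-----------------------------|--------------------------|-------------------------|--------------|
| Protein-protein Interaction | 195                      | 4.6                     | 1.1          |
| MGED                        | 228                      | 5.1                     | 0.3          |
| Biological Imaging Methods  | 260                      | 5.2                     | 1.0          |
| Physico-chemical Process    | 550                      | 2.7                     | 1.3          |
| Cereal Plant Trait          | 692                      | 3.7                     | 1.1          |
| BRENDA                      | 2,222                    | 3.3                     | 1.2          |
| Human Disease               | 19,137                   | 5.5                     | 1.0          |
| Gene Ontology               | 20,002                   | 4.1                     | 1.4          |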

Conclusions

• More ontologies are being introduced as the Semantic Web gains momentum.

• There is no easy way for users to choose the most suitable ontology for their applications.

• OntoQA offers 3 categories of metrics to describe the quality and nature of an ontology.

Future Work

• Calculation of domain-dependent metrics that make use of a standard ontology in a given domain.

• Making OntoQA a web service where users can enter the paths of their ontology files and use OntoQA to measure the quality of the ontology.

Questions
