OntoQA: Metric-Based Ontology Quality Analysis. Samir Tartir, I. Budak Arpinar, Michael Moore, Amit P. Sheth, Boanerges Aleman-Meza. IEEE Workshop on Knowledge Acquisition from Distributed, Autonomous, Semantically Heterogeneous Data and Knowledge Sources. Houston, Texas, November 27, 2005.


OntoQA: Metric-Based Ontology Quality Analysis

Samir Tartir, I. Budak Arpinar, Michael Moore, Amit P. Sheth, Boanerges Aleman-Meza

IEEE Workshop on Knowledge Acquisition from Distributed, Autonomous, Semantically Heterogeneous Data and Knowledge Sources

Houston, Texas, November 27, 2005


The Semantic Web

• Current web is intended for human use
• Semantic web is for humans and computers
• Semantic web uses ontologies as a knowledge-sharing vehicle.
• Many ontologies currently exist: GO, OBO, SWETO, TAP, GlycO, PropreO, etc.


Motivation

• With several ontologies to choose from, users often face the problem of selecting the one best suited to their needs.


OntoQA

• Metric-Based Ontology Quality Analysis

• Describes ontology schemas and instancebases (IBs) through different sets of metrics

• OntoQA is implemented as part of the SemDis project.

[Figure: diagram with the labels "Open/proprietary Heterogeneous Data Sources" (Documents, databases, Html, XML feeds, Emails), "Ontology Schema" and "Populated Ontology"]


Contributions

• Defining the quality of ontologies in terms of:
– Schema
– Instances
    – IB metrics
    – Class-extent metrics

• Providing metrics to quantitatively describe each group


I. Schema Metrics

• Schema metrics address the design of the ontology schema.

• Schema quality can be hard to measure: it depends on domain-expert consensus, subjectivity, etc.

• Three metrics:
– Relationship richness
– Attribute richness
– Inheritance richness


I.1 Relationship Richness

• How close or far is the schema structure to a taxonomy?

• Diversity of relations is a good indication of schema richness.

$$RR = \frac{|P|}{|IsA| + |P|}$$

|P|: Number of non-IsA relationships

|IsA|: Number of IsA relationships
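
As a rough illustration of the ratio defined on this slide, the following Python sketch computes relationship richness from two counts. The function name, the zero-relationship convention, and the example counts are assumptions made for illustration; they are not taken from OntoQA itself.

```python
def relationship_richness(num_non_isa: int, num_isa: int) -> float:
    """RR = |P| / (|IsA| + |P|), where |P| counts non-IsA relationships."""
    total = num_non_isa + num_isa
    if total == 0:
        # Assumption: a schema with no relationships at all gets RR = 0.
        return 0.0
    return num_non_isa / total


# Hypothetical schema: 6 non-IsA relationships and 18 IsA relationships.
print(relationship_richness(6, 18))  # 0.25
```

A value near 0 means the schema is close to a pure taxonomy; a value near 1 means most relationships are not IsA.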


I.2 Attribute Richness

• How much information do classes contain?

$$AR = \frac{|A|}{|C|}$$

|A|: Number of literal attributes

|C|: Number of classes
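
A minimal sketch of the same ratio in Python, assuming the two counts have already been extracted from the schema. The function name and the example numbers are hypothetical.

```python
def attribute_richness(num_attributes: int, num_classes: int) -> float:
    """AR = |A| / |C|: average number of literal attributes per class."""
    if num_classes <= 0:
        raise ValueError("the schema must define at least one class")
    return num_attributes / num_classes


# Hypothetical schema: 30 literal attributes spread over 12 classes.
print(attribute_richness(30, 12))  # 2.5
```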


I.3 Inheritance Richness (Fan-out)

• General (e.g. spanning various domains) vs. specific

$$IR_S = \frac{\sum_{C_i \in C} |H^C(C_j, C_i)|}{|C|}$$

|Hc(cj, ci)|: Number of subclasses of Class Ci

|C|: Number of classes
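
The sketch below shows one way to compute this average fan-out in Python. It assumes that "subclasses of Ci" means direct subclasses and that the hierarchy is available as (subclass, superclass) pairs; both the data layout and the class names are illustrative assumptions.

```python
from collections import Counter


def inheritance_richness(classes, is_a_pairs):
    """Average number of direct subclasses per class.

    classes:    iterable of class names defined in the schema
    is_a_pairs: iterable of (subclass, superclass) pairs
    """
    classes = set(classes)
    if not classes:
        raise ValueError("the schema must define at least one class")
    # fan_out[c] is the number of direct subclasses of c (0 if it has none).
    fan_out = Counter(parent for _child, parent in is_a_pairs)
    return sum(fan_out[c] for c in classes) / len(classes)


# Hypothetical five-class schema with four IsA links.
classes = {"Thing", "Person", "Place", "City", "Researcher"}
is_a = [("Person", "Thing"), ("Place", "Thing"),
        ("City", "Place"), ("Researcher", "Person")]
print(inheritance_richness(classes, is_a))  # 0.8
```

Under this reading the numerator equals the number of IsA links between defined classes, so a single-rooted tree of n classes scores (n - 1) / n.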


II. Instance Metrics

• Deal with the size and distribution of the instance data.

• Instance metrics are grouped into two subcategories:

1. IB metrics: describe the IB as a whole
2. Class metrics: describe the way each class that is defined in the schema is being utilized in the IB


II.1.a Class Richness

• How much does the IB utilize the classes defined in the schema?

• How many classes (in the schema) are actually populated?

$$CR = \frac{|C'|}{|C|}$$

|C’|: Number of used classes

|C|: Number of defined classes
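
A small Python sketch of class richness, assuming the IB is available as a mapping from instance identifier to class name. That layout and the toy data are assumptions for illustration only.

```python
def class_richness(defined_classes, instance_types):
    """CR = |C'| / |C|: fraction of defined classes that have instances.

    defined_classes: iterable of class names from the schema
    instance_types:  dict mapping instance id -> class name
    """
    defined = set(defined_classes)
    if not defined:
        raise ValueError("the schema must define at least one class")
    # Only classes that are actually defined in the schema count as "used".
    used = {cls for cls in instance_types.values() if cls in defined}
    return len(used) / len(defined)


# Hypothetical IB: four defined classes, only two of them populated.
schema = {"Person", "City", "Bank", "Event"}
ib = {"i1": "Person", "i2": "Person", "i3": "City"}
print(class_richness(schema, ib))  # 0.5
```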


II.1.b Average Population

• How well is the IB “filled”?

$$P = \frac{|I|}{|C|}$$

|I|: Number of instances

|C|: Number of defined classes
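
As a quick check of the definition, the sketch below applies it to the class and instance counts that the deck reports for SWETO, TAP and GlycO on its results slide. The function name is an illustrative choice.

```python
def average_population(num_instances: int, num_classes: int) -> float:
    """P = |I| / |C|: average number of instances per defined class."""
    if num_classes <= 0:
        raise ValueError("the schema must define at least one class")
    return num_instances / num_classes


# (ontology, number of classes, number of instances) as reported in the results.
reported = [("SWETO", 44, 1_003_021), ("TAP", 3_230, 71_487), ("GlycO", 356, 387)]
for name, num_classes, num_instances in reported:
    print(f"{name}: {average_population(num_instances, num_classes):.1f}")
# SWETO: 22795.9
# TAP: 22.1
# GlycO: 1.1
```

The printed values agree with the Average Population column of the results table.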


II.1.c Cohesion

• Is the IB graph connected or disconnected?

$$Coh = |CC|$$

|CC|: Number of connected components
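
Counting connected components can be sketched with a union-find structure, as below. This assumes the IB is treated as an undirected graph whose nodes are instances and whose edges are relationships between them; the slides do not say how OntoQA computes it, so this is only one possible approach.

```python
def cohesion(instances, relationships):
    """Coh = |CC|: number of connected components of the instance graph.

    instances:     iterable of instance ids
    relationships: iterable of (instance, instance) pairs, treated as undirected;
                   both endpoints are assumed to appear in `instances`
    """
    parent = {i: i for i in instances}

    def find(x):
        # Walk up to the component representative, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in relationships:
        parent[find(a)] = find(b)  # merge the two components

    return len({find(i) for i in parent})


# Hypothetical IB: {a, b, c} are linked together, {d, e} form a second component.
print(cohesion(["a", "b", "c", "d", "e"], [("a", "b"), ("b", "c"), ("d", "e")]))  # 2
```

A result of 1 means the IB is fully connected; larger values indicate isolated islands of instances.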


II.2.a Importance

• How much attention was paid to each class during instance population?

$$Imp_{C_i} = \frac{|C_i(I)|}{|I|}$$

|Ci(I)|: Number of instances defined for class Ci

|I|: Number of instances
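
The sketch below computes importance for every class at once, assuming the IB is given as a mapping from instance identifier to class name. The layout and the example data are hypothetical.

```python
from collections import Counter


def class_importance(instance_types):
    """Return {class: |Ci(I)| / |I|} for every class that has instances.

    instance_types: dict mapping instance id -> class name
    """
    total = len(instance_types)
    counts = Counter(instance_types.values())
    # With an empty IB, counts is empty and no division takes place.
    return {cls: n / total for cls, n in counts.items()}


# Hypothetical IB with four instances.
ib = {"i1": "Person", "i2": "Person", "i3": "City", "i4": "Bank"}
print(class_importance(ib))  # {'Person': 0.5, 'City': 0.25, 'Bank': 0.25}
```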


II.2.b Connectivity

• What classes are central and what are on the boundary?

$$Conn_{C_i} = \left|\{\, I_j : P(I_i, I_j) \wedge I_i \in C_i(I) \wedge I_j \in C_j(I) \wedge C_j \in C \,\}\right|$$

P(Ii,Ij): Relationships between instances Ii and Ij.

Ci(I): Instances of class Ci.

C: Defined classes.
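
One way to read this definition is as the number of distinct instances Ij that instances of class Ci are related to. The Python sketch below follows that reading; the direction of the relationships (outgoing from Ci), the data layout, and the toy data are assumptions.

```python
def connectivity(cls, instance_types, relationships):
    """Count the distinct instances that instances of `cls` are related to.

    instance_types: dict mapping instance id -> class name (defined classes only)
    relationships:  iterable of (source instance, target instance) pairs
    """
    targets = {
        dst
        for src, dst in relationships
        # Source must be an instance of cls; target must belong to some defined class.
        if instance_types.get(src) == cls and dst in instance_types
    }
    return len(targets)


# Hypothetical IB: two persons, one city, one bank.
types = {"p1": "Person", "p2": "Person", "c1": "City", "b1": "Bank"}
rels = [("p1", "c1"), ("p2", "c1"), ("p1", "b1"), ("c1", "b1")]
print(connectivity("Person", types, rels))  # 2
print(connectivity("Bank", types, rels))    # 0
```

Classes with high values sit at the centre of the IB graph, while classes with values near 0 sit on its boundary.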


II.2.c Fullness

• Is the number of instances close to the expected?

$$F = \frac{|C_i(I)|}{|C_i'(I)|}$$

|Ci(I)|: Number of instances of class Ci.

|Ci’(I)|: Number of expected instances of class Ci.
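
A minimal sketch of fullness in Python. The slides do not say where the expected count comes from, so here it is simply a number supplied by the user; the function name and example values are hypothetical.

```python
def fullness(actual_instances: int, expected_instances: int) -> float:
    """F = |Ci(I)| / |Ci'(I)|: actual over expected instance count for a class."""
    if expected_instances <= 0:
        raise ValueError("the expected instance count must be positive")
    return actual_instances / expected_instances


# Hypothetical class with 150 instances where 200 were expected.
print(fullness(150, 200))  # 0.75
```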


II.2.d Relationship Richness

• How well does the IB utilize relationships defined in the schema?

$$RR_{C_i} = \frac{\left|\{\, \mathrm{Distinct}(P(I_i, I_j)) : I_i \in C_i(I),\ I_j \in C_j(I),\ C_j \in C \,\}\right|}{|P(C_i, C_j)|}$$

P(Ii,Ij): Relationships between instances Ii and Ij.

Ci(I): Instances of class Ci.

Cj(I): Instances of class Cj.

C: Defined classes

P(Ci,Cj): Relationships between classes Ci and Cj.
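
The following Python sketch reads this metric as the number of distinct relationship types actually used by instances of Ci, divided by the number of relationship types the schema defines for Ci. The triple-based layout, the per-class schema dictionary, and all names in the example are assumptions made for illustration.

```python
def class_relationship_richness(cls, instance_types, triples, schema_relationships):
    """Fraction of the relationships defined for `cls` that its instances use.

    instance_types:       dict mapping instance id -> class name
    triples:              iterable of (source instance, relationship, target instance)
    schema_relationships: dict mapping class name -> set of relationship names
    """
    defined = schema_relationships.get(cls, set())
    if not defined:
        # Assumption: a class with no schema relationships gets richness 0.
        return 0.0
    used = {
        rel
        for src, rel, dst in triples
        if instance_types.get(src) == cls and dst in instance_types and rel in defined
    }
    return len(used) / len(defined)


# Hypothetical data: Person defines four relationships, instances use two of them.
schema = {"Person": {"worksFor", "livesIn", "authorOf", "memberOf"}}
types = {"p1": "Person", "p2": "Person", "o1": "Organization", "c1": "City"}
triples = [("p1", "worksFor", "o1"), ("p2", "worksFor", "o1"), ("p1", "livesIn", "c1")]
print(class_relationship_richness("Person", types, triples, schema))  # 0.5
```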


II.2.e Inheritance Richness

• Is the class general or specific?

$$IR_{C_i} = \frac{\sum_{C_j \in C'} |H^C(C_k, C_j)|}{|C'|}$$

C’: Classes belonging to the subtree rooted at Ci

|Hc(ck, cj)|: Number of subclasses of Class Cj
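
The per-class variant can be sketched by first collecting the subtree rooted at Ci and then averaging the fan-out over it. As before, this assumes direct subclasses and a hierarchy given as (subclass, superclass) pairs; the class names are hypothetical.

```python
from collections import defaultdict


def class_inheritance_richness(root, is_a_pairs):
    """Average number of direct subclasses per class in the subtree rooted at `root`."""
    children = defaultdict(set)
    for child, parent in is_a_pairs:
        children[parent].add(child)

    # Collect C': the root plus all of its direct and indirect subclasses.
    subtree, stack = set(), [root]
    while stack:
        cls = stack.pop()
        if cls in subtree:
            continue  # already visited (possible with multiple inheritance)
        subtree.add(cls)
        stack.extend(children[cls])

    return sum(len(children[cls]) for cls in subtree) / len(subtree)


# Hypothetical hierarchy.
is_a = [("Person", "Thing"), ("Place", "Thing"), ("City", "Place"),
        ("Researcher", "Person"), ("Student", "Person")]
print(class_inheritance_richness("City", is_a))   # 0.0, a leaf class
print(class_inheritance_richness("Thing", is_a))  # 5 links over 6 classes
```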


Implementation

• Written in Java

• Processes ontology schema and IB files written in OWL, RDF, or RDFS.

• Uses Sesame to process the ontology schema and IB files.


Testing

• SWETO: LSDIS’ general-purpose ontology that covers domains including publications, affiliations, geography and terrorism.

• TAP: Stanford’s general-purpose ontology. It is divided into 43 domains. Some of these domains are publications, sports and geography.

• GlycO: LSDIS’ ontology for the Glycan Expression

• OBO: Open Biomedical Ontologies


Results – Class Metrics

Ontology   # of Classes   # of Instances   Inheritance Richness   Class Richness   Average Population
SWETO      44             1,003,021        0.9                    56.8%            22,795.9
TAP        3,230          71,487           1.2                    9.4%             22.1
GlycO      356            387              1.3                    18.0%            1.1
PropreO    244            0                1.0                    0.0%             0.0


Results – Class Importance

[Figure: Class Importance, SWETO. Bar chart, y-axis 0 to 70. Classes: Publication, Scientific_Publication, Computer_Science_Researcher, Organization, Company, Conference, Place, City, Bank, Airport, Terrorist_Attack, Event, ACM_Subject_Descriptors]

[Figure: Class Importance, TAP. Bar chart, y-axis 0 to 35. Classes: Musician, Athlete, Author, Actor, Movie, PersonalComputerGame, Book, ProductType, UnitedStatesCity, University, City, Fortune1000Company, Astronaut, ComicStrip]

[Figure: Class Importance, GlycO. Bar chart, y-axis 0 to 70. Classes: N-glycan, glycan_moiety, N-glycan_residue, carbohydrate_residue_property, N-glycan_alpha-D-Manp, alpha-D-mannopyranosyl_residue, N-glycan_beta-D-GlcpNAc, N-acetyl-beta-D-glucopyranosaminyl_residue, molecular_fragment, sugar_configuration, beta-D-galactopyranosyl_residue, N-glycan_beta-D-Galp, N-glycan_alpha-Neu5Ac, sugar_structural_variant]


Results – Class Connectivity

[Figure: Class Connectivity, SWETO. Bar chart, y-axis 0 to 9. Classes: Terrorist_Attack, Bank, Airport, ACM_Second_level_Classification, ACM_Third_level_Classification, City, State, ACM_Subject_Descriptors, ACM_Top_level_Classification, Computer_Science_Researcher, Scientific_Publication, Company, Terrorist_Organization]

[Figure: Class Connectivity, TAP. Bar chart, y-axis 0 to 7. Classes: CMUFaculty, Person, ResearchProject, MailingList, CMUGraduateStudent, CMUPublication, CMU_RAD, W3CSpecification, W3CPerson, W3CWorkingDraft, ComputerScientist, CMUCourse, BaseballTeam, W3CNote]

[Figure: Class Connectivity, GlycO. Bar chart, y-axis 0 to 12. Classes: N-glycan_beta-D-GalpNAc, N-glycan_beta-D-GlcpNAc, N-glycan_alpha-Neu5Ac, N-glycan_alpha-D-Galp, N-glycan_alpha-L-Fucp, N-glycan_alpha-Neu5Gc, N-glycan_beta-D-Xylp, N-glycan_beta-D-Galp, N-glycan_alpha-D-Glcp, N-glycan_alpha-D-Manp, N-glycan_beta-D-Manp, N-acetyl-glucosaminyl_transferase_V, N-glycan_alpha-D-GlcpNAc, N-glycan_D-GlcNAc-ol]


BioMedical Ontologies

Ontology                      No. of Terms (Instances)   Average No. of Subterms   Connectivity
Protein-protein Interaction   195                        4.6                       1.1
MGED                          228                        5.1                       0.3
Biological Imaging Methods    260                        5.2                       1.0
Physico-chemical Process      550                        2.7                       1.3
Cereal Plant Trait            692                        3.7                       1.1
BRENDA                        2,222                      3.3                       1.2
Human Disease                 19,137                     5.5                       1.0
Gene Ontology                 20,002                     4.1                       1.4


Conclusions

• More ontologies are being introduced as the semantic web gains momentum.

• There is no easy way for users to choose the most suitable ontology for their applications.

• OntoQA offers 3 categories of metrics to describe the quality and nature of an ontology.


Future Work

• Calculation of domain-dependent metrics that make use of a standard ontology in a given domain.

• Making OntoQA a web service where users can enter the paths of their ontology files and use OntoQA to measure the quality of the ontology.


Questions