Using Semantic Web Resources for Data Quality Management

Christian Fürber and Martin Hepp

christian@fuerber.com, mhepp@computer.org

Presentation at the 17th International Conference on Knowledge Engineering and Knowledge Management, October 10-15, 2010, Lisbon, Portugal

Purpose of Data

C. Fürber, M. Hepp: Using SemWeb Resources for DQM

[Diagram: DATA as the basis for Measurement, Automation, Information & Knowledge, and Decisions]

Data Quality in Practice


Reference: http://www.heise.de/newsticker/meldung/Comdirect-Bank-macht-Kunden-zu-Billiardaeren-996088.html


The Web of Messy Data?


Which one is the correct population?

Retrieved from http://dbpedia.org/sparql on July 20th

The Web of Messy Data?


Retrieved from http://dbpedia.org/sparql on July 20th

Places with negative population?!?

Risk of Failure


[Diagram: DATA as the basis for Measurement, Automation, Information & Knowledge, and Decisions]

Data Quality Problem Types

• Character alignment violation
• Invalid characters
• Word transpositions
• Invalid substrings
• Mistyping / Misspelling errors
• False values
• Misfielded values
• Meaningless values
• Missing values
• Out of range values
• Functional dependency violation
• Incorrect reference
• Referential integrity violation
• Contradictory relationships
• Imprecise values
• Existence of synonyms
• Existence of homonyms
• Unique value violation
• Inconsistent duplicates
• Approximate duplicates
• Outdated values
• Outdated conceptual elements
• Cardinality violation
• Missing classification
• Untyped literals
• Incorrect classification


Reference: Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/

Goals

• Use Semantic Web data to identify data quality problems at the instance level
• Support the Data Quality Management (DQM) process


Total Data Quality Management for and based on the Semantic Web

[Cycle diagram around DQ: Define, Measure, Analyze, Improve]

Reference: Richard Wang (1998)

• Define what's good and/or what's poor data quality
• Develop and apply SPARQL queries based on the DQ definition


How can the Semantic Web support Data Quality Management?

Availability of FREE Data Quality Knowledge, e.g. for the identification of…

• Legal value violations

• Functional dependency violations


Using Trusted References


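[Diagram: a tested knowledge base is checked against a trusted reference via DQ-Constraints]
Tested Knowledgebase (local:Location): Las Vegas, France
Trusted Reference (tref:Location): Las Vegas, USA
DQ-Constraints flag: Las Vegas, France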

Basic Architecture


Basic Characteristics of SPIN

• Allows definition of generalized SPARQL query templates
• Constraint checking based on SPARQL
• Definition of inferencing rules via SPARQL


http://spinrdf.org/

Generic Data Quality Constraints Library for Easy DQ-Definition


available @ http://semwebquality.org/ontologies/dq-constraints#

• Mandatory properties & literals

• Legal values*

• Legal value ranges

• Functional dependencies*

• Legal syntaxes

• Uniqueness

* Designed to use trusted references
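The constraint types that do not rely on a trusted reference can be illustrated outside of SPARQL as well. The following is a minimal Python sketch, not the SPIN library itself: the records, property names, and function names are hypothetical and only mirror the ideas of mandatory properties, legal value ranges, legal syntaxes, and uniqueness.

```python
import re
from collections import Counter

# Hypothetical records standing in for instances of a tested knowledge base.
records = [
    {"id": "a1", "country": "Germany", "population": 82000000, "zip": "85577"},
    {"id": "a2", "country": None, "population": -5, "zip": "8557X"},
    {"id": "a2", "country": "France", "population": 100, "zip": "75001"},
]

def missing_values(rows, prop):
    """Mandatory property check: ids of rows where prop is absent or empty."""
    return [r["id"] for r in rows if r.get(prop) in (None, "")]

def out_of_range(rows, prop, low, high):
    """Legal value range check: ids whose value lies outside [low, high]."""
    return [r["id"] for r in rows
            if r.get(prop) is not None and not (low <= r[prop] <= high)]

def syntax_violations(rows, prop, pattern):
    """Legal syntax check: ids whose value does not fully match the regex."""
    rx = re.compile(pattern)
    return [r["id"] for r in rows
            if r.get(prop) is not None and not rx.fullmatch(str(r[prop]))]

def duplicate_values(rows, prop):
    """Uniqueness check: values of prop that occur more than once."""
    counts = Counter(r[prop] for r in rows if r.get(prop) is not None)
    return sorted(v for v, n in counts.items() if n > 1)
```

For instance, `out_of_range(records, "population", 0, 10**10)` would report the record with the negative population, which is the kind of problem shown on the "Web of Messy Data" slides.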

Definition of Data Quality Constraints based on SPIN


Constraint checking in Practice


Legal Value Constraints

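SELECT ?s
WHERE {
  # Every address in the tested data together with its country name
  ?s a vcard:Address .
  ?s vcard:country-name ?value .
  # Look for a country with the same string value in the trusted reference
  OPTIONAL {
    ?s2 a tref:Location .
    ?s2 tref:country ?value1 .
    FILTER(str(?value1) = str(?value))
  } .
  # Keep only addresses for which no match was found
  FILTER(!bound(?value1))
}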


Return all instances of class vcard:Address whose value for property vcard:country-name has no matching value in property tref:country of the trusted reference.
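The logic of this legal value check is an anti-join: keep every tested value that has no exact string match in the trusted reference. A minimal Python sketch of that idea follows, assuming hypothetical in-memory data in place of the vcard:Address and tref:Location instances; the names `addresses`, `trusted_countries`, and `illegal_values` are illustrative only.

```python
# Hypothetical data: country names of tested addresses, keyed by subject,
# and the set of country names taken from a trusted reference.
addresses = {
    "addr1": "Germany",
    "addr2": "Germnay",   # misspelled on purpose
    "addr3": "Portugal",
}
trusted_countries = {"Germany", "Portugal", "France"}

def illegal_values(tested, legal):
    """Return subjects whose value has no exact string match in `legal`."""
    return sorted(s for s, value in tested.items() if str(value) not in legal)

violations = illegal_values(addresses, trusted_countries)
print(violations)
```

As in the query, the comparison is an exact string match, so spelling variants and synonyms of a legal value would also be reported.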

Functional Dependency Constraints

SELECT ?s
WHERE {
  ?s a gr:LocationOfSalesOrServiceProvisioning .
  ?s vcard:ADR ?node .
  ?node vcard:city ?value1 .
  ?node vcard:country ?value2 .
  # Keep only locations whose city-country pair is absent from the reference
  FILTER NOT EXISTS {
    ?s2 a gn:Location .
    ?s2 gn:asciiname ?value1 .
    ?s2 gn:country ?value2 .
  }
}


Return all instances of gr:LocationOfSalesOrServiceProvisioning whose vcard:ADR address has a city-country combination that does not have a matching pair in instances of gn:Location.
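The functional dependency check differs from the legal value check in that it tests a combination of values rather than a single value. A minimal Python sketch under stated assumptions: the rows and the reference pairs are hypothetical stand-ins for the tested addresses and the gn:Location instances, and `fd_violations` is an illustrative name.

```python
# Hypothetical tested rows: (subject, city, country).
tested = [
    ("loc1", "Las Vegas", "France"),
    ("loc2", "Las Vegas", "USA"),
    ("loc3", "Lisbon", "Portugal"),
]

# Hypothetical trusted reference: the known (city, country) pairs.
trusted_pairs = {("Las Vegas", "USA"), ("Lisbon", "Portugal")}

def fd_violations(rows, reference_pairs):
    """Return subjects whose (city, country) pair is not in the reference."""
    return [s for s, city, country in rows
            if (city, country) not in reference_pairs]

print(fd_violations(tested, trusted_pairs))
```

Both "Las Vegas" and "France" would pass a legal value check on their own; only the pairwise comparison reveals the problem.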

Acquisition of Semantic Web Sources for DQM

(1) Replication of relevant knowledge-bases

(2) On the fly via federated SPARQL queries:

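PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
# The default prefix ":" stands for the local data set; its namespace is not given on the slide.

SELECT *
WHERE {
  ?s1 :location_CITY ?city .
  OPTIONAL {
    SERVICE <http://dbpedia.org/sparql> {
      ?s2 a dbo:City .
      ?s2 rdfs:label ?city .
      FILTER (lang(?city) = "en") .
    }
  }
  # Keep only local cities without a matching DBpedia city label
  FILTER(!bound(?s2))
}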


Limitations

• High degree of uncertainty about quality of Semantic Web resources
• Risk for data quality problem proliferation
• Lack of Semantic Web resources for certain domains
• Flexible design of RDF and structural heterogeneity complicate definition of generic DQ constraints
• Scalability on large data sets
• DQ constraints close the world


Contributions

• Data quality control for Semantic Web data
• Identification of potential inconsistencies between Semantic Web resources
• Reduction of effort for the definition of functional dependency rules and legal value rules
• Reuse of shared data quality rules on a Web scale


Future Work

• Semantic Web information quality assessment framework (SWIQA) with computation of KPIs
• Analysis and identification of useful "trusted references" based on SWIQA
• Application on multi-source master data of information systems
• Evaluation on large data sets


Christian Fürber, Researcher

E-Business & Web Science Research Group

Werner-Heisenberg-Weg 39

85577 Neubiberg

Germany

skype c.fuerber

email christian@fuerber.com

web http://www.unibw.de/ebusiness

homepage http://www.fuerber.com

twitter http://www.twitter.com/cfuerber

Data Quality Constraints Library for SPIN @

http://semwebquality.org/ontologies/dq-constraints#

Paper available at http://bit.ly/c5v6TM
