![Page 1: Using Semantic Web Resources for Data Quality Management](https://reader033.vdocuments.us/reader033/viewer/2022051610/5485adc1b4af9faa0d8b4ea1/html5/thumbnails/1.jpg)
Using Semantic Web Resources for Data Quality Management
Christian Fürber and Martin Hepp
Presentation at the 17th International Conference on Knowledge Engineering and Knowledge Management, October 10-15, 2010, Lisbon, Portugal
![Page 2: Using Semantic Web Resources for Data Quality Management](https://reader033.vdocuments.us/reader033/viewer/2022051610/5485adc1b4af9faa0d8b4ea1/html5/thumbnails/2.jpg)
Purpose of Data
C. Fürber, M. Hepp:
Using SemWeb Resources for DQM 2
[Figure: DATA (shown as a binary stream) as the basis for Measurement, Automation, Information & Knowledge, and Decisions]
![Page 3: Using Semantic Web Resources for Data Quality Management](https://reader033.vdocuments.us/reader033/viewer/2022051610/5485adc1b4af9faa0d8b4ea1/html5/thumbnails/3.jpg)
Data Quality in Practice
Reference: http://www.heise.de/newsticker/meldung/Comdirect-Bank-macht-Kunden-zu-Billiardaeren-996088.html
![Page 4: Using Semantic Web Resources for Data Quality Management](https://reader033.vdocuments.us/reader033/viewer/2022051610/5485adc1b4af9faa0d8b4ea1/html5/thumbnails/4.jpg)
The Web of Messy Data?
Which one is the correct population?
Retrieved from http://dbpedia.org/sparql on July 20th
![Page 5: Using Semantic Web Resources for Data Quality Management](https://reader033.vdocuments.us/reader033/viewer/2022051610/5485adc1b4af9faa0d8b4ea1/html5/thumbnails/5.jpg)
The Web of Messy Data?
Places with negative population?!?
Retrieved from http://dbpedia.org/sparql on July 20th
![Page 6: Using Semantic Web Resources for Data Quality Management](https://reader033.vdocuments.us/reader033/viewer/2022051610/5485adc1b4af9faa0d8b4ea1/html5/thumbnails/6.jpg)
Risk of Failure
[Figure: DATA (shown as a binary stream) as the basis for Measurement, Automation, Information & Knowledge, and Decisions]
![Page 7: Using Semantic Web Resources for Data Quality Management](https://reader033.vdocuments.us/reader033/viewer/2022051610/5485adc1b4af9faa0d8b4ea1/html5/thumbnails/7.jpg)
Data Quality Problem Types
• Character alignment violation
• Invalid characters
• Word transpositions
• Invalid substrings
• Mistyping / misspelling errors
• False values
• Misfielded values
• Meaningless values
• Missing values
• Out of range values
• Functional dependency violation
• Incorrect reference
• Referential integrity violation
• Contradictory relationships
• Imprecise values
• Existence of synonyms
• Existence of homonyms
• Unique value violation
• Inconsistent duplicates
• Approximate duplicates
• Outdated values
• Outdated conceptual elements
• Cardinality violation
• Missing classification
• Untyped literals
• Incorrect classification
Reference: Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/
![Page 8: Using Semantic Web Resources for Data Quality Management](https://reader033.vdocuments.us/reader033/viewer/2022051610/5485adc1b4af9faa0d8b4ea1/html5/thumbnails/8.jpg)
Goals
• Use Semantic Web data to identify data quality problems at the instance level
• Support the Data Quality Management (DQM) process
![Page 9: Using Semantic Web Resources for Data Quality Management](https://reader033.vdocuments.us/reader033/viewer/2022051610/5485adc1b4af9faa0d8b4ea1/html5/thumbnails/9.jpg)
Total Data Quality Management for and based on the Semantic Web
[Figure: DQ cycle with the phases Define, Measure, Analyze, Improve]
Reference: Richard Wang (1998)
• Define what's good and/or what's poor data quality
• Develop and apply SPARQL queries based on the DQ definition
![Page 10: Using Semantic Web Resources for Data Quality Management](https://reader033.vdocuments.us/reader033/viewer/2022051610/5485adc1b4af9faa0d8b4ea1/html5/thumbnails/10.jpg)
How can the Semantic Web support Data Quality Management?
Availability of FREE Data Quality Knowledge, e.g. for the identification of…
• Legal value violations
• Functional dependency violations
![Page 11: Using Semantic Web Resources for Data Quality Management](https://reader033.vdocuments.us/reader033/viewer/2022051610/5485adc1b4af9faa0d8b4ea1/html5/thumbnails/11.jpg)
Using Trusted References
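[Figure: DQ-Constraints compare a tested knowledge base against a trusted reference]
Tested Knowledgebase (local:Location): Las Vegas, France
Trusted Reference (tref:Location): Las Vegas, USA
Flagged by DQ-Constraints: Las Vegas, France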
![Page 12: Using Semantic Web Resources for Data Quality Management](https://reader033.vdocuments.us/reader033/viewer/2022051610/5485adc1b4af9faa0d8b4ea1/html5/thumbnails/12.jpg)
Basic Architecture
![Page 13: Using Semantic Web Resources for Data Quality Management](https://reader033.vdocuments.us/reader033/viewer/2022051610/5485adc1b4af9faa0d8b4ea1/html5/thumbnails/13.jpg)
Basic Characteristics of SPIN
• Allows definition of generalized SPARQL query templates
• Constraint checking based on SPARQL
• Definition of inferencing rules via SPARQL
http://spinrdf.org/
![Page 14: Using Semantic Web Resources for Data Quality Management](https://reader033.vdocuments.us/reader033/viewer/2022051610/5485adc1b4af9faa0d8b4ea1/html5/thumbnails/14.jpg)
Generic Data Quality Constraints Library for Easy DQ-Definition
available @ http://semwebquality.org/ontologies/dq-constraints#
• Mandatory properties & literals
• Legal values*
• Legal value ranges
• Functional dependencies*
• Legal syntaxes
• Uniqueness
* Designed to use trusted references
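The constraint types that do not need a trusted reference can be illustrated outside of SPIN as well. The following is only a minimal Python sketch of the underlying checks, not the library's implementation: the record layout, the property names, and all function names are assumptions made for illustration.

```python
import re

# Hypothetical sample records; property names and values are illustrative only.
records = [
    {"id": "a1", "city": "Lisbon", "population": 500000, "postal_code": "1000-001"},
    {"id": "a2", "city": "Nowhere", "population": -5, "postal_code": "ABC"},
    {"id": "a2", "city": None, "population": 100, "postal_code": "2000-002"},
]

def missing_property(rows, prop):
    """Mandatory property: indexes of rows where prop is absent or empty."""
    return [i for i, r in enumerate(rows) if r.get(prop) in (None, "")]

def out_of_range(rows, prop, low, high):
    """Legal value range: indexes of rows whose value lies outside [low, high]."""
    return [i for i, r in enumerate(rows)
            if r.get(prop) is not None and not (low <= r[prop] <= high)]

def illegal_syntax(rows, prop, pattern):
    """Legal syntax: indexes of rows whose value does not fully match the regex."""
    rx = re.compile(pattern)
    return [i for i, r in enumerate(rows)
            if r.get(prop) is not None and not rx.fullmatch(str(r[prop]))]

def duplicates(rows, prop):
    """Uniqueness: indexes of rows whose value already occurred in an earlier row."""
    seen, dup = set(), []
    for i, r in enumerate(rows):
        value = r.get(prop)
        if value in seen:
            dup.append(i)
        seen.add(value)
    return dup
```

Each function returns the positions of the violating records, which mirrors how a constraint query returns the violating instances rather than a single pass/fail flag.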
![Page 15: Using Semantic Web Resources for Data Quality Management](https://reader033.vdocuments.us/reader033/viewer/2022051610/5485adc1b4af9faa0d8b4ea1/html5/thumbnails/15.jpg)
Definition of Data Quality Constraints based on SPIN
![Page 16: Using Semantic Web Resources for Data Quality Management](https://reader033.vdocuments.us/reader033/viewer/2022051610/5485adc1b4af9faa0d8b4ea1/html5/thumbnails/16.jpg)
Constraint checking in Practice
![Page 17: Using Semantic Web Resources for Data Quality Management](https://reader033.vdocuments.us/reader033/viewer/2022051610/5485adc1b4af9faa0d8b4ea1/html5/thumbnails/17.jpg)
Legal Value Constraints
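# Prefix declarations are omitted on the slide; tref: denotes the trusted reference.
SELECT ?s
WHERE {
  ?s a vcard:Address .
  ?s vcard:country-name ?value .
  # Look for a trusted location whose country equals the tested value
  OPTIONAL {
    ?s2 a tref:Location .
    ?s2 tref:country ?value1 .
    FILTER(str(?value1) = str(?value))
  }
  # Keep only addresses for which no trusted match was found
  FILTER(!bound(?value1))
}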
Return all instances of class vcard:Address whose value for property vcard:country-name has no matching value in property tref:country of the trusted reference.
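The logic of the legal value query (an optional match against the trusted reference, keeping only the unmatched subjects) amounts to an anti-join. A minimal Python sketch of that idea follows; the sample data and the function name are hypothetical and only stand in for the two graphs.

```python
# Hypothetical data standing in for the tested knowledge base:
# subject -> value of vcard:country-name
addresses = {
    "addr1": "Germany",
    "addr2": "Portugal",
    "addr3": "Germny",  # misspelled on purpose
}

# Hypothetical set of tref:country values from the trusted reference
trusted_countries = {"Germany", "Portugal", "France"}

def legal_value_violations(values_by_subject, legal_values):
    """Return the subjects whose value has no match in the trusted reference.

    Values are compared as strings, mirroring the str() comparison in the query.
    """
    legal = {str(v) for v in legal_values}
    return sorted(s for s, v in values_by_subject.items() if str(v) not in legal)

violations = legal_value_violations(addresses, trusted_countries)
```

In this sketch the result lists only the subject carrying the misspelled country name, because every other value occurs in the trusted set.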
![Page 18: Using Semantic Web Resources for Data Quality Management](https://reader033.vdocuments.us/reader033/viewer/2022051610/5485adc1b4af9faa0d8b4ea1/html5/thumbnails/18.jpg)
Functional Dependency Constraints
SELECT ?s
WHERE {
  ?s a gr:LocationOfSalesOrServiceProvisioning .
  ?s vcard:ADR ?node .
  ?node vcard:city ?value1 .
  ?node vcard:country ?value2 .
  # Keep only locations whose city-country pair has no match in the trusted reference
  FILTER NOT EXISTS {
    ?s2 a gn:Location .
    ?s2 gn:asciiname ?value1 .
    ?s2 gn:country ?value2 .
  }
}
Return all instances of gr:LocationOfSalesOrServiceProvisioning whose address (vcard:ADR) has a city-country combination that does not have a matching pair in instances of gn:Location.
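The difference from a legal value check is that the pair of values is tested, not each value on its own. A minimal Python sketch under assumed sample data (all names and values are hypothetical) could look like this:

```python
# Hypothetical tested data: subject -> (city, country)
locations = {
    "shop1": ("Las Vegas", "USA"),
    "shop2": ("Las Vegas", "France"),
    "shop3": ("Lisbon", "Portugal"),
}

# Hypothetical trusted reference of valid (city, country) combinations
trusted_pairs = {
    ("Las Vegas", "USA"),
    ("Lisbon", "Portugal"),
    ("Paris", "France"),
}

def dependency_violations(pairs_by_subject, trusted):
    """Return the subjects whose (city, country) pair is absent from the trusted set."""
    return sorted(s for s, pair in pairs_by_subject.items() if pair not in trusted)

fd_violations = dependency_violations(locations, trusted_pairs)

# Each value of the violating pair is legal on its own; only the combination is not.
trusted_cities = {city for city, _ in trusted_pairs}
trusted_country_names = {country for _, country in trusted_pairs}
values_legal_individually = all(
    city in trusted_cities and country in trusted_country_names
    for city, country in locations.values()
)
```

Here "Las Vegas" and "France" both occur in the trusted reference, so a legal value check would accept them, while the pair check flags the combination.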
![Page 19: Using Semantic Web Resources for Data Quality Management](https://reader033.vdocuments.us/reader033/viewer/2022051610/5485adc1b4af9faa0d8b4ea1/html5/thumbnails/19.jpg)
Acquisition of Semantic Web Sources for DQM
(1) Replication of relevant knowledge-bases
(2) On the fly via federated SPARQL queries:
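PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
# The default prefix (:) of the local data is not declared on the slide.
SELECT *
WHERE {
  ?s1 :location_CITY ?city .
  OPTIONAL {
    SERVICE <http://dbpedia.org/sparql> {
      ?s2 a dbo:City .
      ?s2 rdfs:label ?city .
      FILTER (lang(?city) = "en") .
    }
  }
  # Keep only local cities without a matching English DBpedia label
  FILTER(!bound(?s2))
}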
![Page 20: Using Semantic Web Resources for Data Quality Management](https://reader033.vdocuments.us/reader033/viewer/2022051610/5485adc1b4af9faa0d8b4ea1/html5/thumbnails/20.jpg)
Limitations
• High degree of uncertainty about the quality of Semantic Web resources
• Risk of data quality problem proliferation
• Lack of Semantic Web resources for certain domains
• Flexible design of RDF and structural heterogeneity complicate the definition of generic DQ constraints
• Scalability on large data sets
• DQ constraints close the world (they are evaluated under a closed-world assumption)
![Page 21: Using Semantic Web Resources for Data Quality Management](https://reader033.vdocuments.us/reader033/viewer/2022051610/5485adc1b4af9faa0d8b4ea1/html5/thumbnails/21.jpg)
Contributions
• Data quality control for Semantic Web data
• Identification of potential inconsistencies between Semantic Web resources
• Reduction of effort for the definition of functional dependency rules and legal value rules
• Reuse of shared data quality rules on a Web scale
![Page 22: Using Semantic Web Resources for Data Quality Management](https://reader033.vdocuments.us/reader033/viewer/2022051610/5485adc1b4af9faa0d8b4ea1/html5/thumbnails/22.jpg)
Future Work
• Semantic Web information quality assessment framework (SWIQA) with computation of KPIs
• Analysis and identification of useful "trusted references" based on SWIQA
• Application to multi-source master data of information systems
• Evaluation on large data sets
![Page 23: Using Semantic Web Resources for Data Quality Management](https://reader033.vdocuments.us/reader033/viewer/2022051610/5485adc1b4af9faa0d8b4ea1/html5/thumbnails/23.jpg)
Christian Fürber Researcher
E-Business & Web Science Research Group
Werner-Heisenberg-Weg 39
85577 Neubiberg
Germany
skype c.fuerber
web http://www.unibw.de/ebusiness
homepage http://www.fuerber.com
twitter http://www.twitter.com/cfuerber
Data Quality Constraints Library for SPIN @
http://semwebquality.org/ontologies/dq-constraints#
Paper available at http://bit.ly/c5v6TM