Metadata Quality Assurance Framework
Péter Király
Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen, Germany
QQML2016: 8th International Conference on Qualitative and Quantitative Methods in Libraries
2016-05-24, London
The problem: there are „good” and „bad” metadata records
Typical issues – non-informative field
Title is not informative
Non-informative: „photograph, framed”, „group photograph”, „photograph”
vs.
Informative: „Photograph of Sir Dugald Clerk”, „Photograph of "Puffing Billy"”
Typical issues – Copy & paste cataloging
Keeping placeholders / templates
Typical issues – Field overuse
What is the meaning of the field? (overuse)
TextGrid OAI-PMH response
Why is data quality important?
„Fitness for purpose” (QA principle)
no metadata → no access to data → no data usage
More explanation: Data on the Web Best Practices, W3C Working Draft, 19 May 2016, https://www.w3.org/TR/dwbp/
Europeana Data Quality Committee
Online collaboration: use case documents, problem catalog, tickets, discussion forum, #EuropeanaDataQuality
Bi-weekly teleconference
Bi-yearly face-to-face meeting
Topics: usage scenarios, metadata profiles, schema modification, measuring, event model, proposals for data providers
Research hypothesis
Hypothesis: by measuring structural elements we can predict metadata record quality
What is it good for?
Improve the metadata
Improve services: good data → functions
Improve metadata schema & documentation
Propagate „good practice”
Domains: cultural heritage sector; research data management and archiving
Research hypothesis
Proposed solution: Metadata Quality Assurance Framework
What to measure?
Measurements
Schema-independent structural features: existence, cardinality, uniqueness, length, dictionary entry, data type conformance
Use case scenarios („fit for purpose”): requirements of the most important functions
Problem catalog: known metadata problems
Discovery scenarios and their metadata requirements
Europeana’s most important functions
1. Basic retrieval with high precision and recall
2. Cross-language recall
3. Entity-based facets
4. Date-based facets
5. Improved language facets
6. Browse by subjects and resource types
7. Browse by agents
8. Browse/Search by Event
9. Entity-based knowledge cards and pages
10. Categorised similar items
11. Spatial search, browse, and map display
12. Entity-based autocompletion
13. Diversification of results
14. Hierarchical search and facets
Credit: the document was initiated by Timothy Hill, Europeana’s search engineer
Discovery scenarios and their metadata requirements – Entity-based facets
Scenario: As a user I want to be able to filter by whether a person is the subject of a book, or its author, engraver, printer etc.
Metadata analysis: In each case the underlying requirement is that the relevant EDM fields for objects be populated by identifying URIs rather than free text. These URIs need to be related, at a minimum, to a label for each of the supported languages.
Measurement rules: the relevant field values should be resolvable URIs; each URI should have labels in multiple languages.
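The first measurement rule can be approximated offline by checking that a field value is at least a syntactically absolute HTTP(S) URI. The following is a minimal sketch under that assumption: the function name `looks_like_uri` is hypothetical, the example URI is made up, and the sketch does not dereference the URI or look up its labels, both of which would need a network request.

```python
from urllib.parse import urlparse


def looks_like_uri(value: str) -> bool:
    """Return True if value is syntactically an absolute http(s) URI.

    Assumption: only http and https count as potentially resolvable.
    The URI is not dereferenced here.
    """
    try:
        parsed = urlparse(value.strip())
    except ValueError:
        # urlparse rejects some malformed inputs, e.g. unbalanced brackets
        return False
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


print(looks_like_uri("http://example.org/agent/1"))  # URI-shaped value
print(looks_like_uri("Clerk, Dugald"))               # free-text value
```

A free-text creator name fails the check, which is the case the scenario wants to detect.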
Discovery scenarios and their metadata requirements – Date-based facets
Scenario: I want to be able to filter my results by a variety of timespans, e.g. date of creation, date of publication, date as subject.
Metadata analysis: Dates should be fully and consistently normalised to follow the XSD date-time data types. Dates expressed in styles like “490 avant J.C” that are inherently language dependent should be avoided as they’re very difficult to normalise (e.g. this should be represented as “-0490”^^xsd:gYear).
Measurement rules: field values should conform to the XSD date-time data types.
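A rough way to test this rule is a lexical check against a few XSD date-time data types. The sketch below is an illustration only: it covers just `xsd:gYear`, `xsd:date` and `xsd:dateTime`, checks the shape of the string rather than calendar validity, and the function name `xsd_date_type` is hypothetical.

```python
import re
from typing import Optional

# Simplified lexical patterns. Assumption: only the shape is checked,
# so an impossible date such as month 13 would still pass.
_TZ = r"(Z|[+-]\d{2}:\d{2})?"
_XSD_PATTERNS = {
    "xsd:gYear": re.compile(r"-?\d{4,}" + _TZ),
    "xsd:date": re.compile(r"-?\d{4,}-\d{2}-\d{2}" + _TZ),
    "xsd:dateTime": re.compile(
        r"-?\d{4,}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?" + _TZ
    ),
}


def xsd_date_type(value: str) -> Optional[str]:
    """Return the name of the first matching XSD type, or None."""
    for name, pattern in _XSD_PATTERNS.items():
        if pattern.fullmatch(value):
            return name
    return None


for candidate in ["-0490", "2016-05-24", "490 avant J.C"]:
    print(candidate, "->", xsd_date_type(candidate))
```

The language-dependent form from the slide matches none of the patterns, while its normalised form is recognised as a year.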
Problem catalog
Catalog of known metadata problems in Europeana
Title contents same as description contents
Systematic use of the same title
Bad string: "empty" (and variants)
Shelfmarks and other identifiers in fields
Creator not an agent name
Absurd geographical location
Subject field used as description field
Unicode U+FFFD (�)
Very short description field
...
Credit: the document was initiated by Timothy Hill, Europeana’s search engineer
Problem catalog
Description: Title contents same as description contents
Example: /2023702/35D943DF60D779EC9EF31F5DF...
Motivation: Distorts search weightings
Checking method: Field comparison
Notes: Record display: creator concatenated onto title
Metadata scenario: Basic Retrieval
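The "field comparison" checking method could look roughly like the sketch below. The record shape (a dict mapping field names to lists of strings), the field names `title` and `description`, and the case- and whitespace-insensitive comparison are all assumptions made for illustration, not details taken from the problem catalog.

```python
def normalise(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.casefold().split())


def title_equals_description(record: dict) -> bool:
    """Return True if any title value equals any description value."""
    titles = {normalise(t) for t in record.get("title", [])}
    descriptions = {normalise(d) for d in record.get("description", [])}
    return bool(titles & descriptions)


# Made-up sample records
flagged = {"title": ["Group  photograph"], "description": ["group photograph"]}
clean = {"title": ["Photograph of Sir Dugald Clerk"], "description": ["Framed portrait"]}
print(title_equals_description(flagged), title_equals_description(clean))
```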
How to define measurements?
Problem catalog – proposed basis of implementation
Shapes Constraint Language (SHACL), https://www.w3.org/TR/shacl/
A language for describing and constraining the contents of RDF graphs. It provides a high-level vocabulary to identify predicates and their associated cardinalities, datatypes and other constraints.
sh:equals, sh:notEquals
sh:hasValue
sh:in
sh:lessThan, sh:lessThanOrEquals
sh:minCount, sh:maxCount
sh:minLength, sh:maxLength
sh:pattern
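To give a feel for what a few of these constraint kinds express, here is a plain-Python imitation of count, length and pattern constraints on the values of one field. It is not SHACL, does not use any SHACL library, and works on lists of strings rather than RDF graphs; the function `check_field` and its parameters are hypothetical names chosen to echo `sh:minCount`, `sh:maxCount`, `sh:minLength`, `sh:maxLength` and `sh:pattern`.

```python
import re


def check_field(values, min_count=0, max_count=None,
                min_length=0, max_length=None, pattern=None):
    """Return a list of violation messages for one field's values."""
    violations = []
    if len(values) < min_count:
        violations.append(f"minCount: expected at least {min_count}, found {len(values)}")
    if max_count is not None and len(values) > max_count:
        violations.append(f"maxCount: expected at most {max_count}, found {len(values)}")
    for value in values:
        if len(value) < min_length:
            violations.append(f"minLength: {value!r} is shorter than {min_length}")
        if max_length is not None and len(value) > max_length:
            violations.append(f"maxLength: {value!r} is longer than {max_length}")
        # The pattern may match anywhere in the value, hence re.search
        if pattern is not None and re.search(pattern, value) is None:
            violations.append(f"pattern: {value!r} does not match {pattern!r}")
    return violations


# A very short title violates an (assumed) minimum length of 15 characters
print(check_field(["photograph"], min_count=1, min_length=15))
```

Expressing such rules declaratively, as SHACL does, keeps the problem catalog separate from the code that evaluates it.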
Early measurement results and their visualization
Views: overall view, collection view, record view
Completeness – 40 measurements
Field cardinality – 27 measurements
Uniqueness – 6 measurements
Language specification – 20 measurements
Problem catalog – 3 measurements
etc.
Figure labels: links, measurements, aggregated numbers
Completeness: what is the ratio of populated fields in records?
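As a minimal sketch of this measurement, completeness can be computed as the share of schema fields that have at least one non-empty value in a record. The record shape (field name mapped to a list of strings) and the short field list are assumptions for illustration; the framework's actual field list is not given in the slides.

```python
def completeness(record: dict, schema_fields: list) -> float:
    """Ratio of schema fields with at least one non-empty value."""
    if not schema_fields:
        raise ValueError("schema_fields must not be empty")
    populated = sum(
        1 for field in schema_fields
        if any(str(value).strip() for value in record.get(field, []))
    )
    return populated / len(schema_fields)


# Hypothetical field list and made-up record
fields = ["dc:title", "dc:description", "dc:creator", "dc:subject"]
record = {"dc:title": ["Photograph"], "dc:description": [""], "dc:subject": ["x", "y"]}
print(completeness(record, fields))
```

An empty string is treated as unpopulated, so the description above does not count towards the ratio.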
Field frequency / main
Field frequency / main
Alternative title is a rare field
Field frequency per collections / all
no record has alternative title
every record has alternative title
Field frequency per collections / remove no-instances
Field frequency per collections / display only complete collections
Cardinality: how many field instances are in the records?
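Cardinality can be sketched as counting the instances of one field per record and then summarising the counts over a collection. The record shape and the sample data below are made up for illustration; `field_cardinalities` and `summarise` are hypothetical names.

```python
from statistics import mean, median


def field_cardinalities(records: list, field: str) -> list:
    """Number of instances of `field` in each record (0 if absent)."""
    return [len(record.get(field, [])) for record in records]


def summarise(counts: list) -> dict:
    """Basic descriptive statistics over per-record counts."""
    return {
        "min": min(counts),
        "max": max(counts),
        "mean": mean(counts),
        "median": median(counts),
    }


# Made-up collection: most records have no subject, one has several
records = [{}, {"dc:subject": []}, {"dc:subject": ["a", "b", "c"]}]
print(summarise(field_cardinalities(records, "dc:subject")))
```

A few records with many instances can pull the mean well above the median, which is why both are worth reporting.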
Field cardinality – overview
more field instances than records
number of records
Field cardinality – overview
dc:type
Field cardinality – histogram
128 subjects in one record
median is 0, mean is close to 1
link to interesting records
Field cardinality – an outlier
Multilinguality: do we know the language of a field value?
Multilinguality
@resource is a URI
@ = language notation in RDF
no language specification
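The three categories above (URI, language-tagged value, no language specification) can be counted with a small classifier. This sketch assumes a JSON-like shape in which a field instance is either a plain string or a dict carrying an `@resource` or `@lang` key; the `@lang` and `#value` key names and the sample data are assumptions, not taken from the slides.

```python
from collections import Counter


def classify_instance(instance) -> str:
    """Return 'resource', a language code, or 'no language'."""
    if isinstance(instance, dict):
        if "@resource" in instance:
            return "resource"
        lang = instance.get("@lang")
        if lang:
            return lang
    return "no language"


def language_frequency(instances) -> Counter:
    """Count field instances per category."""
    return Counter(classify_instance(i) for i in instances)


sample = [
    {"@resource": "http://example.org/concept/1"},
    {"@lang": "en", "#value": "Photograph"},
    {"@lang": "eng", "#value": "Photograph"},
    "Photograph",
]
print(language_frequency(sample))
```

Counting raw codes keeps variants such as `en` and `eng` apart, so differently encoded tags for the same language stay visible in the frequencies.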
Language frequency / barchart
Language frequency / barchart
same language, different encodings
Language frequency / Treemap
has language specification
has no language specification
Language frequency / Treemap with resources
has no language specification
has language specification
is a URI
Language frequency / Treemap + interaction + table
hide/display categories
table-like format
Uniqueness (entropy): how unique are the terms in a field?
Entropy – term uniqueness / main
1 means a unique term
0.0000x means a very frequent term
These are cumulative numbers: $\mathrm{entropy}_{\mathrm{cumulative}} = \mathrm{term}_1 + \dots + \mathrm{term}_n$
Entropy – term uniqueness / collection
max is exceptional (=1425 * mean)
unique records
less unique or non-unique records
Entropy – term uniqueness / refining the picture
bulk of records are close to zero
although 25% are between 0.05 and 1.25
Entropy – term uniqueness / field value
Russian text transliterated into the Latin writing system, not in Cyrillic
Entropy – term uniqueness / terms
explanation of uniqueness score
TF-IDF values come from Apache Solr
term frequency: 1
document freq.: 2
uniqueness score: 0.5
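The slides give a single worked example (term frequency 1, document frequency 2, uniqueness score 0.5) and state that a field's number is the sum of its term scores. One formula consistent with that example is term frequency divided by document frequency; this is an inference for illustration, not a formula stated in the slides, and the actual Solr-based computation may differ.

```python
def term_uniqueness(term_frequency: int, document_frequency: int) -> float:
    """Assumed score: term frequency divided by document frequency."""
    if document_frequency <= 0:
        raise ValueError("document_frequency must be positive")
    return term_frequency / document_frequency


def cumulative_uniqueness(terms) -> float:
    """Sum of per-term scores for one field value.

    `terms` is an iterable of (term_frequency, document_frequency) pairs.
    """
    return sum(term_uniqueness(tf, df) for tf, df in terms)


print(term_uniqueness(1, 2))                            # the example from the slide
print(cumulative_uniqueness([(1, 1), (1, 2), (1, 4)]))  # made-up three-term field
```

Under this assumption a term that occurs in only one document scores 1, and the score shrinks towards 0 as the term appears in more documents.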
Problem catalog: does the record have any specific issues?
Problem catalog – Long subject
a record with 265 „long” subject headings
Problem catalog – Long subject – example (not so long...)
Conclusion: we have to refine the definition of „long”
Problem catalog – same title and description
there is one title and description which is the same
... and we have 9 such records
Problem catalog – same title and description – example
Completeness sub-dimensions: are the sub-dimensions (field groups supporting specific functionalities) complete?
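A sub-dimension score can be sketched as completeness restricted to the fields that support one functionality, with the per-field yes/no answers forming a small matrix. The field groups below are hypothetical; the slides do not list which fields belong to which functionality, and the sample record is made up.

```python
# Hypothetical field groups, for illustration only
FUNCTIONALITY_FIELDS = {
    "basic retrieval": ["dc:title", "dc:description", "dc:subject"],
    "date-based facets": ["dcterms:created", "dcterms:issued"],
}


def is_populated(record: dict, field: str) -> bool:
    """True if the field has at least one non-empty value."""
    return any(str(value).strip() for value in record.get(field, []))


def functionality_matrix(record: dict, groups: dict = FUNCTIONALITY_FIELDS) -> dict:
    """Map each functionality to {field: existing?}."""
    return {
        name: {field: is_populated(record, field) for field in fields}
        for name, fields in groups.items()
    }


def sub_dimension_scores(record: dict, groups: dict = FUNCTIONALITY_FIELDS) -> dict:
    """Share of existing fields per functionality."""
    matrix = functionality_matrix(record, groups)
    return {name: sum(cells.values()) / len(cells) for name, cells in matrix.items()}


record = {"dc:title": ["Photograph of Sir Dugald Clerk"], "dcterms:created": ["1920"]}
print(functionality_matrix(record))
print(sub_dimension_scores(record))
```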
Record view – functionality matrix
existing
missing
functionalities
miscellaneous
Other elements of the record view
Further steps
Incorporating into Europeana’s ingestion tool
Process usage statistics (logs, Google Analytics)
Human evaluation of metadata quality
Measuring timeliness (changes of scores over time)
Machine learning based classification & clustering
Incorporating into research data management tool
Cooperation with other projects
Project principles
Scalable, ready for big data
Loose coupling to metadata schemas
Transparency: open source, open data (CC0)
Release early, release often
Getting real [1]
Collaboration and communication
[1] https://gettingreal.37signals.com/
Architectural overview
[Diagram] Components: OAI-PMH client (PHP); Apache Spark (Java); analysis with Spark (Scala); analysis with R; web interface (PHP, d3.js)
Storage and exchange formats: Hadoop File System, Apache Solr, Apache Cassandra, JSON files, CSV files, image files
Legend: recent workflow, planned workflow
Follow me
Europeana Data Quality Committee: http://pro.europeana.eu/europeana-tech/data-quality-committee
Research plan and blog: http://pkiraly.github.io
Site: http://144.76.218.178/europeana-qa/
Source code: https://github.com/pkiraly/europeana-qa-spark, https://github.com/pkiraly/europeana-qa-r
@kiru, https://www.linkedin.com/in/peterkiraly