Linked Humanities Data: The Next Frontier?
A Case-Study in Historical Census Data
Albert Meroño-Peñuela
Knowledge Representation & Reasoning Group
29-10-2012
The Dutch historical censuses (1795-1971)
• Population, Houses and Occupation censuses
• 507 Excel files
• 2,288 tables
• 33,283 annotated cells
Heterogeneity: structural
Heterogeneity: semantic
• Variable meaning
  – Plaatselijke indeling / Kom, buiten de kom + Wijk + Naam / Plaats
  – Variable design (age 14-18, 19-20 vs. 14-15, 16-20)
• Variable values
  – RomschKatholik, RomsKatholic, VaticanChristelijk
  – Change in municipalities, occupations
(Current) Harmonization
• Manually create a (more general) translation table using standard classification systems
  – Map occupation literals to HISCO codes
  – Map municipality literals to AC codes
• Cons
  – Expensive
  – Detail/specificity loss
  – Process is non-repeatable
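The translation-table approach can be sketched in a few lines of Python. This is only an illustration, not the project's tooling: the table entries, the `OCC-…` codes (placeholders, not real HISCO codes), the sample rows and the `harmonize` helper are all assumptions made up for the example.

```python
# Hypothetical translation table: source literal -> harmonized code.
# The codes are placeholders, NOT real HISCO codes.
OCCUPATION_TABLE = {
    "schoenmaker": "OCC-001",
    "schoenmakers": "OCC-001",
    "dakdekker": "OCC-002",
}


def harmonize(literal, table):
    """Return the code for a literal, or None when it is unmapped."""
    return table.get(literal.strip().lower())


# Invented sample rows: (occupation literal as written in a table, count).
rows = [("Schoenmaker", 12), ("schoenmakers ", 3), ("Dakdekker", 5), ("Turfsteker", 7)]

totals = {}    # harmonized code -> summed count
unmapped = []  # literals the table does not cover

for literal, count in rows:
    code = harmonize(literal, OCCUPATION_TABLE)
    if code is None:
        unmapped.append(literal)
        continue
    totals[code] = totals.get(code, 0) + count
```

Two distinct spellings collapse into one code, which illustrates the detail loss listed under the cons, and any literal missing from the table has to be handled by hand.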
Additional requirements
• Errors: non-destructive update of values
• Provenance: record who did what, when, why
• Datamodel: do not commit to a specific one
• Linkage: enrich the dataset by linking it to others (e.g. labour strikes, book publications in NL)
• Publication: open data for researchers
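The first two requirements can be combined by never overwriting a cell and instead appending corrections that carry their own provenance. The sketch below is one possible shape for this, assuming an in-memory store; `CellStore`, `Correction`, the cell identifier and the sample values are hypothetical names and data, not part of the project.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Correction:
    cell: str            # which cell was corrected
    value: int           # the corrected value (what)
    author: str          # who made the correction
    reason: str          # why it was made
    timestamp: datetime  # when it was made


class CellStore:
    """Keeps original values untouched and logs corrections on top of them."""

    def __init__(self, original):
        self._original = dict(original)
        self._log = []

    def correct(self, cell, value, author, reason, timestamp=None):
        ts = timestamp or datetime.now(timezone.utc)
        self._log.append(Correction(cell, value, author, reason, ts))

    def current(self, cell):
        # The most recent correction wins; otherwise fall back to the source value.
        for c in reversed(self._log):
            if c.cell == cell:
                return c.value
        return self._original[cell]

    def original(self, cell):
        return self._original[cell]

    def history(self, cell):
        return [c for c in self._log if c.cell == cell]


store = CellStore({"B7": 1203})
store.correct("B7", 1230, author="editor-1", reason="digits transposed in source")
```

Here `store.current("B7")` returns the corrected value while `store.original("B7")` still returns what the source table said, and `store.history("B7")` answers who, when and why.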
Census RDF: arch
• RDF Data Cube Vocabulary (cell data)
• D2S Vocabulary (layout data)
• Open Annotation Core Data Model (annotation data)
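As a rough idea of what cell data looks like in the RDF Data Cube Vocabulary, the sketch below emits N-Triples for a single observation. Only `qb:Observation`, `qb:dataSet`, `rdf:type` and `xsd:integer` are real vocabulary terms; the `http://example.org/census/` namespace, the dimension and measure properties, and the `observation_triples` helper are assumptions for illustration, since the slides do not show the project's actual URIs.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
QB = "http://purl.org/linked-data/cube#"
XSD_INT = "http://www.w3.org/2001/XMLSchema#integer"
EX = "http://example.org/census/"  # hypothetical namespace


def observation_triples(dataset, cell_id, dimensions, population):
    """Return N-Triples lines describing one census cell as a qb:Observation."""
    obs = f"<{EX}obs/{cell_id}>"
    lines = [
        f"{obs} <{RDF_TYPE}> <{QB}Observation> .",
        f"{obs} <{QB}dataSet> <{EX}dataset/{dataset}> .",
    ]
    # One triple per dimension (hypothetical dimension properties and codes).
    for prop, value in dimensions.items():
        lines.append(f"{obs} <{EX}dimension/{prop}> <{EX}code/{value}> .")
    # The cell's number becomes the measure (hypothetical measure property).
    lines.append(f'{obs} <{EX}measure/population> "{population}"^^<{XSD_INT}> .')
    return lines


triples = observation_triples(
    "1899-occupations", "B7", {"municipality": "m1", "sex": "f"}, 42
)
print("\n".join(triples))
```

Layout and annotation data would hang off the same observation URI using the other two vocabularies, which keeps the three layers separable.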
Census RDF: cell data
Census RDF: layout data
Census RDF: annotation data
Querying the RDF’d census
Not ready-to-publish RDF
• Disconnected graphs (but 279,136 possible variable mappings!)
• Complex & non-homogeneous SPARQL queries
• Contradictory annotation statements
• Drifted concepts
  – Tile settler -> roof repairer
  – Shoemaker (works with leather) -> shoemaker (owns a company)
New challenges
• Dynamic ontologies
  – Different concept formalizations depending on the time frame
  – Subjective definitions (contested concepts)
• Partitions and counting
  – Cannot merge counts of non-aligned concepts
  – Infer individuals?
• Format round-tripping
  – On-demand XLS, CSV, RDF, RDB conversions with(out) data loss
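The partition problem can be made concrete with the age brackets from the semantic heterogeneity slide (14-18, 19-20 vs. 14-15, 16-20). One conservative option, sketched below, is to compare counts only on the finest bins that both partitions can be regrouped into without splitting any bracket. The helpers `common_bins` and `regroup` and all counts are invented for illustration, and the code assumes sorted, contiguous, inclusive integer ranges.

```python
def common_bins(a, b):
    """Finest bins that are unions of whole bins in both partitions a and b."""
    if a[0][0] != b[0][0] or a[-1][1] != b[-1][1]:
        raise ValueError("partitions cover different spans")
    # A boundary is usable only if both partitions end a bin there.
    shared_ends = sorted({hi for _, hi in a} & {hi for _, hi in b})
    bins, lo = [], a[0][0]
    for hi in shared_ends:
        bins.append((lo, hi))
        lo = hi + 1
    return bins


def regroup(counts, target):
    """Sum counts of fine bins into the target bin that contains them."""
    out = {t: 0 for t in target}
    for (lo, hi), n in counts.items():
        for t in target:
            if t[0] <= lo and hi <= t[1]:
                out[t] += n
                break
    return out


# Invented counts for two census years with differently designed age variables.
census_a = {(14, 18): 100, (19, 20): 40}
census_b = {(14, 15): 30, (16, 20): 120}

target = common_bins(sorted(census_a), sorted(census_b))
merged_a = regroup(census_a, target)
merged_b = regroup(census_b, target)
```

For these brackets the only shared bin is 14-20, so the two years become comparable only at that coarser level; anything finer would require inferring individuals.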
Thank you!
Questions, suggestions?
http://cedar-project.nl/
http://www.data2semantics.org/