cedar & prelida preservation of linked socio-historical data

Post on 01-Jul-2015

DESCRIPTION

by Albert Meroño, presented at the 3rd PRELIDA Consolidation and Dissemination Workshop, Riva, Italy, October 17, 2014. More information about the workshop at prelida.eu

TRANSCRIPT

CEDAR & PRELIDA: Preservation of Linked Socio-Historical Data

Albert Meroño-Peñuela (@albertmeronyo)

PRELIDA consolidation workshop @ ISWC, 17-10-2014

CEDAR: Harmonizing Historical Census Data in the Semantic Web

CEDAR: Source Historical Data

Dutch Historical Censuses (1795-1971)

[Public Historical Statistical Data]

From scans to spreadsheets

CEDAR goal: cross queries

[Timeline: 1795, 1830, 1889, 1930, 1971 (through ~3K tables)]

Towards 5-star Census Data

>1 year ago

1 year ago

• Web publishable

• Machine processable

• Dynamic schema

• Easily link with other datasets

Why with semantic technology?

• Web publishable, human & machine readable

• Finer granularity level (cell level)

• Statistical comparability by leveraging semantic descriptions

• Provenance

• Harmonization through linkage to other datasets (the 5th star)

RDF Data Cube

“There are many situations where it would be useful to be able to publish multi-dimensional data, such as statistics, on the web in such a way that they can be linked to related data sets and concepts.”

RDF Data Cube vocabulary (QB)

• SDMX compatible

• Defines cubes as a set of observations that consist of dimensions, measures and attributes

• Dimensions: time period, region, sex (qb:DimensionProperty)

• Measure: population life expectancy (qb:MeasureProperty)

• Attribute: unit of measure = years, metadata status = measured (qb:AttributeProperty)

Observation: “the measured life expectancy of males in Newport in the period 2004-2006 is 76.7 years”
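The observation above can be written down as a small data structure. The following is a minimal sketch, not CEDAR code: the `Observation` class, the `ex:` prefix and property names such as `ex:lifeExpectancy` are hypothetical, and only `qb:Observation` and `sdmx-dimension:sex` are taken from the vocabularies named in this deck.

```python
from dataclasses import dataclass, field


@dataclass
class Observation:
    """One qb:Observation: dimensions locate the value, the measure holds it,
    attributes qualify it."""
    uri: str
    dimensions: dict = field(default_factory=dict)
    measures: dict = field(default_factory=dict)
    attributes: dict = field(default_factory=dict)

    def to_turtle(self) -> str:
        """Serialize as one Turtle subject with predicate-object pairs."""
        pairs = [("a", "qb:Observation")]
        for group in (self.dimensions, self.measures, self.attributes):
            pairs.extend(group.items())
        body = " ;\n".join(f"    {p} {o}" for p, o in pairs)
        return f"{self.uri}\n{body} ."


# Hypothetical identifiers (ex:...) standing in for the Newport example.
obs = Observation(
    uri="ex:obs1",
    dimensions={
        "ex:refPeriod": '"2004-2006"',
        "ex:refArea": "ex:Newport",
        "sdmx-dimension:sex": "sdmx-code:sex-M",
    },
    measures={"ex:lifeExpectancy": '"76.7"^^xsd:decimal'},
    attributes={"ex:unitMeasure": "ex:years"},
)
print(obs.to_turtle())
```

Keeping dimensions, measures and attributes in separate dictionaries mirrors the three QB property types, so a missing dimension is easy to spot before serialization.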

CEDAR Integrator

https://github.com/CEDAR-project/Integrator

Raw data

cedar:BRT_1889_08_T1-S0-K17 a tablink:DataCell ;
    rdfs:label "K17" ;
    tablink:value "12.0" ;
    tablink:dimension cedar:BRT_1889_08_T1-S0-A8 ;
    tablink:dimension cedar:BRT_1889_08_T1-S0-K6 ;
    tablink:dimension cedar:BRT_1889_08_T1-S0-J3 ;
    tablink:dimension cedar:BRT_1889_08_T1-S0-K4 ;
    tablink:dimension cedar:BRT_1889_08_T1-S0-K5 ;
    tablink:dimension cedar:BRT_1889_08_T1-S0-B8 ;
    tablink:dimension cedar:BRT_1889_08_T1-S0-C12 ;
    tablink:dimension cedar:BRT_1889_08_T1-S0-E17 ;
    tablink:dimension cedar:BRT_1889_08_T1-S0-F17 ;
    tablink:sheet cedar:BRT_1889_08_T1-S0 .

Harmonization Rules as Open Annotations

cedar:BRT_1889_08_T1-S0-K4-mapping a oa:Annotation ;
    oa:hasBody cedar:BRT_1889_08_T1-S0-K4-mapping-body ;
    oa:hasTarget cedar:BRT_1889_08_T1-S0-K4 ;
    oa:serializedAt "2014-09-24"^^xsd:date ;
    oa:serializedBy <https://github.com/CEDAR-project/Integrator> ;
    prov:wasGeneratedBy cedar:BRT_1889_08_T1-S0-mapping-activity .

cedar:BRT_1889_08_T1-S0-K4-mapping-body a rdfs:Resource ;
    sdmx-dimension:sex sdmx-code:sex-F .

Harmonized RDF Data Cube

cedar:BRT_1889_02_T1-S0-K17-h a qb:Observation ;
    cedar:population "12"^^xsd:decimal ;
    maritalstatus:maritalStatus maritalstatus:single ;
    cedarterms:occupationPosition cedarterms:job-D ;
    sdmx-dimension:sex sdmx-code:sex-F ;
    cedarterms:occupation hisco:88030 ;
    sdmx-dimension:refArea gg:11150 ;
    prov:wasDerivedFrom cedar:BRT_1889_08_T1-S0-K17 ;
    prov:wasGeneratedBy cedar:BRT_1889_08_T1-S0-K17-activity .
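How a harmonization rule turns a raw cell into a harmonized observation can be sketched as follows. This is an illustrative assumption about the workflow, not the Integrator's actual code: the `harmonize` function and the dictionary layout are hypothetical, while the identifiers are copied from the examples above.

```python
# Hypothetical in-memory form of the raw cell shown earlier (two of its
# dimensions only, to keep the example short).
raw_cell = {
    "id": "cedar:BRT_1889_08_T1-S0-K17",
    "value": "12.0",
    "dimensions": [
        "cedar:BRT_1889_08_T1-S0-A8",
        "cedar:BRT_1889_08_T1-S0-K4",
    ],
}

# Mapping bodies keyed by the annotated target (the oa:hasTarget of each rule).
rules = {
    "cedar:BRT_1889_08_T1-S0-K4": {"sdmx-dimension:sex": "sdmx-code:sex-F"},
}


def harmonize(cell, rules):
    """Build a harmonized observation from a raw cell and mapping rules.

    Raw dimensions without a rule are reported rather than guessed.
    """
    observation = {
        "id": cell["id"] + "-h",
        "cedar:population": float(cell["value"]),
        "prov:wasDerivedFrom": cell["id"],
    }
    unmapped = []
    for dim in cell["dimensions"]:
        body = rules.get(dim)
        if body is None:
            unmapped.append(dim)
        else:
            observation.update(body)
    return observation, unmapped


observation, unmapped = harmonize(raw_cell, rules)
print(observation)
print("no rule yet for:", unmapped)
```

Returning the unmapped dimensions separately makes the gaps in the rule set visible instead of silently dropping them.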

Classification Systems and Concept Schemes

• Some missing harmonized dimensions!

• Encode all variables and their values using concept schemes

• Some already exist

– Which ones? How many of them?

– Where?

– By whom?

– Are they used at all? Can I reuse them?

• Some need to be created

– Manual and expert knowledge based

– Can we do it automatically? Or assist the process?

[Diagram of datasets: Dutch Historical Censuses (CEDAR), Dutch Ships and Sailors, Gemeentegeschiedenis.nl, HISCO, ICONCLASS, Dutch Historical Religions, Dutch Historical House Types]

Existing dimensions

• HISCO (http://historyofwork.iisg.nl/)

• Gemeentegeschiedenis.nl

Existing LSD dimensions

• P1: Discoverability? How to discover dimensions created by others?

• P2: Reusability? How often are dimensions reused? Can we reuse dimensions created by others?

• P3: Relevance? What’s the size of LSD?

LSD Dimensions

http://lsd-dimensions.org/

https://github.com/albertmeronyo/LSD-Dimensions

Hourly JSON-LD dumps

http://lsd-dimensions.org/

Existing LSD dimensions

• P1: Discoverability? How to discover dimensions created by others? Answer: LSD Dimensions

• P2: Reusability? How often are dimensions reused? Can we reuse dimensions created by others? Answer: logarithmic law / probably yes

• P3: Relevance? What’s the size of LSD? Answer: ~7.9% of the LOD cloud
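Reuse (P2) can be measured by counting in how many endpoints each dimension occurs. The sketch below uses invented endpoint names and a handful of dimension identifiers purely for illustration; it is not data from LSD Dimensions, and `reuse_counts` is a hypothetical helper.

```python
from collections import Counter

# Hypothetical (endpoint, dimension) pairs, as a crawler of SPARQL endpoints
# might collect them. Endpoint names are invented for this example.
usages = [
    ("endpoint-a", "sdmx-dimension:refArea"),
    ("endpoint-a", "sdmx-dimension:sex"),
    ("endpoint-b", "sdmx-dimension:refArea"),
    ("endpoint-b", "cedarterms:occupation"),
    ("endpoint-c", "sdmx-dimension:refArea"),
    ("endpoint-c", "sdmx-dimension:sex"),
]


def reuse_counts(pairs):
    """Count the number of distinct endpoints that use each dimension."""
    # set() drops repeated sightings of a dimension within one endpoint.
    return Counter(dim for _, dim in set(pairs))


counts = reuse_counts(usages)
for dim, n in counts.most_common():
    print(dim, n)
```

Sorting these counts by rank is the kind of view in which a skewed reuse distribution, with a few widely shared dimensions and a long tail of rarely reused ones, would show up.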

Creating new LSD Dimensions

• CEDAR needs concept schemes for

– Historical religious denominations (i.e. religions in the Netherlands in the 18th-20th centuries)

– Historical occupations (same place and period)

– Historical building types (same place and period)

https://github.com/CEDAR-project/TabCluster

TabCluster

Leverages

● Lexical properties

○ Hierarchical clustering in Python scipy

○ String distances

● Semantic properties (LOD tagging)

○ skos:Concept of most frequent cluster-term

○ Closest common skos:broader skos:Concept of all cluster-terms
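The lexical step can be sketched with the standard library alone. This is not TabCluster's code: it swaps scipy for a small hand-written single-linkage agglomerative clustering, uses `difflib` similarity as the string distance, and the threshold and sample terms are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def distance(a: str, b: str) -> float:
    """String distance in [0, 1]; 0 means identical (case-insensitive)."""
    return 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cluster(terms, threshold=0.4):
    """Single-linkage agglomerative clustering.

    Repeatedly merges the two closest clusters and stops once the closest
    pair is farther apart than `threshold`.
    """
    clusters = [[t] for t in terms]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(distance(a, b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > threshold:
            break
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters


# Hypothetical spelling variants of census table headers.
terms = ["Roomsch Katholiek", "Roomsch-Katholieken", "Nederduitsch Hervormd",
         "Nederduitsche Hervormden", "Israelieten"]
for group in cluster(terms):
    print(group)
```

With these inputs the spelling variants of each denomination end up together, which is the grouping a semantic step could then tag with a single skos:Concept.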

Compatibility? Remixability? Reusability?

Sarven Capadisli, Albert Meroño-Peñuela, Sören Auer, Reinhard Riedl. “Semantic Similarity and Correlation of Linked Statistical Data Analysis”. 2nd Int. Workshop on Semantic Statistics (SemStats) ISWC 2014.

Concept Drift

Census classification of occupations as of 1859, 1889 and 1899 (one tree per census year)

• Root node is void

• Depth 1: occupation groups

• Leaves: actual occupations
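The tree shape described above (void root, occupation groups at depth 1, occupations as leaves) makes it straightforward to compare classifications across census years. The sketch below is only an illustration: the occupation labels, the groupings and the `moved_leaves` helper are hypothetical and not taken from the censuses.

```python
# Each classification maps an occupation group (depth 1) to its leaves.
# The void root is implicit. All labels are invented for illustration.
classification_old = {
    "Agriculture": {"farmer", "gardener"},
    "Trade": {"merchant", "innkeeper"},
}
classification_new = {
    "Agriculture": {"farmer", "gardener"},
    "Trade": {"merchant"},
    "Services": {"innkeeper"},
}


def leaf_to_group(classification):
    """Invert the tree: map every leaf occupation to its group."""
    return {leaf: group
            for group, leaves in classification.items()
            for leaf in leaves}


def moved_leaves(old, new):
    """Return {leaf: (old_group, new_group)} for leaves whose group changed."""
    old_index, new_index = leaf_to_group(old), leaf_to_group(new)
    return {leaf: (old_index[leaf], new_index[leaf])
            for leaf in old_index.keys() & new_index.keys()
            if old_index[leaf] != new_index[leaf]}


print(moved_leaves(classification_old, classification_new))
```

Leaves that change group between two years are one simple, observable signal of drift; occupations that appear or disappear would need a separate check.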

Concept Drift

[Diagram: upper ontologies (HISCO, AC); year-dependent ontologies (1859, 1869, 1879); open questions marked "?"]

Preserving CEDAR

• DANS-EASY as backend (http://easy.dans.knaw.nl/)

• Archived objects: Turtle snapshots

– 20 GB uncompressed, 200 MB compressed (per snapshot)

– Versioning (stats on current release)

• Users still need to

– SPARQL the data => bring up the endpoint on demand

– Run analytics on the data => outsource statistical analysis
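A snapshot-archiving step of the kind outlined above could look like the following sketch. The file naming scheme, the `archive_snapshot` helper and the choice of gzip with a SHA-256 checksum are assumptions made for illustration; the slides do not say how the snapshots are compressed or deposited in DANS-EASY.

```python
import gzip
import hashlib
import shutil
import tempfile
from pathlib import Path


def archive_snapshot(turtle_path, version, out_dir):
    """Compress one Turtle snapshot; return (archive path, SHA-256 digest)."""
    out_dir.mkdir(parents=True, exist_ok=True)
    archive = out_dir / f"{turtle_path.stem}-{version}.ttl.gz"
    with open(turtle_path, "rb") as src, gzip.open(archive, "wb") as dst:
        # Streams in chunks, so a large dump is never loaded whole.
        shutil.copyfileobj(src, dst)
    digest = hashlib.sha256()
    with open(archive, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return archive, digest.hexdigest()


# Demo on a tiny throwaway file; "v1" is a hypothetical version label.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "cedar.ttl"
    source.write_text("cedar:x a qb:Observation .\n" * 1000)
    archive, checksum = archive_snapshot(source, "v1", Path(tmp) / "archive")
    print(archive.name, checksum)
```

Storing the checksum next to each versioned archive gives a cheap fixity check when a snapshot is later restored to bring an endpoint back up.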

Thank you

Questions, suggestions, comments most welcome

@albertmeronyo

http://www.cedar-project.nl

http://krr.cs.vu.nl/

http://easy.dans.knaw.nl/

http://lsd-dimensions.org/

Me in 6 tweets

http://www.albertmeronyo.org

• Background: Computer Science, Web hacker, AI & Law

• PhD candidate at the VU University Amsterdam, DANS, and eHumanities group (KNAW)

• Topic: Semantic Web for the Humanities

• CEDAR project (2012-2015): harmonized historical Dutch censuses in the Semantic Web

• Problem: statistical data publishing, concept drift and dynamics of meaning

• Last paper: What is Linked Historical Data? (EKAW 2014)
