Publish-Time Data Integration for Open Data Platforms

Dipl. Medien-Inf. Julian Eberius | Publish-Time Data Integration for Open Data Platforms | WOD 2013 | Julian Eberius, Patrick Damme, Katrin Braunschweig, Maik Thiele and Wolfgang Lehner (TU Dresden)

TRANSCRIPT

Page 1: Publish-Time Data Integration for Open Data Platforms

Dipl. Medien-Inf. Julian Eberius |

Publish-Time Data Integration for Open Data Platforms

WOD 2013

Julian Eberius, Patrick Damme, Katrin Braunschweig, Maik Thiele and Wolfgang Lehner (TU Dresden)

> Motivation

> Premise

Reusability
• Standardization
• Integration

Free-For-All
• Many contributors
• Many domains
• Lack of standards

Continuous publishing without standardization will continuously increase heterogeneity on the platform.

Is there a solution without predefined schemata / ontologies?

> Problem

• Different names for attributes of the same meaning
• Different meanings for attributes with same values

> System Overview

> Offline

Domain Clustering
• Bottom-up clustering on schema level
• Used online to limit the search space
• But also to improve accuracy

Domain Statistics
• Create different forms of value set synopses
• Used to save comparison work online
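The bottom-up clustering step can be illustrated with a small sketch. The slides do not name a similarity measure or linkage strategy, so everything here is an assumption: datasets are reduced to their sets of attribute names, similarity is Jaccard overlap, a cluster's schema is the union of its members' attribute names, and the function name `cluster_schemas` and the threshold of 0.3 are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_schemas(schemas: Dict[str, Set[str]],
                    threshold: float = 0.3) -> List[Set[str]]:
    """Bottom-up (agglomerative) clustering on schema level.

    Starts with one cluster per dataset and repeatedly merges the two
    most similar clusters until no pair reaches the threshold.
    Assumption: a cluster's schema is the union of its members' attributes.
    """
    clusters = [({name}, set(attrs)) for name, attrs in schemas.items()]
    while len(clusters) > 1:
        # Find the most similar pair of clusters.
        (i, j), best = max(
            (((i, j), jaccard(clusters[i][1], clusters[j][1]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda pair: pair[1],
        )
        if best < threshold:
            break
        merged = (clusters[i][0] | clusters[j][0],
                  clusters[i][1] | clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return [members for members, _ in clusters]


# Hypothetical example datasets, not taken from the slides.
schemas = {
    "budget_2011": {"year", "department", "amount"},
    "budget_2012": {"year", "department", "amount", "currency"},
    "bus_stops": {"stop_name", "latitude", "longitude"},
    "tram_stops": {"stop_name", "latitude", "longitude", "line"},
}
domains = cluster_schemas(schemas)
```

With these inputs the two budget datasets end up in one domain and the two stop datasets in another, because the cross-domain schemas share no attribute names.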

> Online

Input
• New dataset ds+ with value sets vs+

Output
• Attribute name suggestions

Constraint
• Instantaneous response time (Publish-Time!)

Basic Approach
• Assign ds+ to a domain based on schema information
• Generate recommendations based on values
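The first step of the basic approach, assigning ds+ to a domain from schema information alone, might look like the following sketch. It assumes each domain is summarized by the set of attribute names seen in it and that the best domain is the one with the highest Jaccard overlap; the slides do not specify the classifier, and `assign_domain` is a hypothetical name.

```python
from typing import Dict, Optional, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def assign_domain(new_schema: Set[str],
                  domains: Dict[str, Set[str]]) -> Optional[str]:
    """Return the domain whose attribute vocabulary best matches ds+.

    Returns None when no domain shares any attribute name with ds+.
    Assumption: schema overlap alone decides the domain.
    """
    best = max(domains, key=lambda d: jaccard(new_schema, domains[d]),
               default=None)
    if best is None or jaccard(new_schema, domains[best]) == 0.0:
        return None
    return best


# Hypothetical domain vocabularies.
domain_vocab = {
    "finance": {"year", "department", "amount"},
    "transport": {"stop_name", "latitude", "longitude"},
}
chosen = assign_domain({"year", "amount", "note"}, domain_vocab)
```

Here `chosen` is `"finance"`; a schema with no known attribute names yields `None`, in which case a system would presumably have to fall back to searching the whole corpus.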

> Naiv-C

Most Naive Approach:
• Iterate over corpus C
• Return the names of all attributes with sufficiently similar value sets
• Order them by overall frequency in the corpus

Properties:
• Finds all similar value sets
• Generates the largest possible number of recommendations
• Extremely long run time
• Might generate too many recommendations
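A minimal sketch of Naiv-C follows, under stated assumptions: the corpus is modeled as a flat list of (attribute name, value set) pairs, "sufficiently similar" means Jaccard similarity at or above a hypothetical threshold `min_sim`, and ties in frequency are broken alphabetically. None of these details are given on the slide.

```python
from collections import Counter
from typing import List, Set, Tuple

# Assumed corpus model: one (attribute name, value set) pair per column.
Corpus = List[Tuple[str, Set[str]]]


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def naive_c(new_values: Set[str], corpus: Corpus,
            min_sim: float = 0.5) -> List[str]:
    """Scan the whole corpus and recommend attribute names.

    Every attribute whose value set is similar enough contributes its
    name; names are ordered by their overall frequency in the corpus.
    """
    name_freq = Counter(name for name, _ in corpus)
    matches = {name for name, values in corpus
               if jaccard(new_values, values) >= min_sim}
    return sorted(matches, key=lambda n: (-name_freq[n], n))


# Hypothetical corpus.
corpus = [
    ("country", {"DE", "FR", "IT"}),
    ("country", {"DE", "FR", "ES"}),
    ("nation", {"DE", "FR", "IT", "ES"}),
    ("city", {"Dresden", "Paris"}),
    ("country", {"US", "CA"}),
]
suggestions = naive_c({"DE", "FR", "IT"}, corpus)
```

For this input `suggestions` is `["country", "nation"]`. The full scan compares the new value set against every column in the corpus, which is the source of the long run time noted above.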

> Naiv-D

Domain-based Approach:
• Classify incoming dataset into domain D
• Iterate over domain D
• Continue as in Naiv-C

Properties:
• Finds fewer similar value sets
• Shorter run time
• Only generates recommendations from one domain
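The domain restriction can be sketched as below. The assumptions are: domains are given as a mapping from domain name to a list of (attribute name, value set) pairs, classification picks the domain sharing the most attribute names with the incoming schema, and frequency is counted within the chosen domain only. The slide does not settle any of these points, and `naive_d` is a hypothetical name.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

ValueSets = List[Tuple[str, Set[str]]]  # (attribute name, value set)


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def naive_d(new_schema: Set[str], new_values: Set[str],
            domains: Dict[str, ValueSets],
            min_sim: float = 0.5) -> Tuple[str, List[str]]:
    """Classify ds+ into one domain, then scan only that domain."""
    # Assumed classifier: count shared attribute names per domain.
    # If no domain shares a name, the first domain is chosen.
    def shared_names(d: str) -> int:
        return len(new_schema & {name for name, _ in domains[d]})

    domain = max(domains, key=shared_names)
    candidates = domains[domain]
    freq = Counter(name for name, _ in candidates)
    matches = {name for name, values in candidates
               if jaccard(new_values, values) >= min_sim}
    return domain, sorted(matches, key=lambda n: (-freq[n], n))


# Hypothetical domains.
domains = {
    "geo": [("country", {"DE", "FR"}), ("capital", {"Berlin", "Paris"})],
    "sales": [("market", {"DE", "FR"}), ("revenue", {"10", "20"})],
}
domain, suggestions = naive_d({"capital", "population"}, {"DE", "FR"}, domains)
```

The result is `("geo", ["country"])`. The equally similar `market` column in the `sales` domain is never inspected, which illustrates both the shorter run time and the loss of recommendations from other domains.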

> Cluster / Analysis-D

Synopsis-based Approaches:
• Create representative value sets (RVS) for datasets in domain
• Match only against RVS

Clustering-D
• Cluster VS in domain, create RVS
• Pre-compute recommendation list as all attribute names of value sets participating in final cluster
• Online: find single most similar RVS in D

Analysis-D
• Create RVS directly for sets of VS with equal name
• Online: find set of similar RVS in D
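Both synopsis-based variants can be sketched side by side. The slides do not say how an RVS is built or how value sets are clustered, so this sketch assumes an RVS is simply the union of the value sets it represents, uses a greedy single-pass clustering in place of whatever clustering the authors used, and measures similarity with Jaccard overlap. All function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

ValueSets = List[Tuple[str, Set[str]]]  # (attribute name, value set)


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_rvs_by_name(domain: ValueSets) -> Dict[str, Set[str]]:
    """Analysis-D, offline: one RVS per attribute name (union, assumed)."""
    rvs: Dict[str, Set[str]] = defaultdict(set)
    for name, values in domain:
        rvs[name] |= values
    return dict(rvs)


def build_rvs_by_cluster(domain: ValueSets, min_sim: float = 0.5
                         ) -> List[Tuple[Set[str], List[str]]]:
    """Clustering-D, offline: greedy clustering of value sets (assumed).

    Each cluster keeps its RVS and the pre-computed list of all
    attribute names of the value sets that joined it.
    """
    clusters: List[list] = []  # each entry: [rvs, names]
    for name, values in domain:
        for cluster in clusters:
            if jaccard(values, cluster[0]) >= min_sim:
                cluster[0] |= values
                if name not in cluster[1]:
                    cluster[1].append(name)
                break
        else:
            clusters.append([set(values), [name]])
    return [(rvs, names) for rvs, names in clusters]


def recommend_analysis(new_values: Set[str], rvs_by_name: Dict[str, Set[str]],
                       min_sim: float = 0.5) -> List[str]:
    """Analysis-D, online: all names with a similar RVS, best first."""
    scored = [(jaccard(new_values, rvs), name)
              for name, rvs in rvs_by_name.items()]
    ranked = sorted(scored, key=lambda s: (-s[0], s[1]))
    return [name for sim, name in ranked if sim >= min_sim]


def recommend_clustering(new_values: Set[str],
                         clusters: List[Tuple[Set[str], List[str]]]
                         ) -> List[str]:
    """Clustering-D, online: name list of the single most similar RVS."""
    if not clusters:
        return []
    _, names = max(clusters, key=lambda c: jaccard(new_values, c[0]))
    return names


# Hypothetical domain contents.
domain_vs = [
    ("country", {"DE", "FR", "IT"}),
    ("nation", {"DE", "FR", "ES"}),
    ("country", {"DE", "IT", "ES"}),
    ("city", {"Dresden", "Paris"}),
]
new_vs = {"DE", "FR", "ES"}
by_analysis = recommend_analysis(new_vs, build_rvs_by_name(domain_vs))
by_clustering = recommend_clustering(new_vs, build_rvs_by_cluster(domain_vs))
```

In both variants the online step compares the new value set against a handful of RVS instead of every value set in the domain. Clustering-D returns one pre-computed list, while Analysis-D ranks each name separately, so the two can order the same suggestions differently.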

> Evaluation

> Quality I

> Quality II

> Runtimes

> Cluster Size

> Conclusion

We need statistics-based data integration at publish time to limit the growth of heterogeneity in large public dataset corpora.

Lots of work to do: clustering, matching, statistics, indexing, performance.