
FleDEx: Flexible Data Exchange

Eli Cortez, Altigran Silva
Federal University of Amazonas, Brazil

Filipe Mesquita, Denilson Barbosa
University of Alberta, Canada

WIDM'07

The Data Exchange Problem is…

…translating data from a source schema to a target schema

[Figure: original data under the source schema is translated into translated data under the target schema]

Existing Solutions are Complex for Non-Experts

Data are kept in databases

Tools such as Clio are used to help translate from one database schema to another

Non-experts have neither the skills nor the resources to set up databases and use mapping tools

[Figure: mapping from a source DB to a target DB]

"Data Exchange for the Masses"

We propose a lightweight framework for non-experts to share data, where:

Data are kept in collections outside a DBMS

A schema is not always available

Users move only small portions of a data collection at a time

Two users may exchange data once and never again

A Motivating Application

Peer-to-peer data sharing systems

Several formats (XML, CSV, ...)

Casual connections

[Figure: two personal collections shared over the Internet]

Example

Source collection – CSV format:

Artist, Instrument, Album, Price
M. Davis, Trumpet, Kind of Blue, $7.97
L. Armstrong, Trumpet, On the Road, $5.98
J. Coltrane, Saxophone, Giant Steps, $10.99

Target collection – XML format:

…
<artist name="Miles Davis">
  <CD title="Kind of Blue" style="Instrumental">
    <song title="So What"/>
    <song title="All Blues"/>
  </CD>
</artist>
<artist name="Norah Jones">
  <CD title="Not Too Late"/>
</artist>
…

The source is translated according to the target: the target's schema is NOT provided, but the target collection's data IS available!

Data Exchange is NOT Data Integration

Fagin et al. [Theor. Comp. Sci.'05], who laid down the foundations of the data exchange problem, wrote:

"A more significant difference between data exchange and data integration is that […] we have to actually materialize a finite target instance that best reflects the given source instance. In data integration no such exchange of data is required"

Data Exchange and Schema Matching

Clio [VLDB'02] translates data between databases once their schemas are matched

Unlike Clio, our approach requires no setup investment and no user intervention

Several solutions for matching schemas are discussed in Rahm and Bernstein's survey [SIGMOD'01]

Most of them exploit schema information (e.g. labels), which does not work well in our setting, as our experiments show

FleDEx Framework

FleDEx Data Model (FDM): a minimalist, generic hierarchical data model that captures the essential features of XML and tabular data

Data Fitting: an algorithm for restructuring instances of our data model according to a target schema

FDM Instance is a Tree

Entities are represented by round rectangles

Attributes are textual nodes stemming out of entities

Attribute values are shown in italics
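The slides do not prescribe an implementation, but a minimal Python sketch of an FDM instance as a tree might look as follows (the Entity class and its field names are assumptions for illustration):

from dataclasses import dataclass, field

@dataclass
class Entity:
    """An FDM entity: an internal tree node with textual attributes
    and nested child entities."""
    type: str                                       # entity type, e.g. "artist"
    attributes: dict = field(default_factory=dict)  # attribute name -> textual value
    children: list = field(default_factory=list)    # nested Entity instances

# Example instance: an artist entity with one nested CD entity.
cd = Entity("CD", {"title": "Not Too Late"})
artist = Entity("artist", {"name": "Norah Jones"}, [cd])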

FDM Schema

Boxes represent entity types

Ovals represent attributes

The arrows indicate the attributes of each entity type and the ways entities can be nested

Hollow arrows indicate optional attributes

Converting XML to FDM

<artist name="Norah Jones">
  <CD title="Not Too Late">
    <song track="1">
      <title>Wish I could</title>
    </song>
    <song track="2">
      <title>Sinkin' soon</title>
    </song>
  </CD>
</artist>
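A rough sketch of this conversion, building on the hypothetical Entity class above (treating text-only elements such as <title> as attributes is an assumption based on the example):

import xml.etree.ElementTree as ET

def xml_to_fdm(elem):
    """Convert an XML element to an FDM entity: XML attributes and
    text-only child elements become FDM attributes; all other child
    elements become nested entities."""
    entity = Entity(elem.tag, dict(elem.attrib))
    for child in elem:
        text = (child.text or "").strip()
        if len(child) == 0 and not child.attrib and text:
            entity.attributes[child.tag] = text   # e.g. <title>Wish I could</title>
        else:
            entity.children.append(xml_to_fdm(child))
    return entity

fdm = xml_to_fdm(ET.fromstring(
    '<artist name="Norah Jones"><CD title="Not Too Late">'
    '<song track="1"><title>Wish I could</title></song>'
    '<song track="2"><title>Sinkin\' soon</title></song>'
    '</CD></artist>'))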

The Data Fitting Algorithm

1. Find a mapping of corresponding attributes in source and target schemas

2. Translate instances using such a mapping

[Figure: attribute correspondences drawn between the source schema and the target schema]

Similarity Components

Keyword-based similarity: attribute vocabularies, e.g. {Davis, Norah, …} vs. {Miles, Davis, …}

Value-based similarity: shared values, e.g. "Kind of Blue" vs. "Kind of Blue"

Label similarity: names of entities and attributes, e.g. artist/name vs. album/artist
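The slides do not give the exact formulas, but plausible sketches of the three components might look like this (Jaccard overlap and an edit-distance ratio are assumptions, not the paper's definitions):

from difflib import SequenceMatcher

def keyword_similarity(vocab_a, vocab_b):
    """Overlap of two attributes' token vocabularies (assumed Jaccard)."""
    union = vocab_a | vocab_b
    return len(vocab_a & vocab_b) / len(union) if union else 0.0

def value_similarity(values_a, values_b):
    """Fraction of values shared verbatim, e.g. "Kind of Blue" on both sides."""
    if not values_a or not values_b:
        return 0.0
    return len(values_a & values_b) / min(len(values_a), len(values_b))

def label_similarity(label_a, label_b):
    """String similarity of entity/attribute names, e.g. artist/name vs. album/artist."""
    return SequenceMatcher(None, label_a.lower(), label_b.lower()).ratio()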

A Bayesian Network for Combining Components

The OR operator:

K – keyword similarity
V – value similarity
C – content similarity
L – label similarity
F – final similarity
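Assuming the OR node behaves like the standard noisy-OR of Bayesian networks (the slide names the operator but not the formula), the final similarity F is high whenever any component provides evidence:

def combine_or(component_scores):
    """Noisy-OR combination: F = 1 - prod(1 - s) over the
    component similarities (keyword, value, label, ...)."""
    miss = 1.0
    for s in component_scores:
        miss *= (1.0 - s)
    return 1.0 - miss

# e.g. keyword = 0.4, value = 0.7, label = 0.2:
# F = 1 - (0.6 * 0.3 * 0.8) = 0.856
print(combine_or([0.4, 0.7, 0.2]))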

Avoiding Redundancy

Consider translating both "genre→album" and "song→track"…

…then we would have to repeat all tracks for each album's style (a Cartesian product)!
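A tiny illustration of the problem: flattening two independent sibling lists into one relation multiplies them out (the values below are illustrative):

from itertools import product

genres = ["Jazz", "Instrumental"]   # would map to album attributes
tracks = ["So What", "All Blues"]   # would map to track attributes

# Flattening unrelated siblings forces a Cartesian product:
rows = list(product(genres, tracks))
# 4 rows: every track is repeated for every genre, creating spurious pairs.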

Conflicts

To avoid this, we say that genre→album has a conflict with song→track.

Consequently, (a) has a conflict with both (b) and (c)

Solution: remove (a), or remove (b) and (c)

Thus, we are looking for the best mapping without conflicts, which is an NP-hard optimization problem

Solving Conflicts

Let G(V,E) be a graph where V contains the candidate entity pairs and E contains an edge for each conflict between them

We want to remove low-scoring entity pairs that produce conflicts

This is equivalent to finding a minimum vertex cover in G

We use a heuristic to obtain approximate results

[Figure: genre→album (score 0.5) in conflict with song→track (score 0.9)]
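The slides do not say which heuristic is used; a simple greedy sketch that repeatedly drops the lowest-scored pair involved in a conflict could look like this:

def resolve_conflicts(scores, conflicts):
    """Greedy vertex-cover-style heuristic: while any conflict edge
    remains, remove the lowest-scored candidate pair it touches.
    scores: candidate pair -> similarity; conflicts: set of frozenset edges."""
    kept = dict(scores)
    active = set(conflicts)
    while active:
        in_conflict = set().union(*active)
        victim = min(in_conflict, key=lambda pair: kept[pair])
        del kept[victim]
        active = {edge for edge in active if victim not in edge}
    return kept

# From the figure: genre->album (0.5) conflicts with song->track (0.9);
# the heuristic drops genre->album and keeps song->track.
resolve_conflicts({"genre->album": 0.5, "song->track": 0.9},
                  {frozenset({"genre->album", "song->track"})})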

Final Attribute Mapping

The final mapping is an injective function with no conflicts

Translating Instances

This entails not only relabeling but may also involve structural changes (e.g. different nesting)

First step: flatten the data into a relation, with no particular nesting

Second step: for each tuple, create entities and attributes according to the target structure

Example

artist      | album.title  | track.title  | num
Norah Jones | Not Too Late | Wish I could | 1
Norah Jones | Not Too Late | Sinkin' soon | 2

[Figure: the original instance is flattened to the relation above, then restructured into the translated instance]
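Continuing the Entity sketch from earlier, the flattening step might be written as below: attribute names are qualified by entity type, and each root-to-leaf path yields one tuple (step two would then regroup these tuples under the target nesting):

def flatten(entity, inherited=None, rows=None):
    """Flatten an FDM tree into relation tuples: each root-to-leaf path
    becomes one row, inheriting the attributes of all its ancestors."""
    if rows is None:
        rows = []
    row = dict(inherited or {})
    for name, value in entity.attributes.items():
        row[entity.type + "." + name] = value
    if entity.children:
        for child in entity.children:
            flatten(child, row, rows)
    else:
        rows.append(row)
    return rows

# For the Norah Jones instance, each song yields one tuple that carries
# the artist and CD attributes down from its ancestors.
flatten(fdm)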

Our Translation Process…

…preserves the semantics of the source instance:

ancestor-descendant relationships between source entities are preserved

sibling relationships between source attributes are preserved

…is unambiguous – there is a unique way of restructuring instances, since our simple data model relates entities through nesting only

Experiments

Goal: produce good mappings

Metric: F-measure (the harmonic mean of precision and recall)

Datasets: [table of data collections not shown in the transcript]
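For reference, the F-measure above is the standard harmonic mean of precision P and recall R:

F = \frac{2 \cdot P \cdot R}{P + R}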

Effectiveness of the Data Fitting Score

The combined score outperformed all individual scores

50 runs with 10 source entities each

Impact of the Size of Source Instance

Movies data collection with 20 runs for each plot

The combined method again outperformed the others, especially for smaller source instances

Impact of the Size of Target Instance

5 runs with 10 source entities each

Curves for simple collections are high regardless of target size

For more complex collections, the curve improves as the size increases

Resilience to Noise

20 runs, each with 10 movies in the source instance

The combined similarity suffers the smallest relative drop, remaining almost perfect even when only 1/3 of the attributes have a match

Conclusion

Our method is particularly attractive for non-expert and casual users

It does not require the data to be stored in database systems, nor the use of special-purpose schema mapping tools

Our data model is simple yet powerful enough for the setting considered

Finally, extensive experimental results with real Web data showed that our approach is effective and very promising

Thank You!
