Data Fusion Jens Bleiholder and Felix Naumann Presented by Aaron Stewart

Post on 19-Dec-2015

Data Fusion

Jens Bleiholder and Felix Naumann

Presented by Aaron Stewart

Data Integration

• Schema mapping

• Duplicate detection

• Data fusion

Complete / Concise

• Analogous to recall (completeness) and precision (conciseness)

• Complete: coverage of real-world objects

• Concise: avoid duplicates
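The two measures can be made concrete with a small sketch. It assumes completeness is the number of unique real-world objects in the result divided by the number of objects in the universe, and conciseness is the number of unique objects divided by the number of records in the result. The `object_id` field, the sample records, and the universe size of 5 are all hypothetical.

```python
def completeness(records, universe_size):
    """Coverage of real-world objects: unique objects / objects in the universe."""
    unique_objects = {r["object_id"] for r in records}
    return len(unique_objects) / universe_size


def conciseness(records):
    """Freedom from duplicates: unique objects / all records in the result."""
    unique_objects = {r["object_id"] for r in records}
    return len(unique_objects) / len(records)


# Hypothetical integrated result: 4 records describing 3 distinct students,
# drawn from a universe assumed to contain 5 students.
records = [
    {"object_id": "s1", "name": "Alice"},
    {"object_id": "s1", "name": "Alice B."},  # duplicate representation of s1
    {"object_id": "s2", "name": "Bob"},
    {"object_id": "s3", "name": "Carol"},
]

print(completeness(records, universe_size=5))  # 0.6
print(conciseness(records))                    # 0.75
```

Removing the duplicate record would raise conciseness to 1.0 without changing completeness, which is why the two measures are reported separately.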

Conflicts

• Schematic conflicts

• Identity conflicts

• Data conflicts

  – Uncertainty

  – Contradiction
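A minimal sketch of the two kinds of data conflict, assuming that uncertainty means a missing (NULL) value on one side and a known value on the other, while contradiction means two different known values. The function name and the use of `None` for NULL are illustrative choices, not part of the slides.

```python
def classify_conflict(a, b):
    """Classify two values that describe the same attribute of the same object.

    None stands for a missing (NULL) value.
    """
    if a == b:
        return "no conflict"
    if a is None or b is None:
        return "uncertainty"    # one source knows the value, the other does not
    return "contradiction"      # both sources know a value, and they differ


print(classify_conflict(24, 24))    # no conflict
print(classify_conflict(24, None))  # uncertainty
print(classify_conflict(24, 25))    # contradiction
```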

Data Fusion Strategies

Uniqueness

• Uniqueness-preserving

• Uniqueness-enforcing

Value preservation

• Value-preserving

• Non-value-preserving

• Object-preserving

Motivating Example

Joins

• Equi-join

• Natural join

• Full outer join
  – Key join
  – Left join
  – Right join

Equi-join

SELECT U1.Name, U2.Name, U1.Age, U2.Age, U1.Status, U2.Status,
       U1.Address, U2.Address, U1.Field, U2.Field, U1.Library, U2.Phone
FROM U1 JOIN U2 ON U1.Name = U2.Name

Equi-join Result

SELECT U1.Name, U2.Name, U1.Age, U2.Age, U1.Status, U2.Status,

U1.Address, U2.Address, U1.Field, U2.Field, U1.Library, U2.Phone

FROM U1 JOIN U2 ON U1.Name=U2.Name
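The result table from this slide is not preserved in the transcript. As a stand-in, here is a small sketch of what an equi-join on Name does. The rows are hypothetical, and the columns are trimmed to Name, Age, and one source-specific column each (Library in U1, Phone in U2).

```python
# Hypothetical rows; the real U1 and U2 tables are not reproduced in this transcript.
U1 = [
    {"Name": "Alice", "Age": 22, "Library": "L-100"},
    {"Name": "Bob", "Age": 24, "Library": "L-200"},
    {"Name": "Carol", "Age": 21, "Library": "L-300"},
]
U2 = [
    {"Name": "Alice", "Age": 22, "Phone": "555-0100"},
    {"Name": "Bob", "Age": 25, "Phone": "555-0200"},
    {"Name": "Dave", "Age": 30, "Phone": "555-0400"},
]


def equi_join(left, right, key):
    """Pair every left row with every right row that has an equal, non-NULL key."""
    result = []
    for l in left:
        for r in right:
            if l[key] is not None and l[key] == r[key]:
                row = {f"U1.{col}": val for col, val in l.items()}
                row.update({f"U2.{col}": val for col, val in r.items()})
                result.append(row)
    return result


rows = equi_join(U1, U2, "Name")
for row in rows:
    print(row)
```

With this sample data only Alice and Bob survive. Carol and Dave are dropped because they appear in one source only, and Bob's two ages (24 and 25) sit side by side in separate columns, unresolved.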

Natural Join

SELECT U1.Name, U1.Age, U1.Status, U1.Address, U1.Field, U1.Library, U2.Phone
FROM U1 JOIN U2 ON U1.Name = U2.Name AND U1.Age = U2.Age
     AND U1.Status = U2.Status AND U1.Address = U2.Address
     AND U1.Field = U2.Field

Natural Join Result

SELECT U1.Name, U1.Age, U1.Status, U1.Address, U1.Field, U1.Library, U2.Phone
FROM U1 JOIN U2 ON U1.Name = U2.Name AND U1.Age = U2.Age
     AND U1.Status = U2.Status AND U1.Address = U2.Address
     AND U1.Field = U2.Field
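The result table for this slide is also missing from the transcript. The sketch below illustrates the behavior of a natural join, which matches on every column the two tables share and emits each shared column once. The sample rows are hypothetical and use a reduced column set (Name, Age, plus Library or Phone).

```python
# Hypothetical rows; the real U1 and U2 tables are not reproduced in this transcript.
U1 = [
    {"Name": "Alice", "Age": 22, "Library": "L-100"},
    {"Name": "Bob", "Age": 24, "Library": "L-200"},
    {"Name": "Carol", "Age": 21, "Library": "L-300"},
]
U2 = [
    {"Name": "Alice", "Age": 22, "Phone": "555-0100"},
    {"Name": "Bob", "Age": 25, "Phone": "555-0200"},
    {"Name": "Dave", "Age": 30, "Phone": "555-0400"},
]


def natural_join(left, right):
    """Join on all columns the two tables have in common (non-NULL and equal)."""
    common = [col for col in left[0] if col in right[0]]
    result = []
    for l in left:
        for r in right:
            if all(l[c] is not None and l[c] == r[c] for c in common):
                result.append({**l, **r})  # shared columns appear only once
    return result


print(natural_join(U1, U2))
```

Here only Alice remains. Bob is lost because his ages disagree, so any data conflict in a shared column removes the object from the result entirely.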

Full Outer Join

SELECT U1.Name, U2.Name, U1.Age, U2.Age, U1.Status, U2.Status,
       U1.Address, U2.Address, U1.Field, U2.Field, U1.Library, U2.Phone
FROM U1 FULL OUTER JOIN U2 ON U1.Name = U2.Name

Full Outer Join Result

SELECT U1.Name, U2.Name, U1.Age, U2.Age, U1.Status, U2.Status,
       U1.Address, U2.Address, U1.Field, U2.Field, U1.Library, U2.Phone
FROM U1 FULL OUTER JOIN U2 ON U1.Name = U2.Name
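Since the result table is not preserved here, the following sketch shows how a full outer join on Name behaves. The rows are hypothetical and the column set is reduced to Name, Age, and Library or Phone; the helper names are illustrative.

```python
# Hypothetical rows; the real U1 and U2 tables are not reproduced in this transcript.
U1 = [
    {"Name": "Alice", "Age": 22, "Library": "L-100"},
    {"Name": "Bob", "Age": 24, "Library": "L-200"},
    {"Name": "Carol", "Age": 21, "Library": "L-300"},
]
U2 = [
    {"Name": "Alice", "Age": 22, "Phone": "555-0100"},
    {"Name": "Bob", "Age": 25, "Phone": "555-0200"},
    {"Name": "Dave", "Age": 30, "Phone": "555-0400"},
]


def prefixed(row, cols, prefix):
    """Prefix column names; pad with None (NULL) when the row is missing."""
    return {f"{prefix}.{c}": (row[c] if row is not None else None) for c in cols}


def full_outer_join(left, right, key):
    """Keep matched pairs plus every unmatched row from either side."""
    left_cols, right_cols = list(left[0]), list(right[0])
    result, matched_right = [], set()
    for l in left:
        found = False
        for i, r in enumerate(right):
            if l[key] is not None and l[key] == r[key]:
                result.append({**prefixed(l, left_cols, "U1"),
                               **prefixed(r, right_cols, "U2")})
                matched_right.add(i)
                found = True
        if not found:  # left row without a partner: pad the U2 columns
            result.append({**prefixed(l, left_cols, "U1"),
                           **prefixed(None, right_cols, "U2")})
    for i, r in enumerate(right):
        if i not in matched_right:  # right row without a partner: pad U1 columns
            result.append({**prefixed(None, left_cols, "U1"),
                           **prefixed(r, right_cols, "U2")})
    return result


rows = full_outer_join(U1, U2, "Name")
for row in rows:
    print(row["U1.Name"], row["U2.Name"], row["U1.Age"], row["U2.Age"])
```

All four students are kept, so the result is complete, but Bob's conflicting ages are still unresolved and the padded rows introduce NULLs.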

Full Disjunction

• Generalizes outer join to more than two tables

Information Systems for Data Fusion

1. Conflict resolution

2. Conflict avoidance

3. Conflict ignorance

4. No conflict handling
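One way to picture the first three classes is by how each treats a single conflicting attribute. The sketch below is an illustration under assumptions: the function names, the source labels U1 to U3, and the specific policies (prefer one source, majority vote) are hypothetical examples of each class, not the behavior of any particular system.

```python
from collections import Counter

# Hypothetical: three sources report an age for the same person.
ages = {"U1": 24, "U2": 25, "U3": 25}


def ignore_conflict(values_by_source):
    """Conflict ignorance: make no decision and pass every distinct value on."""
    return list(dict.fromkeys(values_by_source.values()))


def avoid_conflict(values_by_source, preferred_source):
    """Conflict avoidance: decide without inspecting the conflicting values,
    here by always trusting one preferred source."""
    return values_by_source[preferred_source]


def resolve_conflict(values_by_source):
    """Conflict resolution: inspect the values and pick one, here by majority."""
    return Counter(values_by_source.values()).most_common(1)[0][0]


print(ignore_conflict(ages))       # [24, 25]
print(avoid_conflict(ages, "U1"))  # 24
print(resolve_conflict(ages))      # 25
```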

Architecture

• Database management system (DBMS)

• Multidatabase management system (MDBMS)

• Mediator-wrapper (MW)

• Multi-agent system (MAS)

• Stand-alone application (APP)

Integration Model

• Global-as-view (GaV)

• Local-as-view (LaV)

• Global-Local-as-view (GLaV)

1. Conflict-Resolving Systems

• Multibase

• Hermes

• Fusionplex

• HumMer

• Ajax

Multibase

• Circa 1983

• Solution:
  – Outer join
  – Aggregation (min, max, sum, choose, etc.)
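The outer-join-plus-aggregation idea can be sketched as follows: after the outer join places the two sources' values side by side, a named aggregate collapses them into one. This is a simplified illustration, not Multibase's actual syntax. In particular, `choose` is assumed here to mean "take the first source that has a value".

```python
# Aggregates applied to the non-NULL values of one attribute.
# "choose" is an assumption: it takes the value of the first source that has one.
AGGREGATES = {
    "min": min,
    "max": max,
    "sum": sum,
    "choose": lambda values: values[0],
}


def resolve(values, how):
    """Collapse the per-source values of one attribute into a single value."""
    present = [v for v in values if v is not None]  # ignore NULL padding
    if not present:
        return None
    return AGGREGATES[how](present)


# Hypothetical outer-join row for Bob: U1.Age = 24, U2.Age = 25.
print(resolve([24, 25], "max"))     # 25
print(resolve([24, 25], "choose"))  # 24
print(resolve([None, 30], "min"))   # 30
```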

Hermes

• HEterogeneous Reasoning and MEdiator System

• Circa 1996

• Mediator-specified conflict resolution
  – Created by an expert

Fusionplex

• Multiplex, Fusionplex, Autoplex

• Classifies quality of data

• User-prioritized feature “importance”

• Able to incorporate new/unknown databases

HumMer

• Humboldt-Merger

• Circa 2006

• Handles conflicts in schema, identity, data

• Clusters duplicates

• User-defined aggregation functions
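The last two bullets can be pictured as a group-and-fuse step: records already assigned to the same duplicate cluster are merged, with a user-supplied function per attribute. The sketch assumes cluster IDs come from an earlier duplicate-detection step; the `cluster` field, the `fuse` function, and the `longest` resolver are hypothetical and do not reflect HumMer's real interface.

```python
from collections import defaultdict


def fuse(records, cluster_key, resolvers):
    """Merge each cluster of duplicates into one record, attribute by attribute."""
    clusters = defaultdict(list)
    for rec in records:
        clusters[rec[cluster_key]].append(rec)

    fused = []
    for cluster_id, members in clusters.items():
        row = {cluster_key: cluster_id}
        for attr, resolve in resolvers.items():
            values = [m[attr] for m in members if m.get(attr) is not None]
            row[attr] = resolve(values) if values else None
        fused.append(row)
    return fused


def longest(values):
    """User-defined resolver: prefer the most detailed (longest) string."""
    return max(values, key=len)


# Hypothetical records with cluster IDs from a prior duplicate-detection step.
records = [
    {"cluster": 1, "Name": "Bob", "Age": 24},
    {"cluster": 1, "Name": "Bob Smith", "Age": 25},
    {"cluster": 2, "Name": "Carol", "Age": None},
]

print(fuse(records, "cluster", {"Name": longest, "Age": max}))
```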

Ajax

• Format and unit conversion

• User-defined cleansing process
  – Compiled to Java

2. Conflict-Avoiding Systems

• TSIMMIS

• SIMS and Ariadne

• Infomix

• HIPPO

• ConQuer

• Rainbow

3. Conflict-Ignoring Systems

• Pegasus

• Nimble

• Carnot

• InfoSleuth

• Potter’s Wheel

Other Systems

• Research Systems
  – Trio
  – Information Manifold
  – Garlic
  – Disco (Distributed Information Search Component)
  – Papyrus, Nomenclature
  – DIOM, KOMET, Infomaster, Occam, SIMS, Internet Softbot
  – Singapore, Magic, Observer
  – Lore, Tukwila
  – SIRIUS-DELTA, DDTS, Mermaid, UNIBASE
  – MRDSM, OMNIBASE, CALIDA, DQS

Other Systems

• Commercial
  – IBM, Oracle, Microsoft, others
  – IBM Information Server (IIS)
  – Microsoft SQL Server Integration Services (SSIS)

Other Systems

• Peer Data Management Systems
  – Orchestra
  – Hyper

Analysis

• Weaknesses
  – Difficult to show utility of a tool on paper

• Strengths
  – Covered a lot of theory
  – Covered a lot of systems