
Partha Pratim Talukdar (Microsoft Research) Zack Ives (University of Pennsylvania)

Fernando Pereira (Google)

Automatically Incorporating New Sources in Keyword-Search-Based Data Integration

SIGMOD 2010, June 9, 2010

“For (m)any data integration problem, if you don’t involve human, then there is no hope.”

-- AnHai Doan (Yesterday)

Automatic Data Integration

[Figure: tables (data sources) and a user query expressing an information need. One table is marked as one of the few tables to be joined to answer the query. Schema matching (with errors) connects the tables, and a new table is then added.]

End Goal: To be able to pose integrative queries against a growing heterogeneous dataset and get meaningful answers.

The Reality Today

• Multiple steps requiring an expert integrator
– Poll users, create global schema
– Semi-automatically generate schema mappings
  • Fix errors
– Create query forms
  • Fix errors revealed by bad data

• But this doesn’t work well for discovery (ad hoc) queries, e.g., in science
– Too many sources, queries to administer
– Mistakes not revealed until queries posed
– Too many attributes for pairwise schema matching

Q: Query-driven, Admin-Free Integration

[Figure: architecture of Q. Components: Data Sources, New Source, Schema Matching (Alignment), Matching Scores, Schema Graph, View-based Pruning of Matching, Matching Correction, Ranked Query Answering. Input: keyword query ("a b"). Output: results + feedback.]

1. Discovering Schema Matches

Schema Matchers

• Metadata level
– COMA++ [Do and Rahm, 2007]
  • pairwise column comparisons necessary

• Instance level
– Based on Modified Adsorption (MAD) [next slide]
  • random-walk inspired, previously used in NLP problems
  • pairwise column comparisons not necessary
  • parallelizable, suitable for large datasets

Schema Matching using MAD

Example tables:

DB2
GO_ID   Locus
GO30    AT2G34
GO12    AT1G35

DB1
ID      Name
GO12    aco-2
GO25    p3

Third table
Loci
AT2G35
AT1G36

[Figure: graph built from the tables. Value nodes: GO12, P3, GO25, AT2G34, GO30, AT1G35, aco-2, AT1G36. Attribute nodes: DB2.GO_ID, DB2.Locus, DB1.ID, DB1.Name, DB3. Each attribute node carries a unique seed label (L1, L2, L3, L4, L5), and the labels spread through shared value nodes.]

All Labels Propagated in Parallel by MAD
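The graph construction and propagation step can be sketched as follows. This is a minimal illustration using plain iterative label propagation, not the MAD algorithm itself: MAD also uses per-node injection, continuation, and abandonment probabilities, which are left out here. The column name `DB3.Loci` and every function name are assumptions made for the example.

```python
from collections import defaultdict

# Columns from the example tables; "DB3.Loci" is an assumed name.
columns = {
    "DB2.GO_ID": ["GO30", "GO12"],
    "DB2.Locus": ["AT2G34", "AT1G35"],
    "DB1.ID": ["GO12", "GO25"],
    "DB1.Name": ["aco-2", "p3"],
    "DB3.Loci": ["AT2G35", "AT1G36"],
}

def build_graph(columns):
    """Undirected bipartite graph: attribute node <-> each of its value nodes."""
    graph = defaultdict(set)
    for attr, values in columns.items():
        for v in values:
            graph[attr].add("val:" + v)
            graph["val:" + v].add(attr)
    return graph

def propagate(graph, seeds, iterations=4):
    """Spread each attribute's unique seed label to neighbouring nodes.

    Every node ends up with a score per label; a seed node keeps its
    own label clamped at 1.0 in every round.
    """
    scores = {node: {} for node in graph}
    for attr in seeds:
        scores[attr] = {attr: 1.0}
    for _ in range(iterations):
        new_scores = {}
        for node, nbrs in graph.items():
            acc = defaultdict(float)
            for nb in nbrs:
                for label, s in scores[nb].items():
                    acc[label] += s / len(nbrs)
            if node in seeds:
                acc[node] = 1.0  # clamp the seed label
            new_scores[node] = dict(acc)
        scores = new_scores
    return scores

def candidate_matches(scores, seeds):
    """An attribute that received another attribute's label is a candidate match."""
    out = []
    for attr in seeds:
        for label, s in scores[attr].items():
            if label != attr and s > 0:
                out.append((attr, label, s))
    return sorted(out, key=lambda t: -t[2])

graph = build_graph(columns)
seeds = set(columns)
matches = candidate_matches(propagate(graph, seeds), seeds)
```

In this toy input only `DB2.GO_ID` and `DB1.ID` share a value (`GO12`), so they are the only pair proposed. No column is ever compared directly with another column; the match emerges from labels flowing through shared value nodes.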

2. Correcting Matching Errors

Keyword Matched Sources in Q’s Schema Graph

Schema Graph

Edge Cost Encodes User Preference (lower is better)

Matching Error: How can we assign it higher (worse) cost?

Correcting Error: Learning New Edge Costs

[Figure, from Talukdar+, VLDB 2008: a keyword query (Query*) answered by two candidate trees with Cost = 0.4 and Cost = 0.41, each producing tuples; edge cost labels include 0.1 and 0.04. The user gives feedback on answers, which is what the user cares about. After the update, the costs shown are 0.41 and 0.8.]

Decomposition of Edge Cost

Edge between TABLE1 and TABLE2:

Feature Name      Matching Cost   Coefficient (Values Learned)
COMA++ Matched    0.90            wCOMA++
MAD Matched       0.7             wLP
---               ---             ---

Edge Cost = 0.9 * wCOMA++ + 0.7 * wLP
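The decomposition amounts to a weighted sum over matcher features. A minimal sketch, assuming the feature values from the slide and placeholder weights (the weights below are illustrative, not learned values):

```python
def edge_cost(features, weights):
    """Weighted sum of matcher features for one schema-graph edge."""
    return sum(weights[name] * value for name, value in features.items())

# Feature values taken from the slide's example edge.
features = {"COMA++": 0.9, "MAD": 0.7}

# Placeholder coefficients; in Q these are learned from feedback.
weights = {"COMA++": 0.5, "MAD": 0.5}

cost = edge_cost(features, weights)
```

Because the cost is linear in the weights, changing one coefficient rescales the contribution of that matcher on every edge at once.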

Learning: Incorporating User Feedback

• Model feedback incorporation as a constrained optimization problem.

MIRA Algorithm (Crammer et al., 2006)

[Figure: annotated update rule. Annotations: new model parameters, current model parameters, tree cost, tree whose tuples the user likes, tree whose tuples the user doesn’t like.]
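A MIRA-style update can be sketched for the simplest case of one liked tree and one disliked tree. This is a simplified single-constraint version under stated assumptions: a fixed margin, tree features obtained by summing edge features, and hypothetical function names. It is not the exact formulation used in Q.

```python
def tree_features(edges):
    """Sum the feature vectors of all edges in a query tree."""
    total = {}
    for feats in edges:
        for name, value in feats.items():
            total[name] = total.get(name, 0.0) + value
    return total

def cost(weights, feats):
    """Linear cost: dot product of weights and features."""
    return sum(weights.get(n, 0.0) * v for n, v in feats.items())

def mira_update(weights, liked, disliked, margin=0.1):
    """Smallest change to weights so that cost(disliked) - cost(liked) >= margin."""
    names = set(liked) | set(disliked)
    diff = {n: disliked.get(n, 0.0) - liked.get(n, 0.0) for n in names}
    norm_sq = sum(d * d for d in diff.values())
    violation = margin - cost(weights, diff)
    if violation <= 0 or norm_sq == 0:
        return dict(weights)  # constraint already satisfied
    tau = violation / norm_sq
    updated = dict(weights)
    for n, d in diff.items():
        updated[n] = updated.get(n, 0.0) + tau * d
    return updated

# Illustrative feature values, not taken from the talk.
weights = {"COMA++": 0.5, "MAD": 0.5}
liked = tree_features([{"COMA++": 0.2, "MAD": 0.9}])
disliked = tree_features([{"COMA++": 0.9, "MAD": 0.1}])
new_weights = mira_update(weights, liked, disliked)
```

Before the update the disliked tree is cheaper than the liked one; afterwards the liked tree is cheaper by exactly the margin, and the weights have moved no further than needed to achieve that.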

3. Where to Align New Source?

Where to Match a New Source?

[Figure: a schema graph with 5 sources and 2 keywords: term and plasma membrane. Sources shown include InterPro2GO, InterPro Pub, InterProEntry, and InterProEntry 2 Pub, with attributes such as acc, term_id, go_id, entry_ac, pub_id, title, and name. The shaded oval, the keyword cost neighborhood, includes all nodes reachable with cost ≤ 2 from at least one of the keywords. A new source is then added to the graph.]

• The neighborhood is imposed by the cost of the kth best answer.

• Matchings outside this neighborhood are not going to affect the k-best answers (i.e., the current view).

• View Based Aligner: consider only those matchings which are likely to affect query results, as otherwise there will be no feedback from the user.
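The neighborhood computation can be sketched as a cost-bounded shortest-path search from the keyword nodes. The graph below is a hypothetical toy: node names, edges, and costs are invented for illustration, and `alpha` stands in for the cost of the kth best answer.

```python
import heapq

def cost_neighborhood(graph, keyword_nodes, alpha):
    """Nodes reachable from any keyword node with path cost <= alpha (Dijkstra)."""
    dist = {k: 0.0 for k in keyword_nodes}
    heap = [(0.0, k) for k in keyword_nodes]
    heapq.heapify(heap)
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for nbr, c in graph.get(node, {}).items():
            nd = d + c
            if nd <= alpha and nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                heapq.heappush(heap, (nd, nbr))
    return dist

# Hypothetical schema graph: node -> {neighbour: edge cost}.
graph = {
    "kw:term": {"go_term": 0.0},
    "kw:plasma membrane": {"go_term": 0.0},
    "go_term": {"interpro2go": 1.0},
    "interpro2go": {"entry": 1.0},
    "entry": {"entry2pub": 1.0},
    "entry2pub": {"pub": 1.0},
}

neighborhood = cost_neighborhood(
    graph, ["kw:term", "kw:plasma membrane"], alpha=2.0
)
```

Only the relations inside `neighborhood` would be handed to the schema matchers when a new source arrives, which is where the savings over exhaustive pairwise matching come from.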

Experiments

Two questions:

I. Can we repair alignment errors by exploiting user feedback over answers?

II. Can we reduce the number of pairwise comparisons necessary during alignment discovery for a new source?

I. Correcting Schema Matching Errors: Setup

[Figure: Schema Graph (InterPro-GO) with Gold Matchings. Tables: go_term, interpro_interpro2go, interpro_entry2pub, interpro_method2pub, interpro_method, interpro_pub, interpro_entry, interpro_journal.]

• Start with just the tables

• Use automatic schema matchers (e.g., COMA++, MAD)

• Rank matchings based on cost learned from keyword queries and feedback over answers (using Q)

• Compute precision-recall w.r.t. the gold matchings (left figure)
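The evaluation step can be sketched as follows: rank the proposed matchings by learned cost (lower is better) and compute precision and recall after each prefix of the ranking. The matchings, costs, and gold set below are hypothetical placeholders, not data from the experiments.

```python
def precision_recall_curve(ranked, gold):
    """Precision and recall after each prefix of a ranked list of matchings."""
    points = []
    hits = 0
    for i, matching in enumerate(ranked, start=1):
        if matching in gold:
            hits += 1
        points.append((hits / i, hits / len(gold)))
    return points

# Hypothetical matchings (attribute pairs) with learned edge costs.
costs = {
    ("t1.a", "t2.b"): 0.1,
    ("t1.a", "t3.c"): 0.5,
    ("t3.c", "t4.d"): 0.2,
}
gold = {("t1.a", "t2.b"), ("t3.c", "t4.d")}

ranked = sorted(costs, key=costs.get)  # lowest cost first
curve = precision_recall_curve(ranked, gold)
```

Each entry of `curve` is a (precision, recall) pair, so plotting the list gives one precision-recall curve per matcher or per learned model.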

I. Correcting Schema Matching Errors

[Figure: Precision-Recall Plots for Various Methods. x-axis: Recall (0.125 to 1). Series: COMA++, MAD, Q.]

Learning with Q helps correct schema matching errors.

I. Correcting Schema Matching Errors (contd.)

[Figure: Gold vs Non-Gold Edge Costs After Increasing Feedback. x-axis: Feedback Step Number (1 to 39). Series: Avg. Gold Edge Cost, Avg. Non-Gold Edge Cost.]

Learning with Q helps identify the correct (gold) alignments.

II. Reducing Pairwise Comparisons during New Source Integration

[Figure: x-axis: Number of Tables in the Schema Graph (18, 100, 500). y-axis ticks: 5,000 to 20,000. Series: Exhaustive, ViewBasedAligner.]

View Based Aligner Significantly Reduces the Number of Comparisons.

Related Work

• B. Alexe, L. Chiticariu, R. J. Miller, and W.-C. Tan. Muse: Mapping understanding and design by example. In ICDE, 2008.

• L. Chiticariu, P. G. Kolaitis, and L. Popa. Interactive generation of integrated schemas. In SIGMOD, 2008.

• A. Das Sarma, L. Dong, and A. Halevy. Bootstrapping pay-as-you-go data integration systems. In SIGMOD, 2008.

• Fagin+. Clio: Schema mapping creation and data exchange. In Conceptual Modeling: Foundations and Applications, 2009.

• S. R. Jeffery, M. J. Franklin, and A. Y. Halevy. Pay-as-you-go user feedback for dataspace systems. In SIGMOD, 2008.

• Talukdar+. Learning to create data-integrating queries. In VLDB, 2008.

Summary

• A new data-centric schema matching algorithm based on Modified Adsorption (MAD)
– doesn’t require pairwise column comparison, scalable

• A system architecture that
– combines off-the-shelf schema matchers’ alignments
– exploits user feedback over answers to repair matching errors

• Integrates new sources
– through incremental updates to schema matchings

Thank You!

Poster: Tomorrow (Thu), 3:30pm Cosmopolitan AB
