automatically incorporating new sources in keyword-search ... · based data integration sigmod...

Post on 20-May-2020

2 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

Partha Pratim Talukdar (Microsoft Research) Zack Ives (University of Pennsylvania)

Fernando Pereira (Google)

Automatically Incorporating New Sources in Keyword-Search

based Data Integration

SIGMOD 2010, June 9, 2010

“For (m)any data integration problem, if you don’t involve human, then there is no hope.”

2

“For (m)any data integration problem, if you don’t involve human, then there is no hope.”

2

-- AnHai Doan

“For (m)any data integration problem, if you don’t involve human, then there is no hope.”

2

-- AnHai Doan (Yesterday)

Automatic Data Integration

3

Tables(Data Sources)

Automatic Data Integration

3

Tables(Data Sources)

Info. Need

Automatic Data Integration

3

Tables(Data Sources)

One of the few tables to be joined to answer

user query

Info. Need

Automatic Data Integration

3

Tables(Data Sources)

One of the few tables to be joined to answer

user query

Schema Matching(with errors)

Info. Need

Automatic Data Integration

3

Tables(Data Sources)

New Table

One of the few tables to be joined to answer

user query

Schema Matching(with errors)

Info. Need

Automatic Data Integration

3

Tables(Data Sources)

New Table

One of the few tables to be joined to answer

user query

Schema Matching(with errors)

Info. Need

Automatic Data Integration

3

Tables(Data Sources)

End GoalTo be able to pose integrative queries against a

growing heterogeneous dataset and get meaningful answer.

New Table

One of the few tables to be joined to answer

user query

Schema Matching(with errors)

Info. Need

The Reality Today

4

The Reality Today

• Multiple steps requiring expert integrator– Poll users, create global schema– Semi-automatically generate schema mappings

• Fix errors

– Create query forms• Fix errors revealed by bad data

4

The Reality Today

• Multiple steps requiring expert integrator– Poll users, create global schema– Semi-automatically generate schema mappings

• Fix errors

– Create query forms• Fix errors revealed by bad data

• But this doesn’t work well for discovery (ad hoc) queries, e.g., in science– Too many sources, queries to administer– Mistakes not revealed until queries posed– Too many attributes for pairwise schema matching

4

4

d

c

bP

M

G

Data Sources

Q: Query-driven, Admin-Free Integration

4

d

c

bP

M

G

Data Sources

Q: Query-driven, Admin-Free Integration

P b

M

G

c

dMatchingScores

SchemaGraph

Schema Matching

(Alignment)

4

d

c

bP

M

G

Data Sources

Q: Query-driven, Admin-Free Integration

P b

M

G

c

dMatchingScores

SchemaGraph

Schema Matching

(Alignment)

Ranked Query

Answering

“a b”KeywordQuery

Results + feedback

4

d

c

bP

M

G

Data Sources

Q: Query-driven, Admin-Free Integration

P b

M

G

c

dMatchingScores

SchemaGraph

Schema Matching

(Alignment)

MatchingCorrection

Ranked Query

Answering

“a b”KeywordQuery

Results + feedback

4

d

c

bP

M

G

Data Sources

Q: Query-driven, Admin-Free Integration

NNewSource

P b

M

G

c

dMatchingScores

SchemaGraph

Schema Matching

(Alignment)

MatchingCorrection

Ranked Query

Answering

“a b”KeywordQuery

Results + feedback

4

d

c

bP

M

G

Data Sources

Q: Query-driven, Admin-Free Integration

NNewSource

P b

M

G

c

dMatchingScores

SchemaGraph

View-based Pruning ofMatching

Schema Matching

(Alignment)

MatchingCorrection

Ranked Query

Answering

“a b”KeywordQuery

Results + feedback

4

d

c

bP

M

G

Data Sources

1. Discovering Schema Matches

NNewSource

P b

M

G

c

dMatchingScores

SchemaGraph

View-based Pruning ofMatching

Schema Matching

(Alignment)

MatchingCorrection

Ranked Query

Answering

“a b”KeywordQuery

Results + feedback

1. Discovering Schema Matches

Schema Matchers

1. Discovering Schema Matches

• Metadata Level– COMA++ [Do and Rahm, 2007]

• pairwise column comparisons necessary

COMA++

Schema Matchers

1. Discovering Schema Matches

• Metadata Level– COMA++ [Do and Rahm, 2007]

• pairwise column comparisons necessary

• Instance Level– Based on Modified Adsorption (MAD) [next slide]

• random-walk inspired, previously used in NLP problems• pairwise column comparisons not necessary • parallelizable, suitable for large datasets

COMA++

.

.

.

MAD

Schema Matchers

Schema Matching using MAD

GO_ID Locus

GO30 AT2G34

GO12 AT1G35

DB2ID Name

GO12 aco-2

GO25 p3

DB1Loci

AT2G35

AT1G36

DB3

Schema Matching using MAD

GO_ID Locus

GO30 AT2G34

GO12 AT1G35

DB2ID Name

GO12 aco-2

GO25 p3

DB1Loci

AT2G35

AT1G36

DB3

Value Node

Attribute Node

GO12 P3GO25 AT2G34 GO30 AT1G35 aco-2

DB2.GO_ID

DB2.LocusDB1.ID DB1.Name DB3.

Loci

AT1G36

Schema Matching using MAD

GO_ID Locus

GO30 AT2G34

GO12 AT1G35

DB2ID Name

GO12 aco-2

GO25 p3

DB1Loci

AT2G35

AT1G36

DB3

GO12 P3GO25 AT2G34 GO30 AT1G35 aco-2

DB2.GO_ID

DB2.LocusDB1.ID DB1.Name DB3.

Loci

AT1G36

L1 L2 L3 L4 L5

Seed Label (unique)

Schema Matching using MAD

GO_ID Locus

GO30 AT2G34

GO12 AT1G35

DB2ID Name

GO12 aco-2

GO25 p3

DB1Loci

AT2G35

AT1G36

DB3

GO12 P3GO25 AT2G34 GO30 AT1G35 aco-2

DB2.GO_ID

DB2.LocusDB1.ID DB1.Name DB3.

Loci

AT1G36

L1 L2 L3 L4 L5

L3L1

Seed Label (unique)

Schema Matching using MAD

GO_ID Locus

GO30 AT2G34

GO12 AT1G35

DB2ID Name

GO12 aco-2

GO25 p3

DB1Loci

AT2G35

AT1G36

DB3

GO12 P3GO25 AT2G34 GO30 AT1G35 aco-2

DB2.GO_ID

DB2.LocusDB1.ID DB1.Name DB3.

Loci

AT1G36

L1 L2 L3 L4 L5

L3L1

L1L3

L4L5

L5L4L2

Seed Label (unique)

All Labels Propagated in Parallel by MAD

4

d

c

bP

M

G

Data Sources

2. Correcting Matching Errors

NNewSource

P b

M

G

c

dMatchingScores

SchemaGraph

View-based Pruning ofMatching

Schema Matching

(Alignment)

MatchingCorrection

Ranked Query

Answering

“a b”KeywordQuery

Results + feedback

2. Correcting Matching Errors

10

P b

M

G

c

d

Keyword Matched Sources in Q’s Schema Graph

2. Correcting Matching Errors

10

P b

M

G

c

d

Schema Graph

Keyword Matched Sources in Q’s Schema Graph

2. Correcting Matching Errors

10

0.07

0.1

0.1

0.1

0.04

0.1

P b

M

G

c

d

Schema Graph

Edge Cost Encodes User Preference (lower is better)

Keyword Matched Sources in Q’s Schema Graph

2. Correcting Matching Errors

10

0.07

0.1

0.1

0.1

0.04

0.1

P b

M

G

c

d

Schema Graph

Edge Cost Encodes User Preference (lower is better)

Matching Error: How can we assign it higher (worse) cost?

Correcting Error: Learning New Edge Costs

.

.

.

Top

Bottom

0.1

0.1

0.1

0.1

b

d G

M

P

0.07

0.1

0.10.04

0.1

b c

d G

M

P

11

[Talukdar+, VLDB 2008]

Cost= 0.4

Cost= 0.41

Correcting Error: Learning New Edge Costs

Query

Query*

.

.

.

Query

.

.

.

Top

Bottom

0.1

0.1

0.1

0.1

b

d G

M

P

0.07

0.1

0.10.04

0.1

b c

d G

M

P

11

[Talukdar+, VLDB 2008]

Cost= 0.4

Cost= 0.41

Correcting Error: Learning New Edge Costs

Query

Query*

.

.

.

Query

.

.

.

Tuples

.

.

.

Top

Bottom

0.1

0.1

0.1

0.1

b

d G

M

P

0.07

0.1

0.10.04

0.1

b c

d G

M

P

11

[Talukdar+, VLDB 2008]

Cost= 0.4

Cost= 0.41

Correcting Error: Learning New Edge Costs

Query

Query*

.

.

.

Query

.

.

.

Tuples

.

.

.

Top

Bottom

feedback on answers, which is what the user cares about

0.1

0.1

0.1

0.1

b

d G

M

P

0.07

0.1

0.10.04

0.1

b c

d G

M

P

11

[Talukdar+, VLDB 2008]

Cost= 0.4

Cost= 0.41

Correcting Error: Learning New Edge Costs

Query

Query*

.

.

.

Query

.

.

.

Tuples

.

.

.

Top

Bottom

updated cost

0.1

0.1

0.1

0.1

b

d G

M

P

0.07

0.1

0.10.04

0.1

b c

d G

M

P

0.5

11

Cost= 0.41

Cost= 0.8

Decomposition of Edge Cost

12

TABLE1 TABLE 2

Decomposition of Edge Cost

12

FeatureName

Matching Cost

Coefficient(Values Learned)

COMA++ Matched 0.90 wCOMA++

MAD Matched 0.7 wLP

--- --- ---

TABLE1 TABLE 2

Decomposition of Edge Cost

12

FeatureName

Matching Cost

Coefficient(Values Learned)

COMA++ Matched 0.90 wCOMA++

MAD Matched 0.7 wLP

--- --- ---

TABLE1 TABLE 2

Edge Cost = 0.9 * WCOMA++ + 0.7 * WLP

Decomposition of Edge Cost

12

FeatureName

Matching Cost

Coefficient(Values Learned)

COMA++ Matched 0.90 wCOMA++

MAD Matched 0.7 wLP

--- --- --- Learned

TABLE1 TABLE 2

Edge Cost = 0.9 * WCOMA++ + 0.7 * WLP

Learning: Incorporating User Feedback

• Model feedback incorporation as a constrained optimization problem.

13

Learning: Incorporating User Feedback

MIRA Algorithm(Crammer et al., 2006)

• Model feedback incorporation as a constrained optimization problem.

13

Learning: Incorporating User Feedback

MIRA Algorithm(Crammer et al., 2006)

• Model feedback incorporation as a constrained optimization problem.

13

New Model

Parameters

CurrentModel

Parameters

Learning: Incorporating User Feedback

MIRA Algorithm(Crammer et al., 2006)

• Model feedback incorporation as a constrained optimization problem.

13

New Model

Parameters

CurrentModel

Parameters

Tree Cost

Loss

Learning: Incorporating User Feedback

MIRA Algorithm(Crammer et al., 2006)

Tree whose tuples user likes

Tree whose tuples user doesn’t like.

• Model feedback incorporation as a constrained optimization problem.

13

New Model

Parameters

CurrentModel

Parameters

Tree Cost

Loss

4

d

c

bP

M

G

Data Sources

3. Where to Align New Source?

NNewSource

P b

M

G

c

dMatchingScores

SchemaGraph

View-based Pruning ofMatching

Schema Matching

(Alignment)

MatchingCorrection

Ranked Query

Answering

“a b”KeywordQuery

Results + feedback

15

Where to Match a New Source?

3

15

GO

acc term_id

InterPro2GO

go_id entry_ac

InterPro Pub

pub_idtitle

InterProEntry

name entry_ac

InterProEntry 2 Pub

entry_ac pub_id

Keyword CostNeighborhood

plasma membrane

term

0 0 0 0 0 0

0 00 0

1

0.5

0.25

0.25

2

2

22

A schema graph with 5 sources and 2 keywords: term and plasma membrane. The shaded oval includes all nodes reachable with cost ≤ 2 from at least one of the keywords.

Where to Match a New Source?

3

Keywords

15

GO

acc term_id

InterPro2GO

go_id entry_ac

InterPro Pub

pub_idtitle

InterProEntry

name entry_ac

InterProEntry 2 Pub

entry_ac pub_id

Keyword CostNeighborhood

plasma membrane

term

0 0 0 0 0 0

0 00 0

1

0.5

0.25

0.25

2

2

22

A schema graph with 5 sources and 2 keywords: term and plasma membrane. The shaded oval includes all nodes reachable with cost ≤ 2 from at least one of the keywords.

Where to Match a New Source?

3

Neighborhood imposed by cost of

kth best answer.

Keywords

15

GO

acc term_id

InterPro2GO

go_id entry_ac

InterPro Pub

pub_idtitle

InterProEntry

name entry_ac

InterProEntry 2 Pub

entry_ac pub_id

Keyword CostNeighborhood

plasma membrane

term

0 0 0 0 0 0

0 00 0

1

0.5

0.25

0.25

2

2

22

A schema graph with 5 sources and 2 keywords: term and plasma membrane. The shaded oval includes all nodes reachable with cost ≤ 2 from at least one of the keywords.

Where to Match a New Source?

?New

Source

3

Neighborhood imposed by cost of

kth best answer.

Keywords

15

GO

acc term_id

InterPro2GO

go_id entry_ac

InterPro Pub

pub_idtitle

InterProEntry

name entry_ac

InterProEntry 2 Pub

entry_ac pub_id

Keyword CostNeighborhood

plasma membrane

term

0 0 0 0 0 0

0 00 0

1

0.5

0.25

0.25

2

2

22

A schema graph with 5 sources and 2 keywords: term and plasma membrane. The shaded oval includes all nodes reachable with cost ≤ 2 from at least one of the keywords.

Where to Match a New Source?

?New

Source

3

Matchings outside this neighborhood is not going to affect k-best answers

(i.e., current view).

Neighborhood imposed by cost of

kth best answer.

Keywords

15

GO

acc term_id

InterPro2GO

go_id entry_ac

InterPro Pub

pub_idtitle

InterProEntry

name entry_ac

InterProEntry 2 Pub

entry_ac pub_id

Keyword CostNeighborhood

plasma membrane

term

0 0 0 0 0 0

0 00 0

1

0.5

0.25

0.25

2

2

22

A schema graph with 5 sources and 2 keywords: term and plasma membrane. The shaded oval includes all nodes reachable with cost ≤ 2 from at least one of the keywords.

Where to Match a New Source?

?New

Source

3

View Based AlignerConsider only those matchings which are likely to affect query

results, as otherwise there will be no feedback from user.

Matchings outside this neighborhood is not going to affect k-best answers

(i.e., current view).

Neighborhood imposed by cost of

kth best answer.

Keywords

Experiments

16

Experiments

Two questions:

16

Experiments

Two questions:I. Can we repair alignment errors by exploiting

user feedback over answers?

16

Experiments

Two questions:I. Can we repair alignment errors by exploiting

user feedback over answers?

II.Can we reduce the number of pairwise comparisons necessary during alignment discovery for new source?

16

1. Correcting Schema Matching Errors: Setup

17

1. Correcting Schema Matching Errors: Setup

17

go_term

interpro_interpro2go

interpro_entry2pub interpro_method2pub

interpro_methodinterpro_pubinterpro_entry

interpro_journal

Schema Graph (InterPro-GO) with Gold Matchings

1. Correcting Schema Matching Errors: Setup

17

go_term

interpro_interpro2go

interpro_entry2pub interpro_method2pub

interpro_methodinterpro_pubinterpro_entry

interpro_journal

Schema Graph (InterPro-GO) with Gold Matchings

• Start with just the tables

1. Correcting Schema Matching Errors: Setup

17

go_term

interpro_interpro2go

interpro_entry2pub interpro_method2pub

interpro_methodinterpro_pubinterpro_entry

interpro_journal

Schema Graph (InterPro-GO) with Gold Matchings

• Start with just the tables

• Use automatic schema matchers (e.g., COMA++, MAD)

1. Correcting Schema Matching Errors: Setup

17

go_term

interpro_interpro2go

interpro_entry2pub interpro_method2pub

interpro_methodinterpro_pubinterpro_entry

interpro_journal

Schema Graph (InterPro-GO) with Gold Matchings

• Start with just the tables

• Use automatic schema matchers (e.g., COMA++, MAD)

• Rank matchings based on cost learned from keyword queries and feedback over answers (using Q)

1. Correcting Schema Matching Errors: Setup

17

go_term

interpro_interpro2go

interpro_entry2pub interpro_method2pub

interpro_methodinterpro_pubinterpro_entry

interpro_journal

Schema Graph (InterPro-GO) with Gold Matchings

• Start with just the tables

• Use automatic schema matchers (e.g., COMA++, MAD)

• Rank matchings based on cost learned from keyword queries and feedback over answers (using Q)

• Compute precision-recall w.r.t. the gold matchings (left figure)

I. Correcting Schema Matching Errors

18

I. Correcting Schema Matching Errors

18

0.25

0.438

0.625

0.813

1

0.125 0.25 0.375 0.5 0.625 0.75 0.875 1

Precision-Recall Plots for Various Methods

Pre

cisi

on

Recall

COMA++ MAD Q

I. Correcting Schema Matching Errors

18

0.25

0.438

0.625

0.813

1

0.125 0.25 0.375 0.5 0.625 0.75 0.875 1

Precision-Recall Plots for Various Methods

Pre

cisi

on

Recall

COMA++ MAD Q

Learning with Q helps correct schema

matching errors.

I. Correcting Schema Matching Errors (contd.)

19

I. Correcting Schema Matching Errors (contd.)

19

0.15

1.363

2.575

3.788

5

1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39

Gold vs Non-Gold Edge Costs After Increasing Feedback

Ave

rage

Ed

ge C

ost

s (L

ow

er

is B

ett

er)

Feedback Step Number

Avg. Gold Edge CostAvg. Non-Gold Edge Cost

I. Correcting Schema Matching Errors (contd.)

19

0.15

1.363

2.575

3.788

5

1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39

Gold vs Non-Gold Edge Costs After Increasing Feedback

Ave

rage

Ed

ge C

ost

s (L

ow

er

is B

ett

er)

Feedback Step Number

Avg. Gold Edge CostAvg. Non-Gold Edge Cost

Learning with Q helps identify the correct (gold) alignments.

20

II. Reducing Pairwise Comparisons during New Source Integration

20

II. Reducing Pairwise Comparisons during New Source Integration

0

5000.0

10000.0

15000.0

20000.0

18 100 500

# P

air

wis

e C

om

pari

son

s

Number of Tables in the Schema Graph

Exhaustive ViewBasedAligner

20

II. Reducing Pairwise Comparisons during New Source Integration

0

5000.0

10000.0

15000.0

20000.0

18 100 500

# P

air

wis

e C

om

pari

son

s

Number of Tables in the Schema Graph

Exhaustive ViewBasedAligner

View Based Aligner Significantly Reduces the Number of Comparisons.

Related Work• B. Alexe, L. Chiticariu, R. J. Miller, and W.-C. Tan. Muse: Mapping

understand- ing and design by example. In ICDE 2008

• Laura Chiticariu, Phokion G. Kolaitis, Lucian Popa: "Interactive Generation of Integrated Schemas". SIGMOD Conference 2008

• Anish Das Sarma, Luna Dong, Alon Halevy. Bootstrapping Pay-As-You-Go Data Integration System. SIGMOD 2008

• Fagin+, Clio: Schema Mapping Creation and Data Exchange. Conceptual Modeling: Foundations and Applications 2009

• S.R. Jeffery, M.J. Franklin, and A.Y. Halevy. Pay-as-you-go user feedback for dataspace systems. In SIGMOD, 2008

• Talukdar+. Learning to create data-integrating queries. In VLDB, 2008.

21

Summary

22

Summary

• A new data-centric schema matching algorithm based on Modified Adsorption (MAD)– doesn’t require pairwise column comparison, scalable

22

Summary

• A new data-centric schema matching algorithm based on Modified Adsorption (MAD)– doesn’t require pairwise column comparison, scalable

• A system architecture that– combines off-the-shelf schema matchers’ alignments– exploits user feedback over answers to repair matching

errors

22

Summary

• A new data-centric schema matching algorithm based on Modified Adsorption (MAD)– doesn’t require pairwise column comparison, scalable

• A system architecture that– combines off-the-shelf schema matchers’ alignments– exploits user feedback over answers to repair matching

errors

• Integrates new sources– through incremental updates to schema matchings

22

Thank You!

Poster: Tomorrow (Thu), 3:30pm Cosmopolitan AB

top related