automatically incorporating new sources in keyword-search ... · based data integration sigmod...

81
Partha Pratim Talukdar (Microsoft Research) Zack Ives (University of Pennsylvania) Fernando Pereira (Google) Automatically Incorporating New Sources in Keyword-Search based Data Integration SIGMOD 2010, June 9, 2010

Upload: others

Post on 20-May-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Automatically Incorporating New Sources in Keyword-Search ... · based Data Integration SIGMOD 2010, June 9, 2010 ... – Mistakes not revealed until queries posed – Too many attributes

Partha Pratim Talukdar (Microsoft Research) Zack Ives (University of Pennsylvania)

Fernando Pereira (Google)

Automatically Incorporating New Sources in Keyword-Search

based Data Integration

SIGMOD 2010, June 9, 2010

Page 2: Automatically Incorporating New Sources in Keyword-Search ... · based Data Integration SIGMOD 2010, June 9, 2010 ... – Mistakes not revealed until queries posed – Too many attributes

“For (m)any data integration problem, if you don’t involve human, then there is no hope.”

2

Page 3: Automatically Incorporating New Sources in Keyword-Search ... · based Data Integration SIGMOD 2010, June 9, 2010 ... – Mistakes not revealed until queries posed – Too many attributes

“For (m)any data integration problem, if you don’t involve human, then there is no hope.”

2

-- AnHai Doan

Page 4: Automatically Incorporating New Sources in Keyword-Search ... · based Data Integration SIGMOD 2010, June 9, 2010 ... – Mistakes not revealed until queries posed – Too many attributes

“For (m)any data integration problem, if you don’t involve human, then there is no hope.”

2

-- AnHai Doan (Yesterday)

Page 5: Automatically Incorporating New Sources in Keyword-Search ... · based Data Integration SIGMOD 2010, June 9, 2010 ... – Mistakes not revealed until queries posed – Too many attributes

Automatic Data Integration

3

Tables(Data Sources)

Page 6: Automatically Incorporating New Sources in Keyword-Search ... · based Data Integration SIGMOD 2010, June 9, 2010 ... – Mistakes not revealed until queries posed – Too many attributes

Automatic Data Integration

3

Tables(Data Sources)

Info. Need

Page 7: Automatically Incorporating New Sources in Keyword-Search ... · based Data Integration SIGMOD 2010, June 9, 2010 ... – Mistakes not revealed until queries posed – Too many attributes

Automatic Data Integration

3

Tables(Data Sources)

One of the few tables to be joined to answer

user query

Info. Need

Page 8: Automatically Incorporating New Sources in Keyword-Search ... · based Data Integration SIGMOD 2010, June 9, 2010 ... – Mistakes not revealed until queries posed – Too many attributes

Automatic Data Integration

3

Tables(Data Sources)

One of the few tables to be joined to answer

user query

Schema Matching(with errors)

Info. Need

Page 9: Automatically Incorporating New Sources in Keyword-Search ... · based Data Integration SIGMOD 2010, June 9, 2010 ... – Mistakes not revealed until queries posed – Too many attributes

Automatic Data Integration

3

Tables(Data Sources)

New Table

One of the few tables to be joined to answer

user query

Schema Matching(with errors)

Info. Need

Page 10: Automatically Incorporating New Sources in Keyword-Search ... · based Data Integration SIGMOD 2010, June 9, 2010 ... – Mistakes not revealed until queries posed – Too many attributes

Automatic Data Integration

3

Tables(Data Sources)

New Table

One of the few tables to be joined to answer

user query

Schema Matching(with errors)

Info. Need

Page 11: Automatically Incorporating New Sources in Keyword-Search ... · based Data Integration SIGMOD 2010, June 9, 2010 ... – Mistakes not revealed until queries posed – Too many attributes

Automatic Data Integration

3

Tables(Data Sources)

End GoalTo be able to pose integrative queries against a

growing heterogeneous dataset and get meaningful answer.

New Table

One of the few tables to be joined to answer

user query

Schema Matching(with errors)

Info. Need

Page 12: Automatically Incorporating New Sources in Keyword-Search ... · based Data Integration SIGMOD 2010, June 9, 2010 ... – Mistakes not revealed until queries posed – Too many attributes

The Reality Today

4

Page 13: Automatically Incorporating New Sources in Keyword-Search ... · based Data Integration SIGMOD 2010, June 9, 2010 ... – Mistakes not revealed until queries posed – Too many attributes

The Reality Today

• Multiple steps requiring expert integrator– Poll users, create global schema– Semi-automatically generate schema mappings

• Fix errors

– Create query forms• Fix errors revealed by bad data

4

Page 14: Automatically Incorporating New Sources in Keyword-Search ... · based Data Integration SIGMOD 2010, June 9, 2010 ... – Mistakes not revealed until queries posed – Too many attributes

The Reality Today

• Multiple steps requiring expert integrator– Poll users, create global schema– Semi-automatically generate schema mappings

• Fix errors

– Create query forms• Fix errors revealed by bad data

• But this doesn’t work well for discovery (ad hoc) queries, e.g., in science– Too many sources, queries to administer– Mistakes not revealed until queries posed– Too many attributes for pairwise schema matching

4

Page 15: Automatically Incorporating New Sources in Keyword-Search ... · based Data Integration SIGMOD 2010, June 9, 2010 ... – Mistakes not revealed until queries posed – Too many attributes

4

d

c

bP

M

G

Data Sources

Q: Query-driven, Admin-Free Integration

Page 16: Automatically Incorporating New Sources in Keyword-Search ... · based Data Integration SIGMOD 2010, June 9, 2010 ... – Mistakes not revealed until queries posed – Too many attributes

4

d

c

bP

M

G

Data Sources

Q: Query-driven, Admin-Free Integration

P b

M

G

c

dMatchingScores

SchemaGraph

Schema Matching

(Alignment)

Page 17: Automatically Incorporating New Sources in Keyword-Search ... · based Data Integration SIGMOD 2010, June 9, 2010 ... – Mistakes not revealed until queries posed – Too many attributes

4

d

c

bP

M

G

Data Sources

Q: Query-driven, Admin-Free Integration

P b

M

G

c

dMatchingScores

SchemaGraph

Schema Matching

(Alignment)

Ranked Query

Answering

“a b”KeywordQuery

Results + feedback

Page 18: Automatically Incorporating New Sources in Keyword-Search ... · based Data Integration SIGMOD 2010, June 9, 2010 ... – Mistakes not revealed until queries posed – Too many attributes

4

d

c

bP

M

G

Data Sources

Q: Query-driven, Admin-Free Integration

P b

M

G

c

dMatchingScores

SchemaGraph

Schema Matching

(Alignment)

MatchingCorrection

Ranked Query

Answering

“a b”KeywordQuery

Results + feedback

Page 19: Automatically Incorporating New Sources in Keyword-Search ... · based Data Integration SIGMOD 2010, June 9, 2010 ... – Mistakes not revealed until queries posed – Too many attributes

4

d

c

bP

M

G

Data Sources

Q: Query-driven, Admin-Free Integration

NNewSource

P b

M

G

c

dMatchingScores

SchemaGraph

Schema Matching

(Alignment)

MatchingCorrection

Ranked Query

Answering

“a b”KeywordQuery

Results + feedback

Page 20: Automatically Incorporating New Sources in Keyword-Search ... · based Data Integration SIGMOD 2010, June 9, 2010 ... – Mistakes not revealed until queries posed – Too many attributes

4

d

c

bP

M

G

Data Sources

Q: Query-driven, Admin-Free Integration

NNewSource

P b

M

G

c

dMatchingScores

SchemaGraph

View-based Pruning ofMatching

Schema Matching

(Alignment)

MatchingCorrection

Ranked Query

Answering

“a b”KeywordQuery

Results + feedback

Page 21: Automatically Incorporating New Sources in Keyword-Search ... · based Data Integration SIGMOD 2010, June 9, 2010 ... – Mistakes not revealed until queries posed – Too many attributes

4

d

c

bP

M

G

Data Sources

1. Discovering Schema Matches

NNewSource

P b

M

G

c

dMatchingScores

SchemaGraph

View-based Pruning ofMatching

Schema Matching

(Alignment)

MatchingCorrection

Ranked Query

Answering

“a b”KeywordQuery

Results + feedback

Page 22: Automatically Incorporating New Sources in Keyword-Search ... · based Data Integration SIGMOD 2010, June 9, 2010 ... – Mistakes not revealed until queries posed – Too many attributes

1. Discovering Schema Matches

Schema Matchers

Page 23: Automatically Incorporating New Sources in Keyword-Search ... · based Data Integration SIGMOD 2010, June 9, 2010 ... – Mistakes not revealed until queries posed – Too many attributes

1. Discovering Schema Matches

• Metadata Level– COMA++ [Do and Rahm, 2007]

• pairwise column comparisons necessary

COMA++

Schema Matchers

Page 24: Automatically Incorporating New Sources in Keyword-Search ... · based Data Integration SIGMOD 2010, June 9, 2010 ... – Mistakes not revealed until queries posed – Too many attributes

1. Discovering Schema Matches

• Metadata Level– COMA++ [Do and Rahm, 2007]

• pairwise column comparisons necessary

• Instance Level– Based on Modified Adsorption (MAD) [next slide]

• random-walk inspired, previously used in NLP problems• pairwise column comparisons not necessary • parallelizable, suitable for large datasets

COMA++

.

.

.

MAD

Schema Matchers

Page 25: Automatically Incorporating New Sources in Keyword-Search ... · based Data Integration SIGMOD 2010, June 9, 2010 ... – Mistakes not revealed until queries posed – Too many attributes

Schema Matching using MAD

GO_ID Locus

GO30 AT2G34

GO12 AT1G35

DB2ID Name

GO12 aco-2

GO25 p3

DB1Loci

AT2G35

AT1G36

DB3

Page 26: Automatically Incorporating New Sources in Keyword-Search ... · based Data Integration SIGMOD 2010, June 9, 2010 ... – Mistakes not revealed until queries posed – Too many attributes

Schema Matching using MAD

GO_ID Locus

GO30 AT2G34

GO12 AT1G35

DB2ID Name

GO12 aco-2

GO25 p3

DB1Loci

AT2G35

AT1G36

DB3

Value Node

Attribute Node

GO12 P3GO25 AT2G34 GO30 AT1G35 aco-2

DB2.GO_ID

DB2.LocusDB1.ID DB1.Name DB3.

Loci

AT1G36

Page 27: Automatically Incorporating New Sources in Keyword-Search ... · based Data Integration SIGMOD 2010, June 9, 2010 ... – Mistakes not revealed until queries posed – Too many attributes

Schema Matching using MAD

GO_ID Locus

GO30 AT2G34

GO12 AT1G35

DB2ID Name

GO12 aco-2

GO25 p3

DB1Loci

AT2G35

AT1G36

DB3

GO12 P3GO25 AT2G34 GO30 AT1G35 aco-2

DB2.GO_ID

DB2.LocusDB1.ID DB1.Name DB3.

Loci

AT1G36

L1 L2 L3 L4 L5

Seed Label (unique)

Page 28: Automatically Incorporating New Sources in Keyword-Search ... · based Data Integration SIGMOD 2010, June 9, 2010 ... – Mistakes not revealed until queries posed – Too many attributes

Schema Matching using MAD

GO_ID Locus

GO30 AT2G34

GO12 AT1G35

DB2ID Name

GO12 aco-2

GO25 p3

DB1Loci

AT2G35

AT1G36

DB3

GO12 P3GO25 AT2G34 GO30 AT1G35 aco-2

DB2.GO_ID

DB2.LocusDB1.ID DB1.Name DB3.

Loci

AT1G36

L1 L2 L3 L4 L5

L3L1

Seed Label (unique)

Page 29: Automatically Incorporating New Sources in Keyword-Search ... · based Data Integration SIGMOD 2010, June 9, 2010 ... – Mistakes not revealed until queries posed – Too many attributes

Schema Matching using MAD

GO_ID Locus

GO30 AT2G34

GO12 AT1G35

DB2ID Name

GO12 aco-2

GO25 p3

DB1Loci

AT2G35

AT1G36

DB3

GO12 P3GO25 AT2G34 GO30 AT1G35 aco-2

DB2.GO_ID

DB2.LocusDB1.ID DB1.Name DB3.

Loci

AT1G36

L1 L2 L3 L4 L5

L3L1

L1L3

L4L5

L5L4L2

Seed Label (unique)

All Labels Propagated in Parallel by MAD

Page 30: Automatically Incorporating New Sources in Keyword-Search ... · based Data Integration SIGMOD 2010, June 9, 2010 ... – Mistakes not revealed until queries posed – Too many attributes

4

d

c

bP

M

G

Data Sources

2. Correcting Matching Errors

NNewSource

P b

M

G

c

dMatchingScores

SchemaGraph

View-based Pruning ofMatching

Schema Matching

(Alignment)

MatchingCorrection

Ranked Query

Answering

“a b”KeywordQuery

Results + feedback

Page 31: Automatically Incorporating New Sources in Keyword-Search ... · based Data Integration SIGMOD 2010, June 9, 2010 ... – Mistakes not revealed until queries posed – Too many attributes

2. Correcting Matching Errors

10

P b

M

G

c

d

Page 32: Automatically Incorporating New Sources in Keyword-Search ... · based Data Integration SIGMOD 2010, June 9, 2010 ... – Mistakes not revealed until queries posed – Too many attributes

Keyword Matched Sources in Q’s Schema Graph

2. Correcting Matching Errors

10

P b

M

G

c

d

Schema Graph

Page 33: Automatically Incorporating New Sources in Keyword-Search ... · based Data Integration SIGMOD 2010, June 9, 2010 ... – Mistakes not revealed until queries posed – Too many attributes

Keyword Matched Sources in Q’s Schema Graph

2. Correcting Matching Errors

10

0.07

0.1

0.1

0.1

0.04

0.1

P b

M

G

c

d

Schema Graph

Edge Cost Encodes User Preference (lower is better)

Page 34: Automatically Incorporating New Sources in Keyword-Search ... · based Data Integration SIGMOD 2010, June 9, 2010 ... – Mistakes not revealed until queries posed – Too many attributes

Keyword Matched Sources in Q’s Schema Graph

2. Correcting Matching Errors

10

0.07

0.1

0.1

0.1

0.04

0.1

P b

M

G

c

d

Schema Graph

Edge Cost Encodes User Preference (lower is better)

Matching Error: How can we assign it higher (worse) cost?

Page 35: Automatically Incorporating New Sources in Keyword-Search ... · based Data Integration SIGMOD 2010, June 9, 2010 ... – Mistakes not revealed until queries posed – Too many attributes

Correcting Error: Learning New Edge Costs

.

.

.

Top

Bottom

0.1

0.1

0.1

0.1

b

d G

M

P

0.07

0.1

0.10.04

0.1

b c

d G

M

P

11

[Talukdar+, VLDB 2008]

Cost= 0.4

Cost= 0.41

Page 36: Automatically Incorporating New Sources in Keyword-Search ... · based Data Integration SIGMOD 2010, June 9, 2010 ... – Mistakes not revealed until queries posed – Too many attributes

Correcting Error: Learning New Edge Costs

Query

Query*

.

.

.

Query

.

.

.

Top

Bottom

0.1

0.1

0.1

0.1

b

d G

M

P

0.07

0.1

0.10.04

0.1

b c

d G

M

P

11

[Talukdar+, VLDB 2008]

Cost= 0.4

Cost= 0.41

Page 37: Automatically Incorporating New Sources in Keyword-Search ... · based Data Integration SIGMOD 2010, June 9, 2010 ... – Mistakes not revealed until queries posed – Too many attributes

Correcting Error: Learning New Edge Costs

Query

Query*

.

.

.

Query

.

.

.

Tuples

.

.

.

Top

Bottom

0.1

0.1

0.1

0.1

b

d G

M

P

0.07

0.1

0.10.04

0.1

b c

d G

M

P

11

[Talukdar+, VLDB 2008]

Cost= 0.4

Cost= 0.41

Page 38: Automatically Incorporating New Sources in Keyword-Search ... · based Data Integration SIGMOD 2010, June 9, 2010 ... – Mistakes not revealed until queries posed – Too many attributes

Correcting Error: Learning New Edge Costs

Query

Query*

.

.

.

Query

.

.

.

Tuples

.

.

.

Top

Bottom

feedback on answers, which is what the user cares about

0.1

0.1

0.1

0.1

b

d G

M

P

0.07

0.1

0.10.04

0.1

b c

d G

M

P

11

[Talukdar+, VLDB 2008]

Cost= 0.4

Cost= 0.41

Page 39: Automatically Incorporating New Sources in Keyword-Search ... · based Data Integration SIGMOD 2010, June 9, 2010 ... – Mistakes not revealed until queries posed – Too many attributes

Correcting Error: Learning New Edge Costs

Query

Query*

.

.

.

Query

.

.

.

Tuples

.

.

.

Top

Bottom

updated cost

0.1

0.1

0.1

0.1

b

d G

M

P

0.07

0.1

0.10.04

0.1

b c

d G

M

P

0.5

11

Cost= 0.41

Cost= 0.8

Page 40: Automatically Incorporating New Sources in Keyword-Search ... · based Data Integration SIGMOD 2010, June 9, 2010 ... – Mistakes not revealed until queries posed – Too many attributes

Decomposition of Edge Cost

12

TABLE1 TABLE 2

Page 41: Automatically Incorporating New Sources in Keyword-Search ... · based Data Integration SIGMOD 2010, June 9, 2010 ... – Mistakes not revealed until queries posed – Too many attributes

Decomposition of Edge Cost

12

FeatureName

Matching Cost

Coefficient(Values Learned)

COMA++ Matched 0.90 wCOMA++

MAD Matched 0.7 wLP

--- --- ---

TABLE1 TABLE 2

Page 42: Automatically Incorporating New Sources in Keyword-Search ... · based Data Integration SIGMOD 2010, June 9, 2010 ... – Mistakes not revealed until queries posed – Too many attributes

Decomposition of Edge Cost

12

FeatureName

Matching Cost

Coefficient(Values Learned)

COMA++ Matched 0.90 wCOMA++

MAD Matched 0.7 wLP

--- --- ---

TABLE1 TABLE 2

Edge Cost = 0.9 * WCOMA++ + 0.7 * WLP

Page 43: Automatically Incorporating New Sources in Keyword-Search ... · based Data Integration SIGMOD 2010, June 9, 2010 ... – Mistakes not revealed until queries posed – Too many attributes

Decomposition of Edge Cost

12

FeatureName

Matching Cost

Coefficient(Values Learned)

COMA++ Matched 0.90 wCOMA++

MAD Matched 0.7 wLP

--- --- --- Learned

TABLE1 TABLE 2

Edge Cost = 0.9 * WCOMA++ + 0.7 * WLP

Page 44: Automatically Incorporating New Sources in Keyword-Search ... · based Data Integration SIGMOD 2010, June 9, 2010 ... – Mistakes not revealed until queries posed – Too many attributes

Learning: Incorporating User Feedback

• Model feedback incorporation as a constrained optimization problem.

13

Page 45: Automatically Incorporating New Sources in Keyword-Search ... · based Data Integration SIGMOD 2010, June 9, 2010 ... – Mistakes not revealed until queries posed – Too many attributes

Learning: Incorporating User Feedback

MIRA Algorithm(Crammer et al., 2006)

• Model feedback incorporation as a constrained optimization problem.

13

Page 46: Automatically Incorporating New Sources in Keyword-Search ... · based Data Integration SIGMOD 2010, June 9, 2010 ... – Mistakes not revealed until queries posed – Too many attributes

Learning: Incorporating User Feedback

MIRA Algorithm(Crammer et al., 2006)

• Model feedback incorporation as a constrained optimization problem.

13

New Model

Parameters

CurrentModel

Parameters

Page 47: Automatically Incorporating New Sources in Keyword-Search ... · based Data Integration SIGMOD 2010, June 9, 2010 ... – Mistakes not revealed until queries posed – Too many attributes

Learning: Incorporating User Feedback

MIRA Algorithm(Crammer et al., 2006)

• Model feedback incorporation as a constrained optimization problem.

13

New Model

Parameters

CurrentModel

Parameters

Tree Cost

Loss

Page 48: Automatically Incorporating New Sources in Keyword-Search ... · based Data Integration SIGMOD 2010, June 9, 2010 ... – Mistakes not revealed until queries posed – Too many attributes

Learning: Incorporating User Feedback

MIRA Algorithm(Crammer et al., 2006)

Tree whose tuples user likes

Tree whose tuples user doesn’t like.

• Model feedback incorporation as a constrained optimization problem.

13

New Model

Parameters

CurrentModel

Parameters

Tree Cost

Loss

Page 49: Automatically Incorporating New Sources in Keyword-Search ... · based Data Integration SIGMOD 2010, June 9, 2010 ... – Mistakes not revealed until queries posed – Too many attributes

4

d

c

bP

M

G

Data Sources

3. Where to Align New Source?

NNewSource

P b

M

G

c

dMatchingScores

SchemaGraph

View-based Pruning ofMatching

Schema Matching

(Alignment)

MatchingCorrection

Ranked Query

Answering

“a b”KeywordQuery

Results + feedback

Page 50: Automatically Incorporating New Sources in Keyword-Search ... · based Data Integration SIGMOD 2010, June 9, 2010 ... – Mistakes not revealed until queries posed – Too many attributes

15

Where to Match a New Source?

3

Page 51: Automatically Incorporating New Sources in Keyword-Search ... · based Data Integration SIGMOD 2010, June 9, 2010 ... – Mistakes not revealed until queries posed – Too many attributes

15

GO

acc term_id

InterPro2GO

go_id entry_ac

InterPro Pub

pub_idtitle

InterProEntry

name entry_ac

InterProEntry 2 Pub

entry_ac pub_id

Keyword CostNeighborhood

plasma membrane

term

0 0 0 0 0 0

0 00 0

1

0.5

0.25

0.25

2

2

22

A schema graph with 5 sources and 2 keywords: term and plasma membrane. The shaded oval includes all nodes reachable with cost ≤ 2 from at least one of the keywords.

Where to Match a New Source?

3

Keywords

Page 52: Automatically Incorporating New Sources in Keyword-Search ... · based Data Integration SIGMOD 2010, June 9, 2010 ... – Mistakes not revealed until queries posed – Too many attributes

15

GO

acc term_id

InterPro2GO

go_id entry_ac

InterPro Pub

pub_idtitle

InterProEntry

name entry_ac

InterProEntry 2 Pub

entry_ac pub_id

Keyword CostNeighborhood

plasma membrane

term

0 0 0 0 0 0

0 00 0

1

0.5

0.25

0.25

2

2

22

A schema graph with 5 sources and 2 keywords: term and plasma membrane. The shaded oval includes all nodes reachable with cost ≤ 2 from at least one of the keywords.

Where to Match a New Source?

3

Neighborhood imposed by cost of

kth best answer.

Keywords

Page 53: Automatically Incorporating New Sources in Keyword-Search ... · based Data Integration SIGMOD 2010, June 9, 2010 ... – Mistakes not revealed until queries posed – Too many attributes

15

GO

acc term_id

InterPro2GO

go_id entry_ac

InterPro Pub

pub_idtitle

InterProEntry

name entry_ac

InterProEntry 2 Pub

entry_ac pub_id

Keyword CostNeighborhood

plasma membrane

term

0 0 0 0 0 0

0 00 0

1

0.5

0.25

0.25

2

2

22

A schema graph with 5 sources and 2 keywords: term and plasma membrane. The shaded oval includes all nodes reachable with cost ≤ 2 from at least one of the keywords.

Where to Match a New Source?

?New

Source

3

Neighborhood imposed by cost of

kth best answer.

Keywords

Page 54: Automatically Incorporating New Sources in Keyword-Search ... · based Data Integration SIGMOD 2010, June 9, 2010 ... – Mistakes not revealed until queries posed – Too many attributes

15

GO

acc term_id

InterPro2GO

go_id entry_ac

InterPro Pub

pub_idtitle

InterProEntry

name entry_ac

InterProEntry 2 Pub

entry_ac pub_id

Keyword CostNeighborhood

plasma membrane

term

0 0 0 0 0 0

0 00 0

1

0.5

0.25

0.25

2

2

22

A schema graph with 5 sources and 2 keywords: term and plasma membrane. The shaded oval includes all nodes reachable with cost ≤ 2 from at least one of the keywords.

Where to Match a New Source?

?New

Source

3

Matchings outside this neighborhood is not going to affect k-best answers

(i.e., current view).

Neighborhood imposed by cost of

kth best answer.

Keywords

Page 55: Automatically Incorporating New Sources in Keyword-Search ... · based Data Integration SIGMOD 2010, June 9, 2010 ... – Mistakes not revealed until queries posed – Too many attributes

15

GO

acc term_id

InterPro2GO

go_id entry_ac

InterPro Pub

pub_idtitle

InterProEntry

name entry_ac

InterProEntry 2 Pub

entry_ac pub_id

Keyword CostNeighborhood

plasma membrane

term

0 0 0 0 0 0

0 00 0

1

0.5

0.25

0.25

2

2

22

A schema graph with 5 sources and 2 keywords: term and plasma membrane. The shaded oval includes all nodes reachable with cost ≤ 2 from at least one of the keywords.

Where to Match a New Source?

?New

Source

3

View Based AlignerConsider only those matchings which are likely to affect query

results, as otherwise there will be no feedback from user.

Matchings outside this neighborhood is not going to affect k-best answers

(i.e., current view).

Neighborhood imposed by cost of

kth best answer.

Keywords

Page 56: Automatically Incorporating New Sources in Keyword-Search ... · based Data Integration SIGMOD 2010, June 9, 2010 ... – Mistakes not revealed until queries posed – Too many attributes

Experiments

16

Page 57: Automatically Incorporating New Sources in Keyword-Search ... · based Data Integration SIGMOD 2010, June 9, 2010 ... – Mistakes not revealed until queries posed – Too many attributes

Experiments

Two questions:

16

Page 58: Automatically Incorporating New Sources in Keyword-Search ... · based Data Integration SIGMOD 2010, June 9, 2010 ... – Mistakes not revealed until queries posed – Too many attributes

Experiments

Two questions:I. Can we repair alignment errors by exploiting

user feedback over answers?

16

Page 59: Automatically Incorporating New Sources in Keyword-Search ... · based Data Integration SIGMOD 2010, June 9, 2010 ... – Mistakes not revealed until queries posed – Too many attributes

Experiments

Two questions:I. Can we repair alignment errors by exploiting

user feedback over answers?

II.Can we reduce the number of pairwise comparisons necessary during alignment discovery for new source?

16

Page 60: Automatically Incorporating New Sources in Keyword-Search ... · based Data Integration SIGMOD 2010, June 9, 2010 ... – Mistakes not revealed until queries posed – Too many attributes

1. Correcting Schema Matching Errors: Setup

17

Page 61: Automatically Incorporating New Sources in Keyword-Search ... · based Data Integration SIGMOD 2010, June 9, 2010 ... – Mistakes not revealed until queries posed – Too many attributes

1. Correcting Schema Matching Errors: Setup

17

go_term

interpro_interpro2go

interpro_entry2pub interpro_method2pub

interpro_methodinterpro_pubinterpro_entry

interpro_journal

Schema Graph (InterPro-GO) with Gold Matchings

Page 62: Automatically Incorporating New Sources in Keyword-Search ... · based Data Integration SIGMOD 2010, June 9, 2010 ... – Mistakes not revealed until queries posed – Too many attributes

1. Correcting Schema Matching Errors: Setup

17

go_term

interpro_interpro2go

interpro_entry2pub interpro_method2pub

interpro_methodinterpro_pubinterpro_entry

interpro_journal

Schema Graph (InterPro-GO) with Gold Matchings

• Start with just the tables

Page 63: Automatically Incorporating New Sources in Keyword-Search ... · based Data Integration SIGMOD 2010, June 9, 2010 ... – Mistakes not revealed until queries posed – Too many attributes

1. Correcting Schema Matching Errors: Setup

17

go_term

interpro_interpro2go

interpro_entry2pub interpro_method2pub

interpro_methodinterpro_pubinterpro_entry

interpro_journal

Schema Graph (InterPro-GO) with Gold Matchings

• Start with just the tables

• Use automatic schema matchers (e.g., COMA++, MAD)

Page 64: Automatically Incorporating New Sources in Keyword-Search ... · based Data Integration SIGMOD 2010, June 9, 2010 ... – Mistakes not revealed until queries posed – Too many attributes

1. Correcting Schema Matching Errors: Setup

17

go_term

interpro_interpro2go

interpro_entry2pub interpro_method2pub

interpro_methodinterpro_pubinterpro_entry

interpro_journal

Schema Graph (InterPro-GO) with Gold Matchings

• Start with just the tables

• Use automatic schema matchers (e.g., COMA++, MAD)

• Rank matchings based on cost learned from keyword queries and feedback over answers (using Q)

Page 65: Automatically Incorporating New Sources in Keyword-Search ... · based Data Integration SIGMOD 2010, June 9, 2010 ... – Mistakes not revealed until queries posed – Too many attributes

1. Correcting Schema Matching Errors: Setup

17

go_term

interpro_interpro2go

interpro_entry2pub interpro_method2pub

interpro_methodinterpro_pubinterpro_entry

interpro_journal

Schema Graph (InterPro-GO) with Gold Matchings

• Start with just the tables

• Use automatic schema matchers (e.g., COMA++, MAD)

• Rank matchings based on cost learned from keyword queries and feedback over answers (using Q)

• Compute precision-recall w.r.t. the gold matchings (left figure)

Page 66: Automatically Incorporating New Sources in Keyword-Search ... · based Data Integration SIGMOD 2010, June 9, 2010 ... – Mistakes not revealed until queries posed – Too many attributes

I. Correcting Schema Matching Errors

18

Page 67: Automatically Incorporating New Sources in Keyword-Search ... · based Data Integration SIGMOD 2010, June 9, 2010 ... – Mistakes not revealed until queries posed – Too many attributes

I. Correcting Schema Matching Errors

18

0.25

0.438

0.625

0.813

1

0.125 0.25 0.375 0.5 0.625 0.75 0.875 1

Precision-Recall Plots for Various Methods

Pre

cisi

on

Recall

COMA++ MAD Q

Page 68: Automatically Incorporating New Sources in Keyword-Search ... · based Data Integration SIGMOD 2010, June 9, 2010 ... – Mistakes not revealed until queries posed – Too many attributes

I. Correcting Schema Matching Errors

18

0.25

0.438

0.625

0.813

1

0.125 0.25 0.375 0.5 0.625 0.75 0.875 1

Precision-Recall Plots for Various Methods

Pre

cisi

on

Recall

COMA++ MAD Q

Learning with Q helps correct schema

matching errors.

Page 69: Automatically Incorporating New Sources in Keyword-Search ... · based Data Integration SIGMOD 2010, June 9, 2010 ... – Mistakes not revealed until queries posed – Too many attributes

I. Correcting Schema Matching Errors (contd.)

19

Page 70: Automatically Incorporating New Sources in Keyword-Search ... · based Data Integration SIGMOD 2010, June 9, 2010 ... – Mistakes not revealed until queries posed – Too many attributes

I. Correcting Schema Matching Errors (contd.)

19

0.15

1.363

2.575

3.788

5

1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39

Gold vs Non-Gold Edge Costs After Increasing Feedback

Ave

rage

Ed

ge C

ost

s (L

ow

er

is B

ett

er)

Feedback Step Number

Avg. Gold Edge CostAvg. Non-Gold Edge Cost

Page 71: Automatically Incorporating New Sources in Keyword-Search ... · based Data Integration SIGMOD 2010, June 9, 2010 ... – Mistakes not revealed until queries posed – Too many attributes

I. Correcting Schema Matching Errors (contd.)

19

0.15

1.363

2.575

3.788

5

1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39

Gold vs Non-Gold Edge Costs After Increasing Feedback

Ave

rage

Ed

ge C

ost

s (L

ow

er

is B

ett

er)

Feedback Step Number

Avg. Gold Edge CostAvg. Non-Gold Edge Cost

Learning with Q helps identify the correct (gold) alignments.

Page 72: Automatically Incorporating New Sources in Keyword-Search ... · based Data Integration SIGMOD 2010, June 9, 2010 ... – Mistakes not revealed until queries posed – Too many attributes

20

II. Reducing Pairwise Comparisons during New Source Integration

Page 73: Automatically Incorporating New Sources in Keyword-Search ... · based Data Integration SIGMOD 2010, June 9, 2010 ... – Mistakes not revealed until queries posed – Too many attributes

20

II. Reducing Pairwise Comparisons during New Source Integration

0

5000.0

10000.0

15000.0

20000.0

18 100 500

# P

air

wis

e C

om

pari

son

s

Number of Tables in the Schema Graph

Exhaustive ViewBasedAligner

Page 74: Automatically Incorporating New Sources in Keyword-Search ... · based Data Integration SIGMOD 2010, June 9, 2010 ... – Mistakes not revealed until queries posed – Too many attributes

20

II. Reducing Pairwise Comparisons during New Source Integration

0

5000.0

10000.0

15000.0

20000.0

18 100 500

# P

air

wis

e C

om

pari

son

s

Number of Tables in the Schema Graph

Exhaustive ViewBasedAligner

View Based Aligner Significantly Reduces the Number of Comparisons.

Page 75: Automatically Incorporating New Sources in Keyword-Search ... · based Data Integration SIGMOD 2010, June 9, 2010 ... – Mistakes not revealed until queries posed – Too many attributes

Related Work• B. Alexe, L. Chiticariu, R. J. Miller, and W.-C. Tan. Muse: Mapping

understand- ing and design by example. In ICDE 2008

• Laura Chiticariu, Phokion G. Kolaitis, Lucian Popa: "Interactive Generation of Integrated Schemas". SIGMOD Conference 2008

• Anish Das Sarma, Luna Dong, Alon Halevy. Bootstrapping Pay-As-You-Go Data Integration System. SIGMOD 2008

• Fagin+, Clio: Schema Mapping Creation and Data Exchange. Conceptual Modeling: Foundations and Applications 2009

• S.R. Jeffery, M.J. Franklin, and A.Y. Halevy. Pay-as-you-go user feedback for dataspace systems. In SIGMOD, 2008

• Talukdar+. Learning to create data-integrating queries. In VLDB, 2008.

21

Page 76: Automatically Incorporating New Sources in Keyword-Search ... · based Data Integration SIGMOD 2010, June 9, 2010 ... – Mistakes not revealed until queries posed – Too many attributes

Summary

22

Page 77: Automatically Incorporating New Sources in Keyword-Search ... · based Data Integration SIGMOD 2010, June 9, 2010 ... – Mistakes not revealed until queries posed – Too many attributes

Summary

• A new data-centric schema matching algorithm based on Modified Adsorption (MAD)– doesn’t require pairwise column comparison, scalable

22

Page 78: Automatically Incorporating New Sources in Keyword-Search ... · based Data Integration SIGMOD 2010, June 9, 2010 ... – Mistakes not revealed until queries posed – Too many attributes

Summary

• A new data-centric schema matching algorithm based on Modified Adsorption (MAD)– doesn’t require pairwise column comparison, scalable

• A system architecture that– combines off-the-shelf schema matchers’ alignments– exploits user feedback over answers to repair matching

errors

22

Page 79: Automatically Incorporating New Sources in Keyword-Search ... · based Data Integration SIGMOD 2010, June 9, 2010 ... – Mistakes not revealed until queries posed – Too many attributes

Summary

• A new data-centric schema matching algorithm based on Modified Adsorption (MAD)– doesn’t require pairwise column comparison, scalable

• A system architecture that– combines off-the-shelf schema matchers’ alignments– exploits user feedback over answers to repair matching

errors

• Integrates new sources– through incremental updates to schema matchings

22

Page 80: Automatically Incorporating New Sources in Keyword-Search ... · based Data Integration SIGMOD 2010, June 9, 2010 ... – Mistakes not revealed until queries posed – Too many attributes
Page 81: Automatically Incorporating New Sources in Keyword-Search ... · based Data Integration SIGMOD 2010, June 9, 2010 ... – Mistakes not revealed until queries posed – Too many attributes

Thank You!

Poster: Tomorrow (Thu), 3:30pm Cosmopolitan AB