data integration with uncertainty xin (luna) dong data management dept @ at&t joint work w. mike...

80
Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma (Stanford), Daisy Zhe Wang (Berkeley), Cong Yu (Yahoo!)

Upload: julian-collins

Post on 18-Dec-2015

215 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma

Data Integration with Uncertainty

Xin (Luna) DongData Management Dept @ AT&T

Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma (Stanford), Daisy Zhe Wang

(Berkeley), Cong Yu (Yahoo!)

Page 2: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma

Many Applications Need to Manage Heterogeneous Data Sources

D1

D2

D3

D4

D5

Bib(title, authors, conf, year)

Author(aid, name)Paper(pid, title, year)AuthoredBy(aid,pid)

Page 3: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma

Traditional Data Integration Systems [Ullman, 1997]

D1

D2

D3

D4

D5

Mediated Schema

MappingSELECT P.title AS title, A.name AS author, NULL AS conf, P.year AS year, FROM Author AS A, Paper AS P, AuthoredBy AS BWHERE A.aid=B.aid AND P.pid=B.pid

Author(aid, name)Paper(pid, title, year)AuthoredBy(aid,pid)

Publication(title, author, conf, year)

Page 4: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma

Querying on Traditional Data Integration Systems

D1

D2

D3

D4

D5

Mediated Schema

QQQ

Q1

Q2Q4

Q

QQ

Q5

Q3

Page 5: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma

Uncertainty Can Occur at Three Levels in Data Integration Applications

D1

D2

D3

D4

D5

Mediated Schema

QQ

I. Data Level

II. Semantics Level

III. Query Level

Page 6: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma

Mediated SchemaP-Med-Schema

Q

Data Integration with Uncertainty

D1

D2

D3

D4

D5

I. Probabilistic Model

PD1

PD2

PD3

PD4

PD5

PM1

PM2

PM3

PM4

PM5

Q1..Qm

II. Keyword QueryReformulationIII. Top-K Query

Answering

Page 7: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma

Our Approach

Many applications require only approximate answers and do not need full integration.

Pay-as-you-go data integration [Halevy, et al., 2005] Handle uncertainty and provide best-effort services

from the outset Evolve mappings between data sources over time

This talk: build a completely self-configuring data integration system by applying probabilistic models.

Page 8: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma

P-Med-Schema

Q

D1

D2

D3

D4

D5PD1

PD2

PD3

PD4

PD5

PM1

PM2

PM3

PM4

PM5

Q1..Qm

Technical Contributions

1. Probabilistic schema mappings• Query answering w.r.t. pMs

[VLDB’07]• Automatic creation [Sigmod’08]

2. Probabilistic mediated schema [Sigmod’08]• Expressiveness• Automatic creation

3. Probabilistic Integrity Constraints[WebDB’09]

• Automatic creation• An application

Page 9: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma

P-Med-Schema

Q

D1

D2

D3

D4

D5PD1

PD2

PD3

PD4

PD5

PM1

PM2

PM3

PM4

PM5

Q1..Qm

Roadmap

1. Probabilistic schema mappings• Query answering w.r.t. pMs

[VLDB’07]• Automatic creation [Sigmod’08]

2. Probabilistic mediated schema [Sigmod’08]• Expressiveness• Automatic creation

3. Probabilistic Integrity Constraints [WebDB’09] • Automatic creation• An application

Page 10: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma

Example Probabilistic Mappings

Mediated Schema

T(name, email, mailing-addr, home-addr, office-addr)

Q: SELECT mailing-addr FROM T

Q1: SELECT current-addr FROM S

Q2: SELECT permanent-addr FROM S

Q3: SELECT email-addr FROM S

PName Email-addr Current-addr Permanent-addr

Alice alice@ Mountain View Sunnyvale

Bob bob@ Sunnyvale Sunnyvale

S

T(name, email, mailing-addr, home-addr, office-addr) 0.5

S(pname, email-addr, current-addr, permanent-addr)

m1

T(name, email, mailing-addr, home-addr, office-addr) 0.4

S(pname, email-addr, current-addr, permanent-addr)

m2

T(name, email, mailing-addr, home-addr, office-addr) 0.1

S(pname, email-addr, current-addr, permanent-addr)

m3

Tuple Prob

(‘Sunnyvale’)

(‘Mountain View’)

(‘alice@’)

(‘bob@’)

0.9

0.5

0.1

0.1

Page 11: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma

Q1: SELECT current-addr FROM S

Q2: SELECT permanent-addr FROM S

Mediated Schema

T(name, email, mailing-addr, home-addr, office-addr)

Q: SELECT mailing-addr FROM T

Q1: SELECT current-addr FROM S

Q2: SELECT permanent-addr FROM S

Q3: SELECT email-addr FROM S

PName Email-addr Current-addr Permanent-addr

Alice alice@ Mountain View Sunnyvale

Bob bob@ Sunnyvale Sunnyvale

S

T(name, email, mailing-addr, home-addr, office-addr) 0.5

S(pname, email-addr, current-addr, permanent-addr)

m1

T(name, email, mailing-addr, home-addr, office-addr) 0.4

S(pname, email-addr, current-addr, permanent-addr)

m2

T(name, email, mailing-addr, home-addr, office-addr) 0.1

S(pname, email-addr, current-addr, permanent-addr)

m3

Tuple Prob

(‘Sunnyvale’)

(‘Mountain View’)

(‘alice@’)

(‘bob@’)

0.9

0.5

0.1

0.1

Top-k Query Answering w.r.t. Probabilistic Mappings

Tuple Prob

(‘Sunnyvale’)(‘Mountain View’)

(‘alice@’)

(‘bob@’)

0.90.5

0.1

0.1

Page 12: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma

Schema Mapping

S=(pname, email-addr, home-addr, office-addr)

T=(name, mailing-addr)

Mappings: one-to-one schema matching Queries: select-project-join queries

Page 13: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma

Probabilistic Mapping

S=(pname, email-addr, home-addr, office-addr)

T=(name, mailing-addr)

Possible MappingProbabil

ity{(pname,name),(home-addr, mailing-addr)}

0.5

{(pname,name),(office-addr, mailing-addr)}

0.4

{(pname,name),(email-addr, mailing-addr)}

0.1

Page 14: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma

By-Table v.s. By-Tuple Semantics

Page 15: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma

By-Table v.s. By-Tuple Semantics

pname

email-addr

home-addroffice-addr

Alice alice@Mountain

ViewSunnyvale

Bob bob@ Sunnyvale Sunnyvale

Ds=

name

mailing-addr

AliceMountain

View

Bob Sunnyvale

DT=nam

emailing-

addr

Alice Sunnyvale

Bob Sunnyvale

name

mailing-addr

Alice alice@

Bob bob@ Pr(m1)=0.5 Pr(m2)=0.4 Pr(m3)=0.1

Page 16: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma

By-Table v.s. By-Tuple Semantics

pname

email-addr

home-addroffice-addr

Alice alice@Mountain

ViewSunnyvale

Bob bob@ Sunnyvale Sunnyvale

Ds=

name

mailing-addr

AliceMountain

View

Bob bob@

DT=nam

emailing-

addr

AliceSunnyval

e

Bob bob@

name

mailing-addr

Alice alice@

Bob bob@ Pr(<m1,m3>)=0.05 Pr(<m2,m3>)=0.04 Pr(<m3,m3>)=0.01

Page 17: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma

By-Table Query Answering

pname

email-addr

home-addroffice-addr

Alice alice@Mountain

ViewSunnyvale

Bob bob@ Sunnyvale Sunnyvale

Ds=

name

mailing-addr

AliceMountain

View

Bob Sunnyvale

DT=nam

emailing-

addr

AliceSunnyval

e

BobSunnyval

e

name

mailing-addr

Alice alice@

Bob bob@ 0.5 0.4 0.1

SELECT mailing-addr

FROM T

Tuple Probability

(‘Sunnyvale’) 0.9

(‘Mountain View’) 0.5

(‘alice@’) 0.1

(‘bob@’) 0.1

Page 18: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma

By-Tuple Query Answering

pname

email-addr

home-addroffice-addr

Alice alice@Mountain

ViewSunnyvale

Bob bob@ Sunnyvale Sunnyvale

Ds=

name

mailing-addr

AliceMountain

View

Bob bob@

DT=nam

emailing-

addr

AliceSunnyval

e

Bob bob@

name

mailing-addr

Alice alice@

Bob bob@ 0.05 0.04 0.01

Tuple Probability

(‘Sunnyvale’) 0.94

(‘Mountain View’) 0.5

(‘alice@’) 0.1

(‘bob@’) 0.1

SELECT mailing-addr

FROM T

Page 19: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma

Complexity of Query Answering

By-table By-tupleData Complexity PTIME #P-

complete

Mapping Complexity

PTIME PTIME

Page 20: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma

By-Table Query Answering

SELECT mailing-addr FROM T

SELECT home-addr FROM S (0.5)

SELECT office-addr FROM S (0.4)

SELECT email-addr FROM S (0.1)

Theorem: Query answering in by-table semantics is in PTIME in the size of the data and the size of the mapping

Query Rewriting

Page 21: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma

By-Tuple Query Answering

pname email-addr home-addr office-addr

Alice alice@ Mountain View Sunnyvale

Bob bob@ Sunnyvale San Jose

Ds=

DT=

name

mailing-addr

Alice Sunnyvale

Bob Sunnyvale0.2

SELECT mailing-addr

FROM T

Theorem: Query answering in by-tuple semantics is #P-complete in the size of the data, and in PTIME in the size of the mapping

Target Enumeration name

mailing-addr

Alice Sunnyvale

Bob @bob0.04

name mailing-addr

Alice alice@

Bob [email protected]

name mailing-addr

AliceMountain

View

Bob @bob0.05

Page 22: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma

More on By-Tuple Query Answering The high complexity comes from computing

probabilitiesTheorem: Even computing the probability for

one possible answer is #P-hardTheorem: Computing all possible answers w/o

probabilities is in PTIME In general by-tuple query answering cannot be

done by query rewriting There are two subsets of queries that can be

answered in PTIME by query rewriting

Page 23: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma

PTIME for Queries with a Single P-Mapping Target

Answers

(‘Mountain View’)

(‘Sunnyvale’)

SELECT mailing-addr FROM T

SELECT home-addr FROM S

SELECT office-addr FROM S

SELECT email-addr FROM S

Answers

(‘Sunnyvale’)

(‘Sunnyvale’)

Answers

(‘alice@’)

(‘bob@’)

Pr=0.5 Pr=0.4 Pr=0.1

TupleProbabil

ity

(‘Sunnyvale’) 0.94

(‘Mountain View’)

0.5

(‘alice@’) 0.1

(‘bob@’) 0.1

=1-(1-0.4)(1-0.5-0.4)=1-(1-0.5)(1-0)

Page 24: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma

PTIME for Queries that Return Join Attributes

SELECT mailing-addr FROM T,VWHERE T.mailing-addr = V.hightech

SELECT mailing-addr FROM T

SELECT hightechFROM V

TupleProbabil

ity

(‘Sunnyvale’) 0.94

(‘Mountain View’)

0.5

(‘alice@’) 0.1

(‘bob@’) 0.1

=0.94*0.8

TupleProbabil

ity

(‘Sunnyvale’) 0.8

(‘Mountain View’)

0.8

TupleProbabil

ity

(‘Sunnyvale’) 0.752

(‘Mountain View’)

0.4

Page 25: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma

Query Rewriting Does Not Apply to Queries that Do NOT Return Join Attributes

SELECT ‘true’ FROM T,VWHERE T.mailing-addr = V.hightech

SELECT mailing-addr FROM T

SELECT hightechFROM V

TupleProbabil

ity

(‘Sunnyvale’) 0.94

(‘Mountain View’)

0.5

(‘alice@’) 0.1

(‘bob@’) 0.1

≠0.94*0.8+0.5*0.8

TupleProbabil

ity

(‘Sunnyvale’) 0.8

(‘Mountain View’)

0.8

TupleProbabilit

y

(‘true’) 0.864

Page 26: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma

Extensions to More Expressive Mappings The complexity results for query answering carry over to

three extensions to more expressive mappings Complex mappings

E.g., address street, city and state GLAV mappings

E.g., Paper Authorship Author Publication Conditional mappings:

E.g., if age>65, Pr(home-addr mailing-addr)=0.8 if age<=65, Pr(home-addr mailing-addr)=0.5

Page 27: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma

Creation: 1) Matching Attributes Current schema matching techniques can

compute similarity between source attributes and target attributes---Weighted attribute correspondences

Goal: find a p-mapping that is consistent w. a set of weighted correspondences

S=(num, pname, home-addr, office-addr)

T=(name, mailing-addr)

0.8 0.9 0.90.2

Page 28: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma

Creation: 1) Matching Attributes

S=(num, pname, home-addr, office-addr)

T=(name, mailing-addr)

0.8 0.9 0.90.2

(num, pname, home-addr, office-addr)

(name, mailing-addr)p=.4

(num, pname, home-addr, office-addr)

(name, mailing-addr)p=.4

(num, pname, home-addr, office-addr)

(name, mailing-addr)p=.1

(num, pname, home-addr, office-addr)

(name, mailing-addr)p=.1

Page 29: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma

Creation: 2) Normalization Theorem: Let C be a set of weighted

correspondences. There exists a p-mapping consistent w. C if and only if for every src/target attribute a, the sum of the weights of all correspondences that involve a is at most 1.

Approach: Normalization

S=(num, pname, home-addr, office-addr)

T=(name, mailing-addr)

0.8 0.9 0.90.2

Page 30: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma

Creation: 2) Normalization Theorem: Let C be a set of weighted

correspondences. There exists a p-mapping consistent w. C if and only if for every src/target attribute a, the sum of the weights of all correspondences that involve a is at most 1.

Approach: Normalization

S=(num, pname, home-addr, office-addr)

T=(name, mailing-addr)

0.8 0.5 0.50.2

Page 31: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma

Creation: 2) Normalization

S=(num, pname, home-addr, office-addr)

T=(name, mailing-addr)

0.8 0.5 0.50.2

(num, pname, home-addr, office-addr)

(name, mailing-addr)p=.4

(num, pname, home-addr, office-addr)

(name, mailing-addr)p=.4

(num, pname, home-addr, office-addr)

(name, mailing-addr)p=.1

(num, pname, home-addr, office-addr)

(name, mailing-addr)p=.1

Page 32: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma

Creation: 3) Generating p-Mappings However, different p-mappings can be consistent

w. the same set of weighted correspondences.Solution: Choose the p-mapping w. the maximum

entropy.Algorithm:

Given C, enumerate all possible one-to-one mappings with a subset of correspondences in C.

Solve the following optimization problem:

Page 33: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma

Creation: 3) Generating p-Mappings

S=(num, pname, home-addr, office-addr)

T=(name, mailing-addr)

0.8 0.5 0.50.2

(num, pname, home-addr, office-addr)

(name, mailing-addr)p=.4

(num, pname, home-addr, office-addr)

(name, mailing-addr)p=.4

(num, pname, home-addr, office-addr)

(name, mailing-addr)p=.1

(num, pname, home-addr, office-addr)

(name, mailing-addr)p=.1

Page 34: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma

Creation: 3) Generating p-Mappings

S=(num, pname, home-addr, office-addr)

T=(name, mailing-addr)

0.8 0.5 0.50.2

(num, pname, home-addr, office-addr)

(name, mailing-addr)p=.3

(num, pname, home-addr, office-addr)

(name, mailing-addr)p=.5

(num, pname, home-addr, office-addr)

(name, mailing-addr)p=.2

Page 35: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma

Creation: 3) Generating p-Mappings However, different p-mappings can be consistent

w. the same set of weighted correspondences. Solution: Choose the p-mapping w. the maximum

entropy. Algorithm:

Given C, enumerate all possible one-to-one mappings with a subset of correspondences in C.

Solve the following optimization problem:

Page 36: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma

Creation: 3) Generating p-Mappings

S=(num, pname, home-addr, office-addr)

T=(name, mailing-addr)

0.8 0.5 0.50.2

(num, pname, home-addr, office-addr)

(name, mailing-addr)p=.4

(num, pname, home-addr, office-addr)

(name, mailing-addr)p=.4

(num, pname, home-addr, office-addr)

(name, mailing-addr)p=.1

(num, pname, home-addr, office-addr)

(name, mailing-addr)p=.1

Entropy = 0.52 (v.s. 0.45)

Page 37: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma

Related Work Approximate schema mappings

Using top-k schema mappings can increase the recall of query answering w/o sacrificing precision much [Magnani&Montesi, 2007]

Generating top-k schema mappings by combining results by various matchers [Gal, 2007]

Page 38: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma

P-Med-Schema

Q

D1

D2

D3

D4

D5PD1

PD2

PD3

PD4

PD5

PM1

PM2

PM3

PM4

PM5

Q1..Qm

Roadmap

1. Probabilistic schema mappings• Query answering w.r.t. pMs [VLDB’07]• Automatic creation [Sigmod’08]

2. Probabilistic mediated schema[Sigmod’08]

• Expressiveness• Automatic creation

3. Probabilistic Integrity Constraints [WebDB’09] • Automatic creation• An application

Page 39: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma

An Example Mediated Schema

S1(name, email, phone-num, address) S2(person-name,phone,mailing-addr)

Med-S (name, email, phone, addr)

{name, person-name}

{phone-num, phone}

{address,mailing-addr}

{email}

A mediated schema can be considered as a clustering of important attributes in source schemas

Page 40: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma

Med1 ({name}, {phone, hPhone, oPhone}, {address, hAddr, oAddr})

Why Probabilistic Mediated Schema?

S1(name, hPhone, oPhone, hAddr, oAddr) S2(name,phone,address)

?

Q: SELECT name, hPhone, oPhone FROM Med

Page 41: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma

Why Probabilistic Mediated Schema?

S1(name, hPhone, oPhone, hAddr, oAddr) S2(name,phone,address)

Med1 ({name}, {phone, hPhone, oPhone}, {address, hAddr, oAddr})

Med2 ({name}, {phone, hPhone}, {oPhone}, {address, oAddr}, {hAddr})

Q: SELECT name, phone, address FROM Med

Page 42: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma

Med3 ({name}, {phone, hPhone}, {oPhone}, {address, hAddr}, {oAddr})

Why Probabilistic Mediated Schema?

S1(name, hPhone, oPhone, hAddr, oAddr) S2(name,phone,address)

Med1 ({name}, {phone, hPhone, oPhone}, {address, hAddr, oAddr})

Med2 ({name}, {phone, hPhone}, {oPhone}, {address, oAddr}, {hAddr})

Q: SELECT name, phone, address FROM Med

Page 43: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma

Med4 ({name}, {phone, oPhone}, {hPhone}, {address, oAddr}, {hAddr})

Med3 ({name}, {phone, hPhone}, {oPhone}, {address, hAddr}, {oAddr})

Why Probabilistic Mediated Schema?

S1(name, hPhone, oPhone, hAddr, oAddr) S2(name,phone,address)

Med1 ({name}, {phone, hPhone, oPhone}, {address, hAddr, oAddr})

Med2 ({name}, {phone, hPhone}, {oPhone}, {address, oAddr}, {hAddr})

Q: SELECT name, phone, address FROM Med

Page 44: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma

Med5 ({name}, {phone}, {hPhone}, {oPhone}, {address}, {hAddr}, {oAddr})

Med3 ({name}, {phone, hPhone}, {oPhone}, {address, hAddr}, {oAddr})

Why Probabilistic Mediated Schema?

S1(name, hPhone, oPhone, hAddr, oAddr) S2(name,phone,address)

Med1 ({name}, {phone, hPhone, oPhone}, {address, hAddr, oAddr})

Med2 ({name}, {phone, hPhone}, {oPhone}, {address, oAddr}, {hAddr})

Med4 ({name}, {phone, oPhone}, {hPhone}, {address, oAddr}, {hAddr})

Q: SELECT name, phone, address FROM Med

Page 45: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma

Med5 ({name}, {phone}, {hPhone}, {oPhone}, {address}, {hAddr}, {oAddr})

Med3 ({name}, {phone, hPhone}, {oPhone}, {address, hAddr}, {oAddr})

Why Probabilistic Mediated Schema?

S1(name, hPhone, oPhone, hAddr, oAddr) S2(name,phone,address)

Med1 ({name}, {phone, hPhone, oPhone}, {address, hAddr, oAddr})

Med2 ({name}, {phone, hPhone}, {oPhone}, {address, oAddr}, {hAddr})

Med4 ({name}, {phone, oPhone}, {hPhone}, {address, oAddr}, {hAddr})

Q: SELECT name, phone, address FROM Med

Page 46: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma

Probabilistic Mediated Schema

Med3 ({name}, {phone, hPhone}, {oPhone}, {address, hAddr}, {oAddr})

Our Solution

S1(name, hPhone, oPhone, hAddr, oAddr) S2(name,phone,address)

Med4 ({name}, {phone, oPhone}, {hPhone}, {address, oAddr}, {hAddr})

Pr=.5

Pr=.5

Page 47: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma

P-Mappings w.r.t. P-Med-Schema

PM1

Med3 (name, hPP, oP, hAA, oA)

S1(name, hP, oP, hA, oA)Pr=.64

Med3 (name, hPP, oP, hAA, oA)

S1(name, hP, oP, hA, oA)Pr=.16

Med3 (name, hPP, oP, hAA, oA)

S1(name, hP, oP, hA, oA)Pr=.16

Med3 (name, hPP, oP, hAA, oA)

S1(name, hP, oP, hA, oA)Pr=.04

PM2

Med4 (name, oPP, hP, oAA, hA)

S1(name, hP, oP, hA, oA)Pr=.64

Med4 (name, oPP, hP, oAA, hA)

S1(name, hP, oP, hA, oA)Pr=.16

Med4 (name, oPP, hP, oAA, hA)

S1(name, hP, oP, hA, oA)Pr=.16

Med4 (name, oPP, hP, oAA, hA)

S1(name, hP, oP, hA, oA)Pr=.04

Page 48: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma

Query Answeringw.r.t. P-Mappings & P-Med-Schema

SELECT name, phone, address

FROM Med-S

TupleProbabilit

y

(‘Alice’, ‘123-4567’, ‘123 A Ave.’)

0.34

(‘Alice’, ‘765-4321’, ‘456 B Ave.’)

0.34

(‘Alice’, ‘765-4321’, ‘123 A Ave.’)

0.16

(‘Alice’, ‘123-4567’, ‘456 B Ave.’)

0.16

name hPhone oPhone hAddr oAddr

Alice 123-4567 765-4321 123, A Ave. 456, B Ave

S1

Q

Answers

Page 49: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma

Expressive Power of P-Med-Schema v.s. P-Mapping

Theorem 1. For one-to-many mappings(p-med-schema + p-mappings)

= (mediated schema + p-mapping)

> (p-med-schema + mappings) Theorem 2. When restricted to one-to-one

mappings:(p-med-schema + p-mappings)

> (mediated schema + p-mapping)

Page 50: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma

Creation: 1) Creating a Single Med-Schema Input: Single-table source schemas S1, …, Sn

Output: Single-table mediated schema M

Algorithm1. Remove all infrequent attributes

2. Find similarity between every pair of attributes and construct a weighted graph

3. Remove edges with weight below τ (e.g., τ=.5)

4. Each connected com-ponent is a cluster

S2

S1

name address

email-address

pname home-address

1

.6

.6

.2

Page 51: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma

Creation: 1) Creating a Single Med-Schema Input: Single-table source schemas S1, …, Sn

Output: Single-table mediated schema M

Algorithm1. Remove all infrequent attributes

2. Find similarity between every pair of attributes and construct a weighted graph

3. Remove edges with weight below τ (e.g., τ=.5)

4. Each connected com-ponent is a cluster

S2

S1

name address

email-address

pname home-address

1

.6

.6

Page 52: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma

Creation: 2) Creating All Possible Med-Schemas Algorithm1. Remove all infrequent attributes2. Find similarity between every pair of attributes and

construct a weighted graph3. For each edge

• (weight ≥ τ+ϵ) retain• (weight < τ-ϵ) drop• (τ-ϵ ≤ weight < τ+ϵ)

uncertain edge(e.g., τ =.6, ϵ = .2)

Clustering for each comboof including/excluding uncertain edges

S2

S1

name address

email-address

pname home-address

1

.6

.6

.2

Page 53: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma

Creation: 2) Creating All Possible Med-Schemas Algorithm1. Remove all infrequent attributes2. Find similarity between every pair of attributes and

construct a weighted graph3. For each edge

• (weight ≥ τ+ϵ) retain• (weight < τ-ϵ) drop• (τ-ϵ ≤ weight < τ+ϵ)

uncertain edge(e.g., τ =.6, ϵ = .2)

Clustering for each comboof including/excluding uncertain edges

S2

S1

name address

email-address

pname home-address

1

.6

.6

Page 54: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma

Creation: 2) Creating All Possible Med-Schemas Algorithm1. Remove all infrequent attributes2. Find similarity between every pair of attributes and

construct a weighted graph3. For each edge

• (weight ≥ τ+ϵ) retain• (weight < τ-ϵ) drop• (τ-ϵ ≤ weight < τ+ϵ)

uncertain edge(e.g., τ =.6, ϵ = .2)

Clustering for each comboof including/excluding uncertain edges

S2

S1

name address

email-address

pname home-address

1

.6

.6

Page 55: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma

Creation: 2) Creating All Possible Med-Schemas Algorithm1. Remove all infrequent attributes2. Find similarity between every pair of attributes and

construct a weighted graph3. For each edge

• (weight ≥ τ+ϵ) retain• (weight < τ-ϵ) drop• (τ-ϵ ≤ weight < τ+ϵ)

uncertain edge(e.g., τ =.6, ϵ = .2)

Clustering for each comboof including/excluding uncertain edges

S2

S1

name address

email-address

pname home-address

1

.6

Page 56: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma

Creation: 2) Creating All Possible Med-Schemas Algorithm1. Remove all infrequent attributes2. Find similarity between every pair of attributes and

construct a weighted graph3. For each edge

• (weight ≥ τ+ϵ) retain• (weight < τ-ϵ) drop• (τ-ϵ ≤ weight < τ+ϵ)

uncertain edge(e.g., τ =.6, ϵ = .2)

Clustering for each comboof including/excluding uncertain edges

S2

S1

name address

email-address

pname home-address

1 .6

Page 57: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma

Creation: 2) Creating All Possible Med-Schemas Algorithm1. Remove all infrequent attributes2. Find similarity between every pair of attributes and

construct a weighted graph3. For each edge

• (weight ≥ τ+ϵ) retain• (weight < τ-ϵ) drop• (τ-ϵ ≤ weight < τ+ϵ)

uncertain edge(e.g., τ =.6, ϵ = .2)

Clustering for each comboof including/excluding uncertain edges

S2

S1

name address

email-address

pname home-address

1

Page 58: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma

Creation: 3) Computing Probabilities Mediated schema M and source S are consistent if no two

attributes of S are grouped into same cluster in M

S2

S1name address

email-address

pname home-addressS2

S1name address

email-address

pname home-address

S2

S1name address

email-address

pname home-addressS2

S1name address

email-address

pname home-address

Page 59: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma

Creation: 3) Computing Probabilities Assign probabilities to each M proportional to the

number of sources it is consistent with.

S2

S1name address

email-address

pname home-addressS2

S1name address

email-address

pname home-address

S2

S1name address

email-address

pname home-addressS2

S1name address

email-address

pname home-address

Pr=1/6 Pr=1/6

Pr=1/3 Pr=1/3

Page 60: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma

Real Example (650 bib sources)

Page 61: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma

Consolidation of Possible Med-Schemas Why?

Users expect to see a single mediated schema More efficient query processing

Consolidate possible med-schemas Two attrs are in the same cluster only if they are in the

same cluster in each possible med-schema I.e., Keep only certain edges and do clustering

Consolidate p-mappings Convert each p-mapping to a p-mapping to the

consolidated med-schema Assign probabilities accordingly

Page 62: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma

Probabilistic Mediated Schema

Med3 ({name}, {phone, hPhone}, {oPhone}, {address, hAddr}, {oAddr})

Consolidated Med-Schema

S1(name, hPhone, oPhone, hAddr, oAddr) S2(name,phone,address)

Med4 ({name}, {phone, oPhone}, {hPhone}, {address, oAddr}, {hAddr})

Pr=.5

Pr=.5

Med ({name}, {phone}, {hPhone}, {oPhone}, {address}, {hAddr}, {oAddr})

Page 63: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma

Consolidated P-Mapping

Med (name, P, hP, oP, A, hA, oA)

S1(name, hP, oP, hA, oA)Pr=.02

Med (name, P, hP, oP, A, hA, oA)

S1(name, hP, oP, hA, oA)Pr=.08

Med (name, P, hP, oP, A, hA, oA)

S1(name, hP, oP, hA, oA)Pr=.08

Med (name, P, hP, oP, A, hA, oA)

S1(name, hP, oP, hA, oA)Pr=.32

Med (name, P, hP, oP, A, hA, oA)

S1(name, hP, oP, hA, oA)Pr=.08

Med (name, P, hP, oP, A, hA, oA)

S1(name, hP, oP, hA, oA)Pr=.08

Med (name, P, hP, oP, A, hA, oA)

S1(name, hP, oP, hA, oA)Pr=.32

Med (name, P, hP, oP, A, hA, oA)

S1(name, hP, oP, hA, oA)Pr=.02

Higher pr for right

correlations

One-to-many mappings

Med ({name}, {phone}, {hPhone}, {oPhone}, {address}, {hAddr}, {oAddr})

S1(name, hPhone, oPhone, hAddr, oAddr) S2(name,phone,address)

Page 64: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma

Related Work Automatic generation of mediated

schemasFinding a generative model [He&Chang,

2003]

Page 65: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma

P-Med-Schema

Q

D1

D2

D3

D4

D5PD1

PD2

PD3

PD4

PD5

PM1

PM2

PM3

PM4

PM5

Q1..Qm

Roadmap

1. Probabilistic schema mappings• Query answering w.r.t. pMs [VLDB’07]• Automatic creation [Sigmod’08]

2. Probabilistic mediated schema [Sigmod’08]• Expressiveness• Automatic creation

3. Probabilistic Integrity Constraints [WebDB’09]

• Automatic creation• An application

Page 66: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma

Functional Dependency Functional dependency:

: a set of attributes : an attributeE.g., SSN name

ApplicationsData cleaningMediated-schema normalization

Page 67: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma

Normalizing the P-Med-Schema

Page 68: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma

Probabilistic Functional Dependency Uncertainty

Noisy data false negativeBias false positive Incorrect schema mappings Incorrect instance mappings

Probabilistic functional dependency:

p: likelihood of the FD being correct Apply probabilistic analysis.

Page 69: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma

Related Work Approximate functional dependencies:

generating approximate FDs in a statistical way from a single data sourceTANE(1999)CORDS(2004)

Page 70: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma

S2S5S3

S4

Architecture of the UDI System

UDI

S1

Schema

Constraint Discovery

Data

Attribute Matching

Attribute Clustering

Attr sim

Schema Mapping

P-med-schema Attr sim

Schema Consolidation

P-med-schema P-mappings

Normalized schema & P-mappings

Schema Normalization

Consolidated schemaPFDs

Query Rewriting

Query

Query Answering

Rewritten Query

Data

Answers

Mappings

Page 71: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma

Experimental Settings (I) Data: tables extracted from HTML tables on the web

Domain #Src Search Keywords

Movie 161 movie, year

Car 817 make, model

People 49job/title, organization/company/employer

Course 647course/class, instructor/teacher/lecturer, subject/department/title

Bib 649 author, title, year, journal/conference

Page 72: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma

Experimental Settings (II) Queries: 10 SQL queries for each domain

SELECT clause contains 1 to 4 attributes WHERE clause contains 0 to 3 predicates

[attr] =|<>|<|<=|>|>=|LIKE [const] Selectivity and likelihood of correct mapping vary

Performance measurePrecision Recall F-measure

#(correct answers) #(correct answers) 2·Precision·Recall #(returned answers) #(real answers) Precision + Recall

Golden standard: Generated by manually creating med-schema and mappings

Approximate standard: Generated by manually removing incorrect answers from those by UDI and by directly querying each data source

Page 73: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma

Quality of Query Answering

Domain Precision Recall F-measure

Golden Standard

People 1 .849 .918

Course 1 .852 .92

Approximate Golden Standard

Movie .95 1 .924

Car 1 .917 .957

People .958 .984 .971

Course 1 1 1

Bib 1 .955 .977

Page 74: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma

Comparison w. Other Integration Approaches

Keyword search is limited and obtained both low precision and low recall.

Querying the sources directly or considering only the mapping w. the highest probability obtained low recall.

UDI obtained highest F-measure in all domains.

Page 75: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma

Comparison w. Other Mediated-Schema Generation Methods

Using p-med-schema obtained highest F-measure in all domains.

Simply unioning all source attributes as the mediated schema obtained a low recallGenerating a single

mediated schema obtained similar F-measures.

Page 76: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma

P-Med-Schema v.s. Single Mediated Schema

People domain

Page 77: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma

System Setup Time

Car domain

Page 78: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma

Conclusions Contributions

Many applications require data integration w. uncertainty

Applying probabilistic models to do this in a principled way:

P-mapping, P-med-schema, P-functional-dependency Build the first completely self-configuring DI system

Future work Use the probabilistic model to improve schema

mappings over time Incorporate uncertainty on keyword reformulation and

dirty data Consider dependence between data sources

Page 79: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma

Acknowledgement Project participants

Alon Halevy (Google, UW) Anish Das Sarma (Stanford) Daisy Zhe Wang (Berkeley) Cong Yu (Yahoo!)

Other people for discussion Divesh Srivastava, Patrick Haffner (AT&T) Dan Suciu (Univ. of Washington) Jayant Madhavan, Omar Benjelloun, Shawn Jefferey,

etc. (Google)

Page 80: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma

Data Integration with Uncertainty

Xin (Luna) DongData Management Dept @ AT&T

Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma (Stanford), Daisy Zhe Wang

(Berkeley), Cong Yu (Yahoo!)