data integration with uncertainty xin (luna) dong data management dept @ at&t joint work w. mike...

Data Integration with Uncertainty

Xin (Luna) DongData Management Dept @ AT&T

Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma (Stanford), Daisy Zhe Wang

(Berkeley), Cong Yu (Yahoo!)

Many Applications Need to Manage Heterogeneous Data Sources

D1

D2

D3

D4

D5

Bib(title, authors, conf, year)

Author(aid, name)Paper(pid, title, year)AuthoredBy(aid,pid)

Traditional Data Integration Systems [Ullman, 1997]

D1

D2

D3

D4

D5

Mediated Schema

MappingSELECT P.title AS title, A.name AS author, NULL AS conf, P.year AS year, FROM Author AS A, Paper AS P, AuthoredBy AS BWHERE A.aid=B.aid AND P.pid=B.pid

Author(aid, name)Paper(pid, title, year)AuthoredBy(aid,pid)

Publication(title, author, conf, year)

Querying on Traditional Data Integration Systems

D1

D2

D3

D4

D5

Mediated Schema

QQQ

Q1

Q2Q4

Q

QQ

Q5

Q3

Uncertainty Can Occur at Three Levels in Data Integration Applications

D1

D2

D3

D4

D5

Mediated Schema

QQ

I. Data Level

II. Semantics Level

III. Query Level

Mediated SchemaP-Med-Schema

Q


D1

D2

D3

D4

D5

I. Probabilistic Model

PD1

PD2

PD3

PD4

PD5

PM1

PM2

PM3

PM4

PM5

Q1..Qm

II. Keyword QueryReformulationIII. Top-K Query

Answering

Our Approach

Many applications require only approximate answers and do not need full integration.

Pay-as-you-go data integration [Halevy, et al., 2005] Handle uncertainty and provide best-effort services

from the outset Evolve mappings between data sources over time

This talk: build a completely self-configuring data integration system by applying probabilistic models.

P-Med-Schema

Q

D1

D2

D3

D4

D5PD1

PD2

PD3

PD4

PD5

PM1

PM2

PM3

PM4

PM5

Q1..Qm

Technical Contributions

1. Probabilistic schema mappings• Query answering w.r.t. pMs

[VLDB’07]• Automatic creation [Sigmod’08]

2. Probabilistic mediated schema [Sigmod’08]• Expressiveness• Automatic creation

3. Probabilistic Integrity Constraints[WebDB’09]

• Automatic creation• An application

P-Med-Schema

Q

D1

D2

D3

D4

D5PD1

PD2

PD3

PD4

PD5

PM1

PM2

PM3

PM4

PM5

Q1..Qm

Roadmap

1. Probabilistic schema mappings• Query answering w.r.t. pMs

[VLDB’07]• Automatic creation [Sigmod’08]


3. Probabilistic Integrity Constraints [WebDB’09] • Automatic creation• An application

Example Probabilistic Mappings

Mediated Schema

T(name, email, mailing-addr, home-addr, office-addr)

Q: SELECT mailing-addr FROM T

Q1: SELECT current-addr FROM S

Q2: SELECT permanent-addr FROM S

Q3: SELECT email-addr FROM S

PName Email-addr Current-addr Permanent-addr

Alice alice@ Mountain View Sunnyvale

Bob bob@ Sunnyvale Sunnyvale

S

T(name, email, mailing-addr, home-addr, office-addr) 0.5

S(pname, email-addr, current-addr, permanent-addr)

m1



m2



m3

Tuple Prob

(‘Sunnyvale’)

(‘Mountain View’)

(‘alice@’)

(‘bob@’)

0.9

0.5

0.1

0.1



Mediated Schema

T(name, email, mailing-addr, home-addr, office-addr)

Q: SELECT mailing-addr FROM T



Q3: SELECT email-addr FROM S

PName Email-addr Current-addr Permanent-addr



S



m1



m2



m3

Tuple Prob

(‘Sunnyvale’)


(‘alice@’)

(‘bob@’)

0.9

0.5

0.1

0.1

Top-k Query Answering w.r.t. Probabilistic Mappings

Tuple Prob

(‘Sunnyvale’)(‘Mountain View’)

(‘alice@’)

(‘bob@’)

0.90.5

0.1

0.1

Schema Mapping

S=(pname, email-addr, home-addr, office-addr)

T=(name, mailing-addr)

Mappings: one-to-one schema matching Queries: select-project-join queries

Probabilistic Mapping

S=(pname, email-addr, home-addr, office-addr)


Possible MappingProbabil

ity{(pname,name),(home-addr, mailing-addr)}

0.5

{(pname,name),(office-addr, mailing-addr)}

0.4

{(pname,name),(email-addr, mailing-addr)}

0.1

By-Table v.s. By-Tuple Semantics


pname

email-addr

home-addroffice-addr

Alice alice@Mountain

ViewSunnyvale


Ds=

name

mailing-addr

AliceMountain

View

Bob Sunnyvale

DT=nam

emailing-

addr

Alice Sunnyvale

Bob Sunnyvale

name

mailing-addr

Alice alice@

Bob bob@ Pr(m1)=0.5 Pr(m2)=0.4 Pr(m3)=0.1


pname

email-addr



ViewSunnyvale


Ds=

name

mailing-addr

AliceMountain

View

Bob bob@

DT=nam

emailing-

addr

AliceSunnyval

e

Bob bob@

name

mailing-addr

Alice alice@

Bob bob@ Pr(<m1,m3>)=0.05 Pr(<m2,m3>)=0.04 Pr(<m3,m3>)=0.01

…

By-Table Query Answering

pname

email-addr



ViewSunnyvale


Ds=

name

mailing-addr

AliceMountain

View

Bob Sunnyvale

DT=nam

emailing-

addr

AliceSunnyval

e

BobSunnyval

e

name

mailing-addr

Alice alice@

Bob bob@ 0.5 0.4 0.1

SELECT mailing-addr

FROM T

Tuple Probability

(‘Sunnyvale’) 0.9

(‘Mountain View’) 0.5

(‘alice@’) 0.1

(‘bob@’) 0.1

By-Tuple Query Answering

pname

email-addr



ViewSunnyvale


Ds=

name

mailing-addr

AliceMountain

View

Bob bob@

DT=nam

emailing-

addr

AliceSunnyval

e

Bob bob@

name

mailing-addr

Alice alice@

Bob bob@ 0.05 0.04 0.01

Tuple Probability


(‘Mountain View’) 0.5

(‘alice@’) 0.1

(‘bob@’) 0.1

…

SELECT mailing-addr

FROM T

Complexity of Query Answering

By-table By-tupleData Complexity PTIME #P-

complete

Mapping Complexity

PTIME PTIME

By-Table Query Answering

SELECT mailing-addr FROM T

SELECT home-addr FROM S (0.5)

SELECT office-addr FROM S (0.4)

SELECT email-addr FROM S (0.1)

Theorem: Query answering in by-table semantics is in PTIME in the size of the data and the size of the mapping

Query Rewriting

By-Tuple Query Answering

pname email-addr home-addr office-addr


Bob bob@ Sunnyvale San Jose

Ds=

DT=

name

mailing-addr

Alice Sunnyvale

Bob Sunnyvale0.2

…

SELECT mailing-addr

FROM T

Theorem: Query answering in by-tuple semantics is #P-complete in the size of the data, and in PTIME in the size of the mapping

Target Enumeration name

mailing-addr

Alice Sunnyvale

Bob @bob0.04

name mailing-addr

Alice alice@

Bob [email protected]

name mailing-addr

AliceMountain

View

Bob @bob0.05

More on By-Tuple Query Answering The high complexity comes from computing

probabilitiesTheorem: Even computing the probability for

one possible answer is #P-hardTheorem: Computing all possible answers w/o

probabilities is in PTIME In general by-tuple query answering cannot be

done by query rewriting There are two subsets of queries that can be

answered in PTIME by query rewriting

PTIME for Queries with a Single P-Mapping Target

Answers


(‘Sunnyvale’)


SELECT home-addr FROM S

SELECT office-addr FROM S

SELECT email-addr FROM S

Answers

(‘Sunnyvale’)

(‘Sunnyvale’)

Answers

(‘alice@’)

(‘bob@’)

Pr=0.5 Pr=0.4 Pr=0.1

TupleProbabil

ity



0.5

(‘alice@’) 0.1

(‘bob@’) 0.1

=1-(1-0.4)(1-0.5-0.4)=1-(1-0.5)(1-0)

PTIME for Queries that Return Join Attributes

SELECT mailing-addr FROM T,VWHERE T.mailing-addr = V.hightech


SELECT hightechFROM V

TupleProbabil

ity



0.5

(‘alice@’) 0.1

(‘bob@’) 0.1

=0.94*0.8

TupleProbabil

ity



0.8

TupleProbabil

ity



0.4

Query Rewriting Does Not Apply to Queries that Do NOT Return Join Attributes

SELECT ‘true’ FROM T,VWHERE T.mailing-addr = V.hightech


SELECT hightechFROM V

TupleProbabil

ity



0.5

(‘alice@’) 0.1

(‘bob@’) 0.1

≠0.94*0.8+0.5*0.8

TupleProbabil

ity



0.8

TupleProbabilit

y

(‘true’) 0.864

Extensions to More Expressive Mappings The complexity results for query answering carry over to

three extensions to more expressive mappings Complex mappings

E.g., address street, city and state GLAV mappings

E.g., Paper Authorship Author Publication Conditional mappings:

E.g., if age>65, Pr(home-addr mailing-addr)=0.8 if age<=65, Pr(home-addr mailing-addr)=0.5

Creation: 1) Matching Attributes Current schema matching techniques can

compute similarity between source attributes and target attributes---Weighted attribute correspondences

Goal: find a p-mapping that is consistent w. a set of weighted correspondences

S=(num, pname, home-addr, office-addr)


0.8 0.9 0.90.2

Creation: 1) Matching Attributes



0.8 0.9 0.90.2

(num, pname, home-addr, office-addr)

(name, mailing-addr)p=.4







Creation: 2) Normalization Theorem: Let C be a set of weighted

correspondences. There exists a p-mapping consistent w. C if and only if for every src/target attribute a, the sum of the weights of all correspondences that involve a is at most 1.

Approach: Normalization



0.8 0.9 0.90.2

Creation: 2) Normalization Theorem: Let C be a set of weighted

correspondences. There exists a p-mapping consistent w. C if and only if for every src/target attribute a, the sum of the weights of all correspondences that involve a is at most 1.

Approach: Normalization



0.8 0.5 0.50.2

Creation: 2) Normalization



0.8 0.5 0.50.2









Creation: 3) Generating p-Mappings However, different p-mappings can be consistent

w. the same set of weighted correspondences.Solution: Choose the p-mapping w. the maximum

entropy.Algorithm:

Given C, enumerate all possible one-to-one mappings with a subset of correspondences in C.

Solve the following optimization problem:

Creation: 3) Generating p-Mappings



0.8 0.5 0.50.2












0.8 0.5 0.50.2







Creation: 3) Generating p-Mappings However, different p-mappings can be consistent

w. the same set of weighted correspondences. Solution: Choose the p-mapping w. the maximum

entropy. Algorithm:

Given C, enumerate all possible one-to-one mappings with a subset of correspondences in C.

Solve the following optimization problem:




0.8 0.5 0.50.2









Entropy = 0.52 (v.s. 0.45)

Related Work Approximate schema mappings

Using top-k schema mappings can increase the recall of query answering w/o sacrificing precision much [Magnani&Montesi, 2007]

Generating top-k schema mappings by combining results by various matchers [Gal, 2007]

P-Med-Schema

Q

D1

D2

D3

D4

D5PD1

PD2

PD3

PD4

PD5

PM1

PM2

PM3

PM4

PM5

Q1..Qm

Roadmap

1. Probabilistic schema mappings• Query answering w.r.t. pMs [VLDB’07]• Automatic creation [Sigmod’08]

2. Probabilistic mediated schema[Sigmod’08]

• Expressiveness• Automatic creation

3. Probabilistic Integrity Constraints [WebDB’09] • Automatic creation• An application

An Example Mediated Schema

S1(name, email, phone-num, address) S2(person-name,phone,mailing-addr)

Med-S (name, email, phone, addr)

{name, person-name}

{phone-num, phone}

{address,mailing-addr}

{email}

A mediated schema can be considered as a clustering of important attributes in source schemas

Med1 ({name}, {phone, hPhone, oPhone}, {address, hAddr, oAddr})

Why Probabilistic Mediated Schema?

S1(name, hPhone, oPhone, hAddr, oAddr) S2(name,phone,address)

?

Q: SELECT name, hPhone, oPhone FROM Med




Med2 ({name}, {phone, hPhone}, {oPhone}, {address, oAddr}, {hAddr})

Q: SELECT name, phone, address FROM Med

Med3 ({name}, {phone, hPhone}, {oPhone}, {address, hAddr}, {oAddr})






Med4 ({name}, {phone, oPhone}, {hPhone}, {address, oAddr}, {hAddr})







Med5 ({name}, {phone}, {hPhone}, {oPhone}, {address}, {hAddr}, {oAddr})








Probabilistic Mediated Schema


Our Solution



Pr=.5

Pr=.5

P-Mappings w.r.t. P-Med-Schema

PM1

Med3 (name, hPP, oP, hAA, oA)

S1(name, hP, oP, hA, oA)Pr=.64







PM2

Med4 (name, oPP, hP, oAA, hA)








Query Answeringw.r.t. P-Mappings & P-Med-Schema

SELECT name, phone, address

FROM Med-S

TupleProbabilit

y

(‘Alice’, ‘123-4567’, ‘123 A Ave.’)

0.34

(‘Alice’, ‘765-4321’, ‘456 B Ave.’)

0.34

(‘Alice’, ‘765-4321’, ‘123 A Ave.’)

0.16

(‘Alice’, ‘123-4567’, ‘456 B Ave.’)

0.16

name hPhone oPhone hAddr oAddr

Alice 123-4567 765-4321 123, A Ave. 456, B Ave

S1

Q

Answers

Expressive Power of P-Med-Schema v.s. P-Mapping

Theorem 1. For one-to-many mappings(p-med-schema + p-mappings)

= (mediated schema + p-mapping)

> (p-med-schema + mappings) Theorem 2. When restricted to one-to-one

mappings:(p-med-schema + p-mappings)

> (mediated schema + p-mapping)

Creation: 1) Creating a Single Med-Schema Input: Single-table source schemas S1, …, Sn

Output: Single-table mediated schema M

Algorithm1. Remove all infrequent attributes

2. Find similarity between every pair of attributes and construct a weighted graph

3. Remove edges with weight below τ (e.g., τ=.5)

4. Each connected com-ponent is a cluster

S2

S1

name address

email-address

pname home-address

1

.6

.6

.2

Creation: 1) Creating a Single Med-Schema Input: Single-table source schemas S1, …, Sn

Output: Single-table mediated schema M

Algorithm1. Remove all infrequent attributes

2. Find similarity between every pair of attributes and construct a weighted graph

3. Remove edges with weight below τ (e.g., τ=.5)

4. Each connected com-ponent is a cluster

S2

S1

name address

email-address

pname home-address

1

.6

.6

Creation: 2) Creating All Possible Med-Schemas Algorithm1. Remove all infrequent attributes2. Find similarity between every pair of attributes and

construct a weighted graph3. For each edge

• (weight ≥ τ+ϵ) retain• (weight < τ-ϵ) drop• (τ-ϵ ≤ weight < τ+ϵ)

uncertain edge(e.g., τ =.6, ϵ = .2)

Clustering for each comboof including/excluding uncertain edges

S2

S1

name address

email-address

pname home-address

1

.6

.6

.2






S2

S1

name address

email-address

pname home-address

1

.6

.6






S2

S1

name address

email-address

pname home-address

1

.6






S2

S1

name address

email-address

pname home-address

1 .6






S2

S1

name address

email-address

pname home-address

1

Creation: 3) Computing Probabilities Mediated schema M and source S are consistent if no two

attributes of S are grouped into same cluster in M

S2

S1name address

email-address

pname home-addressS2

S1name address

email-address

pname home-address

S2

S1name address

email-address


S1name address

email-address

pname home-address

Creation: 3) Computing Probabilities Assign probabilities to each M proportional to the

number of sources it is consistent with.

S2

S1name address

email-address


S1name address

email-address

pname home-address

S2

S1name address

email-address


S1name address

email-address

pname home-address

Pr=1/6 Pr=1/6

Pr=1/3 Pr=1/3

Real Example (650 bib sources)

Consolidation of Possible Med-Schemas Why?

Users expect to see a single mediated schema More efficient query processing

Consolidate possible med-schemas Two attrs are in the same cluster only if they are in the

same cluster in each possible med-schema I.e., Keep only certain edges and do clustering

Consolidate p-mappings Convert each p-mapping to a p-mapping to the

consolidated med-schema Assign probabilities accordingly

Probabilistic Mediated Schema


Consolidated Med-Schema



Pr=.5

Pr=.5

Med ({name}, {phone}, {hPhone}, {oPhone}, {address}, {hAddr}, {oAddr})

Consolidated P-Mapping

Med (name, P, hP, oP, A, hA, oA)
















Higher pr for right

correlations

One-to-many mappings

Med ({name}, {phone}, {hPhone}, {oPhone}, {address}, {hAddr}, {oAddr})


Related Work Automatic generation of mediated

schemasFinding a generative model [He&Chang,

2003]

P-Med-Schema

Q

D1

D2

D3

D4

D5PD1

PD2

PD3

PD4

PD5

PM1

PM2

PM3

PM4

PM5

Q1..Qm

Roadmap

1. Probabilistic schema mappings• Query answering w.r.t. pMs [VLDB’07]• Automatic creation [Sigmod’08]


3. Probabilistic Integrity Constraints [WebDB’09]

• Automatic creation• An application

Functional Dependency Functional dependency:

: a set of attributes : an attributeE.g., SSN name

ApplicationsData cleaningMediated-schema normalization

Normalizing the P-Med-Schema

Probabilistic Functional Dependency Uncertainty

Noisy data false negativeBias false positive Incorrect schema mappings Incorrect instance mappings

Probabilistic functional dependency:

p: likelihood of the FD being correct Apply probabilistic analysis.

Related Work Approximate functional dependencies:

generating approximate FDs in a statistical way from a single data sourceTANE(1999)CORDS(2004)

S2S5S3

S4

Architecture of the UDI System

UDI

S1

Schema

Constraint Discovery

Data

Attribute Matching

Attribute Clustering

Attr sim

Schema Mapping

P-med-schema Attr sim

Schema Consolidation

P-med-schema P-mappings

Normalized schema & P-mappings

Schema Normalization

Consolidated schemaPFDs

Query Rewriting

Query

Query Answering

Rewritten Query

Data

Answers

Mappings

Experimental Settings (I) Data: tables extracted from HTML tables on the web

Domain #Src Search Keywords

Movie 161 movie, year

Car 817 make, model

People 49job/title, organization/company/employer

Course 647course/class, instructor/teacher/lecturer, subject/department/title

Bib 649 author, title, year, journal/conference

Experimental Settings (II) Queries: 10 SQL queries for each domain

SELECT clause contains 1 to 4 attributes WHERE clause contains 0 to 3 predicates

[attr] =|<>|<|<=|>|>=|LIKE [const] Selectivity and likelihood of correct mapping vary

Performance measurePrecision Recall F-measure

#(correct answers) #(correct answers) 2·Precision·Recall #(returned answers) #(real answers) Precision + Recall

Golden standard: Generated by manually creating med-schema and mappings

Approximate standard: Generated by manually removing incorrect answers from those by UDI and by directly querying each data source

Quality of Query Answering

Domain Precision Recall F-measure

Golden Standard

People 1 .849 .918

Course 1 .852 .92

Approximate Golden Standard

Movie .95 1 .924

Car 1 .917 .957

People .958 .984 .971

Course 1 1 1

Bib 1 .955 .977

Comparison w. Other Integration Approaches

Keyword search is limited and obtained both low precision and low recall.

Querying the sources directly or considering only the mapping w. the highest probability obtained low recall.

UDI obtained highest F-measure in all domains.

Comparison w. Other Mediated-Schema Generation Methods

Using p-med-schema obtained highest F-measure in all domains.

Simply unioning all source attributes as the mediated schema obtained a low recallGenerating a single

mediated schema obtained similar F-measures.

P-Med-Schema v.s. Single Mediated Schema

People domain

System Setup Time

Car domain

Conclusions Contributions

Many applications require data integration w. uncertainty

Applying probabilistic models to do this in a principled way:

P-mapping, P-med-schema, P-functional-dependency Build the first completely self-configuring DI system

Future work Use the probabilistic model to improve schema

mappings over time Incorporate uncertainty on keyword reformulation and

dirty data Consider dependence between data sources

Acknowledgement Project participants

Alon Halevy (Google, UW) Anish Das Sarma (Stanford) Daisy Zhe Wang (Berkeley) Cong Yu (Yahoo!)

Other people for discussion Divesh Srivastava, Patrick Haffner (AT&T) Dan Suciu (Univ. of Washington) Jayant Madhavan, Omar Benjelloun, Shawn Jefferey,

etc. (Google)


Xin (Luna) DongData Management Dept @ AT&T

Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma (Stanford), Daisy Zhe Wang

(Berkeley), Cong Yu (Yahoo!)

data integration with uncertainty xin (luna) dong data management dept @ at&t joint work w. mike...

Documents