data integration with uncertainty xin (luna) dong data management dept @ at&t joint work w. mike...
TRANSCRIPT
Data Integration with Uncertainty
Xin (Luna) DongData Management Dept @ AT&T
Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma (Stanford), Daisy Zhe Wang
(Berkeley), Cong Yu (Yahoo!)
Many Applications Need to Manage Heterogeneous Data Sources
D1
D2
D3
D4
D5
Bib(title, authors, conf, year)
Author(aid, name)Paper(pid, title, year)AuthoredBy(aid,pid)
Traditional Data Integration Systems [Ullman, 1997]
D1
D2
D3
D4
D5
Mediated Schema
MappingSELECT P.title AS title, A.name AS author, NULL AS conf, P.year AS year, FROM Author AS A, Paper AS P, AuthoredBy AS BWHERE A.aid=B.aid AND P.pid=B.pid
Author(aid, name)Paper(pid, title, year)AuthoredBy(aid,pid)
Publication(title, author, conf, year)
Querying on Traditional Data Integration Systems
D1
D2
D3
D4
D5
Mediated Schema
QQQ
Q1
Q2Q4
Q
Q5
Q3
Uncertainty Can Occur at Three Levels in Data Integration Applications
D1
D2
D3
D4
D5
Mediated Schema
I. Data Level
II. Semantics Level
III. Query Level
Mediated SchemaP-Med-Schema
Q
Data Integration with Uncertainty
D1
D2
D3
D4
D5
I. Probabilistic Model
PD1
PD2
PD3
PD4
PD5
PM1
PM2
PM3
PM4
PM5
Q1..Qm
II. Keyword QueryReformulationIII. Top-K Query
Answering
Our Approach
Many applications require only approximate answers and do not need full integration.
Pay-as-you-go data integration [Halevy, et al., 2005] Handle uncertainty and provide best-effort services
from the outset Evolve mappings between data sources over time
This talk: build a completely self-configuring data integration system by applying probabilistic models.
P-Med-Schema
Q
D1
D2
D3
D4
D5PD1
PD2
PD3
PD4
PD5
PM1
PM2
PM3
PM4
PM5
Q1..Qm
Technical Contributions
1. Probabilistic schema mappings• Query answering w.r.t. pMs
[VLDB’07]• Automatic creation [Sigmod’08]
2. Probabilistic mediated schema [Sigmod’08]• Expressiveness• Automatic creation
3. Probabilistic Integrity Constraints[WebDB’09]
• Automatic creation• An application
P-Med-Schema
Q
D1
D2
D3
D4
D5PD1
PD2
PD3
PD4
PD5
PM1
PM2
PM3
PM4
PM5
Q1..Qm
Roadmap
1. Probabilistic schema mappings• Query answering w.r.t. pMs
[VLDB’07]• Automatic creation [Sigmod’08]
2. Probabilistic mediated schema [Sigmod’08]• Expressiveness• Automatic creation
3. Probabilistic Integrity Constraints [WebDB’09] • Automatic creation• An application
Example Probabilistic Mappings
Mediated Schema
T(name, email, mailing-addr, home-addr, office-addr)
Q: SELECT mailing-addr FROM T
Q1: SELECT current-addr FROM S
Q2: SELECT permanent-addr FROM S
Q3: SELECT email-addr FROM S
PName Email-addr Current-addr Permanent-addr
Alice alice@ Mountain View Sunnyvale
Bob bob@ Sunnyvale Sunnyvale
S
T(name, email, mailing-addr, home-addr, office-addr) 0.5
S(pname, email-addr, current-addr, permanent-addr)
m1
T(name, email, mailing-addr, home-addr, office-addr) 0.4
S(pname, email-addr, current-addr, permanent-addr)
m2
T(name, email, mailing-addr, home-addr, office-addr) 0.1
S(pname, email-addr, current-addr, permanent-addr)
m3
Tuple Prob
(‘Sunnyvale’)
(‘Mountain View’)
(‘alice@’)
(‘bob@’)
0.9
0.5
0.1
0.1
Q1: SELECT current-addr FROM S
Q2: SELECT permanent-addr FROM S
Mediated Schema
T(name, email, mailing-addr, home-addr, office-addr)
Q: SELECT mailing-addr FROM T
Q1: SELECT current-addr FROM S
Q2: SELECT permanent-addr FROM S
Q3: SELECT email-addr FROM S
PName Email-addr Current-addr Permanent-addr
Alice alice@ Mountain View Sunnyvale
Bob bob@ Sunnyvale Sunnyvale
S
T(name, email, mailing-addr, home-addr, office-addr) 0.5
S(pname, email-addr, current-addr, permanent-addr)
m1
T(name, email, mailing-addr, home-addr, office-addr) 0.4
S(pname, email-addr, current-addr, permanent-addr)
m2
T(name, email, mailing-addr, home-addr, office-addr) 0.1
S(pname, email-addr, current-addr, permanent-addr)
m3
Tuple Prob
(‘Sunnyvale’)
(‘Mountain View’)
(‘alice@’)
(‘bob@’)
0.9
0.5
0.1
0.1
Top-k Query Answering w.r.t. Probabilistic Mappings
Tuple Prob
(‘Sunnyvale’)(‘Mountain View’)
(‘alice@’)
(‘bob@’)
0.90.5
0.1
0.1
Schema Mapping
S=(pname, email-addr, home-addr, office-addr)
T=(name, mailing-addr)
Mappings: one-to-one schema matching Queries: select-project-join queries
Probabilistic Mapping
S=(pname, email-addr, home-addr, office-addr)
T=(name, mailing-addr)
Possible MappingProbabil
ity{(pname,name),(home-addr, mailing-addr)}
0.5
{(pname,name),(office-addr, mailing-addr)}
0.4
{(pname,name),(email-addr, mailing-addr)}
0.1
By-Table v.s. By-Tuple Semantics
By-Table v.s. By-Tuple Semantics
pname
email-addr
home-addroffice-addr
Alice alice@Mountain
ViewSunnyvale
Bob bob@ Sunnyvale Sunnyvale
Ds=
name
mailing-addr
AliceMountain
View
Bob Sunnyvale
DT=nam
emailing-
addr
Alice Sunnyvale
Bob Sunnyvale
name
mailing-addr
Alice alice@
Bob bob@ Pr(m1)=0.5 Pr(m2)=0.4 Pr(m3)=0.1
By-Table v.s. By-Tuple Semantics
pname
email-addr
home-addroffice-addr
Alice alice@Mountain
ViewSunnyvale
Bob bob@ Sunnyvale Sunnyvale
Ds=
name
mailing-addr
AliceMountain
View
Bob bob@
DT=nam
emailing-
addr
AliceSunnyval
e
Bob bob@
name
mailing-addr
Alice alice@
Bob bob@ Pr(<m1,m3>)=0.05 Pr(<m2,m3>)=0.04 Pr(<m3,m3>)=0.01
…
By-Table Query Answering
pname
email-addr
home-addroffice-addr
Alice alice@Mountain
ViewSunnyvale
Bob bob@ Sunnyvale Sunnyvale
Ds=
name
mailing-addr
AliceMountain
View
Bob Sunnyvale
DT=nam
emailing-
addr
AliceSunnyval
e
BobSunnyval
e
name
mailing-addr
Alice alice@
Bob bob@ 0.5 0.4 0.1
SELECT mailing-addr
FROM T
Tuple Probability
(‘Sunnyvale’) 0.9
(‘Mountain View’) 0.5
(‘alice@’) 0.1
(‘bob@’) 0.1
By-Tuple Query Answering
pname
email-addr
home-addroffice-addr
Alice alice@Mountain
ViewSunnyvale
Bob bob@ Sunnyvale Sunnyvale
Ds=
name
mailing-addr
AliceMountain
View
Bob bob@
DT=nam
emailing-
addr
AliceSunnyval
e
Bob bob@
name
mailing-addr
Alice alice@
Bob bob@ 0.05 0.04 0.01
Tuple Probability
(‘Sunnyvale’) 0.94
(‘Mountain View’) 0.5
(‘alice@’) 0.1
(‘bob@’) 0.1
…
SELECT mailing-addr
FROM T
Complexity of Query Answering
By-table By-tupleData Complexity PTIME #P-
complete
Mapping Complexity
PTIME PTIME
By-Table Query Answering
SELECT mailing-addr FROM T
SELECT home-addr FROM S (0.5)
SELECT office-addr FROM S (0.4)
SELECT email-addr FROM S (0.1)
Theorem: Query answering in by-table semantics is in PTIME in the size of the data and the size of the mapping
Query Rewriting
By-Tuple Query Answering
pname email-addr home-addr office-addr
Alice alice@ Mountain View Sunnyvale
Bob bob@ Sunnyvale San Jose
Ds=
DT=
name
mailing-addr
Alice Sunnyvale
Bob Sunnyvale0.2
…
SELECT mailing-addr
FROM T
Theorem: Query answering in by-tuple semantics is #P-complete in the size of the data, and in PTIME in the size of the mapping
Target Enumeration name
mailing-addr
Alice Sunnyvale
Bob @bob0.04
name mailing-addr
Alice alice@
name mailing-addr
AliceMountain
View
Bob @bob0.05
More on By-Tuple Query Answering The high complexity comes from computing
probabilitiesTheorem: Even computing the probability for
one possible answer is #P-hardTheorem: Computing all possible answers w/o
probabilities is in PTIME In general by-tuple query answering cannot be
done by query rewriting There are two subsets of queries that can be
answered in PTIME by query rewriting
PTIME for Queries with a Single P-Mapping Target
Answers
(‘Mountain View’)
(‘Sunnyvale’)
SELECT mailing-addr FROM T
SELECT home-addr FROM S
SELECT office-addr FROM S
SELECT email-addr FROM S
Answers
(‘Sunnyvale’)
(‘Sunnyvale’)
Answers
(‘alice@’)
(‘bob@’)
Pr=0.5 Pr=0.4 Pr=0.1
TupleProbabil
ity
(‘Sunnyvale’) 0.94
(‘Mountain View’)
0.5
(‘alice@’) 0.1
(‘bob@’) 0.1
=1-(1-0.4)(1-0.5-0.4)=1-(1-0.5)(1-0)
PTIME for Queries that Return Join Attributes
SELECT mailing-addr FROM T,VWHERE T.mailing-addr = V.hightech
SELECT mailing-addr FROM T
SELECT hightechFROM V
TupleProbabil
ity
(‘Sunnyvale’) 0.94
(‘Mountain View’)
0.5
(‘alice@’) 0.1
(‘bob@’) 0.1
=0.94*0.8
TupleProbabil
ity
(‘Sunnyvale’) 0.8
(‘Mountain View’)
0.8
TupleProbabil
ity
(‘Sunnyvale’) 0.752
(‘Mountain View’)
0.4
Query Rewriting Does Not Apply to Queries that Do NOT Return Join Attributes
SELECT ‘true’ FROM T,VWHERE T.mailing-addr = V.hightech
SELECT mailing-addr FROM T
SELECT hightechFROM V
TupleProbabil
ity
(‘Sunnyvale’) 0.94
(‘Mountain View’)
0.5
(‘alice@’) 0.1
(‘bob@’) 0.1
≠0.94*0.8+0.5*0.8
TupleProbabil
ity
(‘Sunnyvale’) 0.8
(‘Mountain View’)
0.8
TupleProbabilit
y
(‘true’) 0.864
Extensions to More Expressive Mappings The complexity results for query answering carry over to
three extensions to more expressive mappings Complex mappings
E.g., address street, city and state GLAV mappings
E.g., Paper Authorship Author Publication Conditional mappings:
E.g., if age>65, Pr(home-addr mailing-addr)=0.8 if age<=65, Pr(home-addr mailing-addr)=0.5
Creation: 1) Matching Attributes Current schema matching techniques can
compute similarity between source attributes and target attributes---Weighted attribute correspondences
Goal: find a p-mapping that is consistent w. a set of weighted correspondences
S=(num, pname, home-addr, office-addr)
T=(name, mailing-addr)
0.8 0.9 0.90.2
Creation: 1) Matching Attributes
S=(num, pname, home-addr, office-addr)
T=(name, mailing-addr)
0.8 0.9 0.90.2
(num, pname, home-addr, office-addr)
(name, mailing-addr)p=.4
(num, pname, home-addr, office-addr)
(name, mailing-addr)p=.4
(num, pname, home-addr, office-addr)
(name, mailing-addr)p=.1
(num, pname, home-addr, office-addr)
(name, mailing-addr)p=.1
Creation: 2) Normalization Theorem: Let C be a set of weighted
correspondences. There exists a p-mapping consistent w. C if and only if for every src/target attribute a, the sum of the weights of all correspondences that involve a is at most 1.
Approach: Normalization
S=(num, pname, home-addr, office-addr)
T=(name, mailing-addr)
0.8 0.9 0.90.2
Creation: 2) Normalization Theorem: Let C be a set of weighted
correspondences. There exists a p-mapping consistent w. C if and only if for every src/target attribute a, the sum of the weights of all correspondences that involve a is at most 1.
Approach: Normalization
S=(num, pname, home-addr, office-addr)
T=(name, mailing-addr)
0.8 0.5 0.50.2
Creation: 2) Normalization
S=(num, pname, home-addr, office-addr)
T=(name, mailing-addr)
0.8 0.5 0.50.2
(num, pname, home-addr, office-addr)
(name, mailing-addr)p=.4
(num, pname, home-addr, office-addr)
(name, mailing-addr)p=.4
(num, pname, home-addr, office-addr)
(name, mailing-addr)p=.1
(num, pname, home-addr, office-addr)
(name, mailing-addr)p=.1
Creation: 3) Generating p-Mappings However, different p-mappings can be consistent
w. the same set of weighted correspondences.Solution: Choose the p-mapping w. the maximum
entropy.Algorithm:
Given C, enumerate all possible one-to-one mappings with a subset of correspondences in C.
Solve the following optimization problem:
Creation: 3) Generating p-Mappings
S=(num, pname, home-addr, office-addr)
T=(name, mailing-addr)
0.8 0.5 0.50.2
(num, pname, home-addr, office-addr)
(name, mailing-addr)p=.4
(num, pname, home-addr, office-addr)
(name, mailing-addr)p=.4
(num, pname, home-addr, office-addr)
(name, mailing-addr)p=.1
(num, pname, home-addr, office-addr)
(name, mailing-addr)p=.1
Creation: 3) Generating p-Mappings
S=(num, pname, home-addr, office-addr)
T=(name, mailing-addr)
0.8 0.5 0.50.2
(num, pname, home-addr, office-addr)
(name, mailing-addr)p=.3
(num, pname, home-addr, office-addr)
(name, mailing-addr)p=.5
(num, pname, home-addr, office-addr)
(name, mailing-addr)p=.2
Creation: 3) Generating p-Mappings However, different p-mappings can be consistent
w. the same set of weighted correspondences. Solution: Choose the p-mapping w. the maximum
entropy. Algorithm:
Given C, enumerate all possible one-to-one mappings with a subset of correspondences in C.
Solve the following optimization problem:
Creation: 3) Generating p-Mappings
S=(num, pname, home-addr, office-addr)
T=(name, mailing-addr)
0.8 0.5 0.50.2
(num, pname, home-addr, office-addr)
(name, mailing-addr)p=.4
(num, pname, home-addr, office-addr)
(name, mailing-addr)p=.4
(num, pname, home-addr, office-addr)
(name, mailing-addr)p=.1
(num, pname, home-addr, office-addr)
(name, mailing-addr)p=.1
Entropy = 0.52 (v.s. 0.45)
Related Work Approximate schema mappings
Using top-k schema mappings can increase the recall of query answering w/o sacrificing precision much [Magnani&Montesi, 2007]
Generating top-k schema mappings by combining results by various matchers [Gal, 2007]
P-Med-Schema
Q
D1
D2
D3
D4
D5PD1
PD2
PD3
PD4
PD5
PM1
PM2
PM3
PM4
PM5
Q1..Qm
Roadmap
1. Probabilistic schema mappings• Query answering w.r.t. pMs [VLDB’07]• Automatic creation [Sigmod’08]
2. Probabilistic mediated schema[Sigmod’08]
• Expressiveness• Automatic creation
3. Probabilistic Integrity Constraints [WebDB’09] • Automatic creation• An application
An Example Mediated Schema
S1(name, email, phone-num, address) S2(person-name,phone,mailing-addr)
Med-S (name, email, phone, addr)
{name, person-name}
{phone-num, phone}
{address,mailing-addr}
{email}
A mediated schema can be considered as a clustering of important attributes in source schemas
Med1 ({name}, {phone, hPhone, oPhone}, {address, hAddr, oAddr})
Why Probabilistic Mediated Schema?
S1(name, hPhone, oPhone, hAddr, oAddr) S2(name,phone,address)
?
Q: SELECT name, hPhone, oPhone FROM Med
Why Probabilistic Mediated Schema?
S1(name, hPhone, oPhone, hAddr, oAddr) S2(name,phone,address)
Med1 ({name}, {phone, hPhone, oPhone}, {address, hAddr, oAddr})
Med2 ({name}, {phone, hPhone}, {oPhone}, {address, oAddr}, {hAddr})
Q: SELECT name, phone, address FROM Med
Med3 ({name}, {phone, hPhone}, {oPhone}, {address, hAddr}, {oAddr})
Why Probabilistic Mediated Schema?
S1(name, hPhone, oPhone, hAddr, oAddr) S2(name,phone,address)
Med1 ({name}, {phone, hPhone, oPhone}, {address, hAddr, oAddr})
Med2 ({name}, {phone, hPhone}, {oPhone}, {address, oAddr}, {hAddr})
Q: SELECT name, phone, address FROM Med
Med4 ({name}, {phone, oPhone}, {hPhone}, {address, oAddr}, {hAddr})
Med3 ({name}, {phone, hPhone}, {oPhone}, {address, hAddr}, {oAddr})
Why Probabilistic Mediated Schema?
S1(name, hPhone, oPhone, hAddr, oAddr) S2(name,phone,address)
Med1 ({name}, {phone, hPhone, oPhone}, {address, hAddr, oAddr})
Med2 ({name}, {phone, hPhone}, {oPhone}, {address, oAddr}, {hAddr})
Q: SELECT name, phone, address FROM Med
Med5 ({name}, {phone}, {hPhone}, {oPhone}, {address}, {hAddr}, {oAddr})
Med3 ({name}, {phone, hPhone}, {oPhone}, {address, hAddr}, {oAddr})
Why Probabilistic Mediated Schema?
S1(name, hPhone, oPhone, hAddr, oAddr) S2(name,phone,address)
Med1 ({name}, {phone, hPhone, oPhone}, {address, hAddr, oAddr})
Med2 ({name}, {phone, hPhone}, {oPhone}, {address, oAddr}, {hAddr})
Med4 ({name}, {phone, oPhone}, {hPhone}, {address, oAddr}, {hAddr})
Q: SELECT name, phone, address FROM Med
Med5 ({name}, {phone}, {hPhone}, {oPhone}, {address}, {hAddr}, {oAddr})
Med3 ({name}, {phone, hPhone}, {oPhone}, {address, hAddr}, {oAddr})
Why Probabilistic Mediated Schema?
S1(name, hPhone, oPhone, hAddr, oAddr) S2(name,phone,address)
Med1 ({name}, {phone, hPhone, oPhone}, {address, hAddr, oAddr})
Med2 ({name}, {phone, hPhone}, {oPhone}, {address, oAddr}, {hAddr})
Med4 ({name}, {phone, oPhone}, {hPhone}, {address, oAddr}, {hAddr})
Q: SELECT name, phone, address FROM Med
Probabilistic Mediated Schema
Med3 ({name}, {phone, hPhone}, {oPhone}, {address, hAddr}, {oAddr})
Our Solution
S1(name, hPhone, oPhone, hAddr, oAddr) S2(name,phone,address)
Med4 ({name}, {phone, oPhone}, {hPhone}, {address, oAddr}, {hAddr})
Pr=.5
Pr=.5
P-Mappings w.r.t. P-Med-Schema
PM1
Med3 (name, hPP, oP, hAA, oA)
S1(name, hP, oP, hA, oA)Pr=.64
Med3 (name, hPP, oP, hAA, oA)
S1(name, hP, oP, hA, oA)Pr=.16
Med3 (name, hPP, oP, hAA, oA)
S1(name, hP, oP, hA, oA)Pr=.16
Med3 (name, hPP, oP, hAA, oA)
S1(name, hP, oP, hA, oA)Pr=.04
PM2
Med4 (name, oPP, hP, oAA, hA)
S1(name, hP, oP, hA, oA)Pr=.64
Med4 (name, oPP, hP, oAA, hA)
S1(name, hP, oP, hA, oA)Pr=.16
Med4 (name, oPP, hP, oAA, hA)
S1(name, hP, oP, hA, oA)Pr=.16
Med4 (name, oPP, hP, oAA, hA)
S1(name, hP, oP, hA, oA)Pr=.04
Query Answeringw.r.t. P-Mappings & P-Med-Schema
SELECT name, phone, address
FROM Med-S
TupleProbabilit
y
(‘Alice’, ‘123-4567’, ‘123 A Ave.’)
0.34
(‘Alice’, ‘765-4321’, ‘456 B Ave.’)
0.34
(‘Alice’, ‘765-4321’, ‘123 A Ave.’)
0.16
(‘Alice’, ‘123-4567’, ‘456 B Ave.’)
0.16
name hPhone oPhone hAddr oAddr
Alice 123-4567 765-4321 123, A Ave. 456, B Ave
S1
Q
Answers
Expressive Power of P-Med-Schema v.s. P-Mapping
Theorem 1. For one-to-many mappings(p-med-schema + p-mappings)
= (mediated schema + p-mapping)
> (p-med-schema + mappings) Theorem 2. When restricted to one-to-one
mappings:(p-med-schema + p-mappings)
> (mediated schema + p-mapping)
Creation: 1) Creating a Single Med-Schema Input: Single-table source schemas S1, …, Sn
Output: Single-table mediated schema M
Algorithm1. Remove all infrequent attributes
2. Find similarity between every pair of attributes and construct a weighted graph
3. Remove edges with weight below τ (e.g., τ=.5)
4. Each connected com-ponent is a cluster
S2
S1
name address
email-address
pname home-address
1
.6
.6
.2
Creation: 1) Creating a Single Med-Schema Input: Single-table source schemas S1, …, Sn
Output: Single-table mediated schema M
Algorithm1. Remove all infrequent attributes
2. Find similarity between every pair of attributes and construct a weighted graph
3. Remove edges with weight below τ (e.g., τ=.5)
4. Each connected com-ponent is a cluster
S2
S1
name address
email-address
pname home-address
1
.6
.6
Creation: 2) Creating All Possible Med-Schemas Algorithm1. Remove all infrequent attributes2. Find similarity between every pair of attributes and
construct a weighted graph3. For each edge
• (weight ≥ τ+ϵ) retain• (weight < τ-ϵ) drop• (τ-ϵ ≤ weight < τ+ϵ)
uncertain edge(e.g., τ =.6, ϵ = .2)
Clustering for each comboof including/excluding uncertain edges
S2
S1
name address
email-address
pname home-address
1
.6
.6
.2
Creation: 2) Creating All Possible Med-Schemas Algorithm1. Remove all infrequent attributes2. Find similarity between every pair of attributes and
construct a weighted graph3. For each edge
• (weight ≥ τ+ϵ) retain• (weight < τ-ϵ) drop• (τ-ϵ ≤ weight < τ+ϵ)
uncertain edge(e.g., τ =.6, ϵ = .2)
Clustering for each comboof including/excluding uncertain edges
S2
S1
name address
email-address
pname home-address
1
.6
.6
Creation: 2) Creating All Possible Med-Schemas Algorithm1. Remove all infrequent attributes2. Find similarity between every pair of attributes and
construct a weighted graph3. For each edge
• (weight ≥ τ+ϵ) retain• (weight < τ-ϵ) drop• (τ-ϵ ≤ weight < τ+ϵ)
uncertain edge(e.g., τ =.6, ϵ = .2)
Clustering for each comboof including/excluding uncertain edges
S2
S1
name address
email-address
pname home-address
1
.6
.6
Creation: 2) Creating All Possible Med-Schemas Algorithm1. Remove all infrequent attributes2. Find similarity between every pair of attributes and
construct a weighted graph3. For each edge
• (weight ≥ τ+ϵ) retain• (weight < τ-ϵ) drop• (τ-ϵ ≤ weight < τ+ϵ)
uncertain edge(e.g., τ =.6, ϵ = .2)
Clustering for each comboof including/excluding uncertain edges
S2
S1
name address
email-address
pname home-address
1
.6
Creation: 2) Creating All Possible Med-Schemas Algorithm1. Remove all infrequent attributes2. Find similarity between every pair of attributes and
construct a weighted graph3. For each edge
• (weight ≥ τ+ϵ) retain• (weight < τ-ϵ) drop• (τ-ϵ ≤ weight < τ+ϵ)
uncertain edge(e.g., τ =.6, ϵ = .2)
Clustering for each comboof including/excluding uncertain edges
S2
S1
name address
email-address
pname home-address
1 .6
Creation: 2) Creating All Possible Med-Schemas Algorithm1. Remove all infrequent attributes2. Find similarity between every pair of attributes and
construct a weighted graph3. For each edge
• (weight ≥ τ+ϵ) retain• (weight < τ-ϵ) drop• (τ-ϵ ≤ weight < τ+ϵ)
uncertain edge(e.g., τ =.6, ϵ = .2)
Clustering for each comboof including/excluding uncertain edges
S2
S1
name address
email-address
pname home-address
1
Creation: 3) Computing Probabilities Mediated schema M and source S are consistent if no two
attributes of S are grouped into same cluster in M
S2
S1name address
email-address
pname home-addressS2
S1name address
email-address
pname home-address
S2
S1name address
email-address
pname home-addressS2
S1name address
email-address
pname home-address
Creation: 3) Computing Probabilities Assign probabilities to each M proportional to the
number of sources it is consistent with.
S2
S1name address
email-address
pname home-addressS2
S1name address
email-address
pname home-address
S2
S1name address
email-address
pname home-addressS2
S1name address
email-address
pname home-address
Pr=1/6 Pr=1/6
Pr=1/3 Pr=1/3
Real Example (650 bib sources)
Consolidation of Possible Med-Schemas Why?
Users expect to see a single mediated schema More efficient query processing
Consolidate possible med-schemas Two attrs are in the same cluster only if they are in the
same cluster in each possible med-schema I.e., Keep only certain edges and do clustering
Consolidate p-mappings Convert each p-mapping to a p-mapping to the
consolidated med-schema Assign probabilities accordingly
Probabilistic Mediated Schema
Med3 ({name}, {phone, hPhone}, {oPhone}, {address, hAddr}, {oAddr})
Consolidated Med-Schema
S1(name, hPhone, oPhone, hAddr, oAddr) S2(name,phone,address)
Med4 ({name}, {phone, oPhone}, {hPhone}, {address, oAddr}, {hAddr})
Pr=.5
Pr=.5
Med ({name}, {phone}, {hPhone}, {oPhone}, {address}, {hAddr}, {oAddr})
Consolidated P-Mapping
Med (name, P, hP, oP, A, hA, oA)
S1(name, hP, oP, hA, oA)Pr=.02
Med (name, P, hP, oP, A, hA, oA)
S1(name, hP, oP, hA, oA)Pr=.08
Med (name, P, hP, oP, A, hA, oA)
S1(name, hP, oP, hA, oA)Pr=.08
Med (name, P, hP, oP, A, hA, oA)
S1(name, hP, oP, hA, oA)Pr=.32
Med (name, P, hP, oP, A, hA, oA)
S1(name, hP, oP, hA, oA)Pr=.08
Med (name, P, hP, oP, A, hA, oA)
S1(name, hP, oP, hA, oA)Pr=.08
Med (name, P, hP, oP, A, hA, oA)
S1(name, hP, oP, hA, oA)Pr=.32
Med (name, P, hP, oP, A, hA, oA)
S1(name, hP, oP, hA, oA)Pr=.02
Higher pr for right
correlations
One-to-many mappings
Med ({name}, {phone}, {hPhone}, {oPhone}, {address}, {hAddr}, {oAddr})
S1(name, hPhone, oPhone, hAddr, oAddr) S2(name,phone,address)
Related Work Automatic generation of mediated
schemasFinding a generative model [He&Chang,
2003]
P-Med-Schema
Q
D1
D2
D3
D4
D5PD1
PD2
PD3
PD4
PD5
PM1
PM2
PM3
PM4
PM5
Q1..Qm
Roadmap
1. Probabilistic schema mappings• Query answering w.r.t. pMs [VLDB’07]• Automatic creation [Sigmod’08]
2. Probabilistic mediated schema [Sigmod’08]• Expressiveness• Automatic creation
3. Probabilistic Integrity Constraints [WebDB’09]
• Automatic creation• An application
Functional Dependency Functional dependency:
: a set of attributes : an attributeE.g., SSN name
ApplicationsData cleaningMediated-schema normalization
Normalizing the P-Med-Schema
Probabilistic Functional Dependency Uncertainty
Noisy data false negativeBias false positive Incorrect schema mappings Incorrect instance mappings
Probabilistic functional dependency:
p: likelihood of the FD being correct Apply probabilistic analysis.
Related Work Approximate functional dependencies:
generating approximate FDs in a statistical way from a single data sourceTANE(1999)CORDS(2004)
S2S5S3
S4
Architecture of the UDI System
UDI
S1
Schema
Constraint Discovery
Data
Attribute Matching
Attribute Clustering
Attr sim
Schema Mapping
P-med-schema Attr sim
Schema Consolidation
P-med-schema P-mappings
Normalized schema & P-mappings
Schema Normalization
Consolidated schemaPFDs
Query Rewriting
Query
Query Answering
Rewritten Query
Data
Answers
Mappings
Experimental Settings (I) Data: tables extracted from HTML tables on the web
Domain #Src Search Keywords
Movie 161 movie, year
Car 817 make, model
People 49job/title, organization/company/employer
Course 647course/class, instructor/teacher/lecturer, subject/department/title
Bib 649 author, title, year, journal/conference
Experimental Settings (II) Queries: 10 SQL queries for each domain
SELECT clause contains 1 to 4 attributes WHERE clause contains 0 to 3 predicates
[attr] =|<>|<|<=|>|>=|LIKE [const] Selectivity and likelihood of correct mapping vary
Performance measurePrecision Recall F-measure
#(correct answers) #(correct answers) 2·Precision·Recall #(returned answers) #(real answers) Precision + Recall
Golden standard: Generated by manually creating med-schema and mappings
Approximate standard: Generated by manually removing incorrect answers from those by UDI and by directly querying each data source
Quality of Query Answering
Domain Precision Recall F-measure
Golden Standard
People 1 .849 .918
Course 1 .852 .92
Approximate Golden Standard
Movie .95 1 .924
Car 1 .917 .957
People .958 .984 .971
Course 1 1 1
Bib 1 .955 .977
Comparison w. Other Integration Approaches
Keyword search is limited and obtained both low precision and low recall.
Querying the sources directly or considering only the mapping w. the highest probability obtained low recall.
UDI obtained highest F-measure in all domains.
Comparison w. Other Mediated-Schema Generation Methods
Using p-med-schema obtained highest F-measure in all domains.
Simply unioning all source attributes as the mediated schema obtained a low recallGenerating a single
mediated schema obtained similar F-measures.
P-Med-Schema v.s. Single Mediated Schema
People domain
System Setup Time
Car domain
Conclusions Contributions
Many applications require data integration w. uncertainty
Applying probabilistic models to do this in a principled way:
P-mapping, P-med-schema, P-functional-dependency Build the first completely self-configuring DI system
Future work Use the probabilistic model to improve schema
mappings over time Incorporate uncertainty on keyword reformulation and
dirty data Consider dependence between data sources
Acknowledgement Project participants
Alon Halevy (Google, UW) Anish Das Sarma (Stanford) Daisy Zhe Wang (Berkeley) Cong Yu (Yahoo!)
Other people for discussion Divesh Srivastava, Patrick Haffner (AT&T) Dan Suciu (Univ. of Washington) Jayant Madhavan, Omar Benjelloun, Shawn Jefferey,
etc. (Google)
Data Integration with Uncertainty
Xin (Luna) DongData Management Dept @ AT&T
Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma (Stanford), Daisy Zhe Wang
(Berkeley), Cong Yu (Yahoo!)