![Page 1: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma](https://reader037.vdocuments.us/reader037/viewer/2022110207/56649d215503460f949f6c86/html5/thumbnails/1.jpg)
Data Integration with Uncertainty
Xin (Luna) DongData Management Dept @ AT&T
Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma (Stanford), Daisy Zhe Wang
(Berkeley), Cong Yu (Yahoo!)
![Page 2: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma](https://reader037.vdocuments.us/reader037/viewer/2022110207/56649d215503460f949f6c86/html5/thumbnails/2.jpg)
Many Applications Need to Manage Heterogeneous Data Sources
D1
D2
D3
D4
D5
Bib(title, authors, conf, year)
Author(aid, name)Paper(pid, title, year)AuthoredBy(aid,pid)
![Page 3: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma](https://reader037.vdocuments.us/reader037/viewer/2022110207/56649d215503460f949f6c86/html5/thumbnails/3.jpg)
Traditional Data Integration Systems [Ullman, 1997]
D1
D2
D3
D4
D5
Mediated Schema
MappingSELECT P.title AS title, A.name AS author, NULL AS conf, P.year AS year, FROM Author AS A, Paper AS P, AuthoredBy AS BWHERE A.aid=B.aid AND P.pid=B.pid
Author(aid, name)Paper(pid, title, year)AuthoredBy(aid,pid)
Publication(title, author, conf, year)
![Page 4: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma](https://reader037.vdocuments.us/reader037/viewer/2022110207/56649d215503460f949f6c86/html5/thumbnails/4.jpg)
Querying on Traditional Data Integration Systems
D1
D2
D3
D4
D5
Mediated Schema
QQQ
Q1
Q2Q4
Q
Q5
Q3
![Page 5: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma](https://reader037.vdocuments.us/reader037/viewer/2022110207/56649d215503460f949f6c86/html5/thumbnails/5.jpg)
Uncertainty Can Occur at Three Levels in Data Integration Applications
D1
D2
D3
D4
D5
Mediated Schema
I. Data Level
II. Semantics Level
III. Query Level
![Page 6: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma](https://reader037.vdocuments.us/reader037/viewer/2022110207/56649d215503460f949f6c86/html5/thumbnails/6.jpg)
Mediated SchemaP-Med-Schema
Q
Data Integration with Uncertainty
D1
D2
D3
D4
D5
I. Probabilistic Model
PD1
PD2
PD3
PD4
PD5
PM1
PM2
PM3
PM4
PM5
Q1..Qm
II. Keyword QueryReformulationIII. Top-K Query
Answering
![Page 7: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma](https://reader037.vdocuments.us/reader037/viewer/2022110207/56649d215503460f949f6c86/html5/thumbnails/7.jpg)
Our Approach
Many applications require only approximate answers and do not need full integration.
Pay-as-you-go data integration [Halevy, et al., 2005] Handle uncertainty and provide best-effort services
from the outset Evolve mappings between data sources over time
This talk: build a completely self-configuring data integration system by applying probabilistic models.
![Page 8: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma](https://reader037.vdocuments.us/reader037/viewer/2022110207/56649d215503460f949f6c86/html5/thumbnails/8.jpg)
P-Med-Schema
Q
D1
D2
D3
D4
D5PD1
PD2
PD3
PD4
PD5
PM1
PM2
PM3
PM4
PM5
Q1..Qm
Technical Contributions
1. Probabilistic schema mappings• Query answering w.r.t. pMs
[VLDB’07]• Automatic creation [Sigmod’08]
2. Probabilistic mediated schema [Sigmod’08]• Expressiveness• Automatic creation
3. Probabilistic Integrity Constraints[WebDB’09]
• Automatic creation• An application
![Page 9: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma](https://reader037.vdocuments.us/reader037/viewer/2022110207/56649d215503460f949f6c86/html5/thumbnails/9.jpg)
P-Med-Schema
Q
D1
D2
D3
D4
D5PD1
PD2
PD3
PD4
PD5
PM1
PM2
PM3
PM4
PM5
Q1..Qm
Roadmap
1. Probabilistic schema mappings• Query answering w.r.t. pMs
[VLDB’07]• Automatic creation [Sigmod’08]
2. Probabilistic mediated schema [Sigmod’08]• Expressiveness• Automatic creation
3. Probabilistic Integrity Constraints [WebDB’09] • Automatic creation• An application
![Page 10: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma](https://reader037.vdocuments.us/reader037/viewer/2022110207/56649d215503460f949f6c86/html5/thumbnails/10.jpg)
Example Probabilistic Mappings
Mediated Schema
T(name, email, mailing-addr, home-addr, office-addr)
Q: SELECT mailing-addr FROM T
Q1: SELECT current-addr FROM S
Q2: SELECT permanent-addr FROM S
Q3: SELECT email-addr FROM S
PName Email-addr Current-addr Permanent-addr
Alice alice@ Mountain View Sunnyvale
Bob bob@ Sunnyvale Sunnyvale
S
T(name, email, mailing-addr, home-addr, office-addr) 0.5
S(pname, email-addr, current-addr, permanent-addr)
m1
T(name, email, mailing-addr, home-addr, office-addr) 0.4
S(pname, email-addr, current-addr, permanent-addr)
m2
T(name, email, mailing-addr, home-addr, office-addr) 0.1
S(pname, email-addr, current-addr, permanent-addr)
m3
Tuple Prob
(‘Sunnyvale’)
(‘Mountain View’)
(‘alice@’)
(‘bob@’)
0.9
0.5
0.1
0.1
![Page 11: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma](https://reader037.vdocuments.us/reader037/viewer/2022110207/56649d215503460f949f6c86/html5/thumbnails/11.jpg)
Q1: SELECT current-addr FROM S
Q2: SELECT permanent-addr FROM S
Mediated Schema
T(name, email, mailing-addr, home-addr, office-addr)
Q: SELECT mailing-addr FROM T
Q1: SELECT current-addr FROM S
Q2: SELECT permanent-addr FROM S
Q3: SELECT email-addr FROM S
PName Email-addr Current-addr Permanent-addr
Alice alice@ Mountain View Sunnyvale
Bob bob@ Sunnyvale Sunnyvale
S
T(name, email, mailing-addr, home-addr, office-addr) 0.5
S(pname, email-addr, current-addr, permanent-addr)
m1
T(name, email, mailing-addr, home-addr, office-addr) 0.4
S(pname, email-addr, current-addr, permanent-addr)
m2
T(name, email, mailing-addr, home-addr, office-addr) 0.1
S(pname, email-addr, current-addr, permanent-addr)
m3
Tuple Prob
(‘Sunnyvale’)
(‘Mountain View’)
(‘alice@’)
(‘bob@’)
0.9
0.5
0.1
0.1
Top-k Query Answering w.r.t. Probabilistic Mappings
Tuple Prob
(‘Sunnyvale’)(‘Mountain View’)
(‘alice@’)
(‘bob@’)
0.90.5
0.1
0.1
![Page 12: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma](https://reader037.vdocuments.us/reader037/viewer/2022110207/56649d215503460f949f6c86/html5/thumbnails/12.jpg)
Schema Mapping
S=(pname, email-addr, home-addr, office-addr)
T=(name, mailing-addr)
Mappings: one-to-one schema matching Queries: select-project-join queries
![Page 13: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma](https://reader037.vdocuments.us/reader037/viewer/2022110207/56649d215503460f949f6c86/html5/thumbnails/13.jpg)
Probabilistic Mapping
S=(pname, email-addr, home-addr, office-addr)
T=(name, mailing-addr)
Possible MappingProbabil
ity{(pname,name),(home-addr, mailing-addr)}
0.5
{(pname,name),(office-addr, mailing-addr)}
0.4
{(pname,name),(email-addr, mailing-addr)}
0.1
![Page 14: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma](https://reader037.vdocuments.us/reader037/viewer/2022110207/56649d215503460f949f6c86/html5/thumbnails/14.jpg)
By-Table v.s. By-Tuple Semantics
![Page 15: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma](https://reader037.vdocuments.us/reader037/viewer/2022110207/56649d215503460f949f6c86/html5/thumbnails/15.jpg)
By-Table v.s. By-Tuple Semantics
pname
email-addr
home-addroffice-addr
Alice alice@Mountain
ViewSunnyvale
Bob bob@ Sunnyvale Sunnyvale
Ds=
name
mailing-addr
AliceMountain
View
Bob Sunnyvale
DT=nam
emailing-
addr
Alice Sunnyvale
Bob Sunnyvale
name
mailing-addr
Alice alice@
Bob bob@ Pr(m1)=0.5 Pr(m2)=0.4 Pr(m3)=0.1
![Page 16: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma](https://reader037.vdocuments.us/reader037/viewer/2022110207/56649d215503460f949f6c86/html5/thumbnails/16.jpg)
By-Table v.s. By-Tuple Semantics
pname
email-addr
home-addroffice-addr
Alice alice@Mountain
ViewSunnyvale
Bob bob@ Sunnyvale Sunnyvale
Ds=
name
mailing-addr
AliceMountain
View
Bob bob@
DT=nam
emailing-
addr
AliceSunnyval
e
Bob bob@
name
mailing-addr
Alice alice@
Bob bob@ Pr(<m1,m3>)=0.05 Pr(<m2,m3>)=0.04 Pr(<m3,m3>)=0.01
…
![Page 17: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma](https://reader037.vdocuments.us/reader037/viewer/2022110207/56649d215503460f949f6c86/html5/thumbnails/17.jpg)
By-Table Query Answering
pname
email-addr
home-addroffice-addr
Alice alice@Mountain
ViewSunnyvale
Bob bob@ Sunnyvale Sunnyvale
Ds=
name
mailing-addr
AliceMountain
View
Bob Sunnyvale
DT=nam
emailing-
addr
AliceSunnyval
e
BobSunnyval
e
name
mailing-addr
Alice alice@
Bob bob@ 0.5 0.4 0.1
SELECT mailing-addr
FROM T
Tuple Probability
(‘Sunnyvale’) 0.9
(‘Mountain View’) 0.5
(‘alice@’) 0.1
(‘bob@’) 0.1
![Page 18: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma](https://reader037.vdocuments.us/reader037/viewer/2022110207/56649d215503460f949f6c86/html5/thumbnails/18.jpg)
By-Tuple Query Answering
pname
email-addr
home-addroffice-addr
Alice alice@Mountain
ViewSunnyvale
Bob bob@ Sunnyvale Sunnyvale
Ds=
name
mailing-addr
AliceMountain
View
Bob bob@
DT=nam
emailing-
addr
AliceSunnyval
e
Bob bob@
name
mailing-addr
Alice alice@
Bob bob@ 0.05 0.04 0.01
Tuple Probability
(‘Sunnyvale’) 0.94
(‘Mountain View’) 0.5
(‘alice@’) 0.1
(‘bob@’) 0.1
…
SELECT mailing-addr
FROM T
![Page 19: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma](https://reader037.vdocuments.us/reader037/viewer/2022110207/56649d215503460f949f6c86/html5/thumbnails/19.jpg)
Complexity of Query Answering
By-table By-tupleData Complexity PTIME #P-
complete
Mapping Complexity
PTIME PTIME
![Page 20: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma](https://reader037.vdocuments.us/reader037/viewer/2022110207/56649d215503460f949f6c86/html5/thumbnails/20.jpg)
By-Table Query Answering
SELECT mailing-addr FROM T
SELECT home-addr FROM S (0.5)
SELECT office-addr FROM S (0.4)
SELECT email-addr FROM S (0.1)
Theorem: Query answering in by-table semantics is in PTIME in the size of the data and the size of the mapping
Query Rewriting
![Page 21: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma](https://reader037.vdocuments.us/reader037/viewer/2022110207/56649d215503460f949f6c86/html5/thumbnails/21.jpg)
By-Tuple Query Answering
pname email-addr home-addr office-addr
Alice alice@ Mountain View Sunnyvale
Bob bob@ Sunnyvale San Jose
Ds=
DT=
name
mailing-addr
Alice Sunnyvale
Bob Sunnyvale0.2
…
SELECT mailing-addr
FROM T
Theorem: Query answering in by-tuple semantics is #P-complete in the size of the data, and in PTIME in the size of the mapping
Target Enumeration name
mailing-addr
Alice Sunnyvale
Bob @bob0.04
name mailing-addr
Alice alice@
name mailing-addr
AliceMountain
View
Bob @bob0.05
![Page 22: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma](https://reader037.vdocuments.us/reader037/viewer/2022110207/56649d215503460f949f6c86/html5/thumbnails/22.jpg)
More on By-Tuple Query Answering The high complexity comes from computing
probabilitiesTheorem: Even computing the probability for
one possible answer is #P-hardTheorem: Computing all possible answers w/o
probabilities is in PTIME In general by-tuple query answering cannot be
done by query rewriting There are two subsets of queries that can be
answered in PTIME by query rewriting
![Page 23: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma](https://reader037.vdocuments.us/reader037/viewer/2022110207/56649d215503460f949f6c86/html5/thumbnails/23.jpg)
PTIME for Queries with a Single P-Mapping Target
Answers
(‘Mountain View’)
(‘Sunnyvale’)
SELECT mailing-addr FROM T
SELECT home-addr FROM S
SELECT office-addr FROM S
SELECT email-addr FROM S
Answers
(‘Sunnyvale’)
(‘Sunnyvale’)
Answers
(‘alice@’)
(‘bob@’)
Pr=0.5 Pr=0.4 Pr=0.1
TupleProbabil
ity
(‘Sunnyvale’) 0.94
(‘Mountain View’)
0.5
(‘alice@’) 0.1
(‘bob@’) 0.1
=1-(1-0.4)(1-0.5-0.4)=1-(1-0.5)(1-0)
![Page 24: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma](https://reader037.vdocuments.us/reader037/viewer/2022110207/56649d215503460f949f6c86/html5/thumbnails/24.jpg)
PTIME for Queries that Return Join Attributes
SELECT mailing-addr FROM T,VWHERE T.mailing-addr = V.hightech
SELECT mailing-addr FROM T
SELECT hightechFROM V
TupleProbabil
ity
(‘Sunnyvale’) 0.94
(‘Mountain View’)
0.5
(‘alice@’) 0.1
(‘bob@’) 0.1
=0.94*0.8
TupleProbabil
ity
(‘Sunnyvale’) 0.8
(‘Mountain View’)
0.8
TupleProbabil
ity
(‘Sunnyvale’) 0.752
(‘Mountain View’)
0.4
![Page 25: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma](https://reader037.vdocuments.us/reader037/viewer/2022110207/56649d215503460f949f6c86/html5/thumbnails/25.jpg)
Query Rewriting Does Not Apply to Queries that Do NOT Return Join Attributes
SELECT ‘true’ FROM T,VWHERE T.mailing-addr = V.hightech
SELECT mailing-addr FROM T
SELECT hightechFROM V
TupleProbabil
ity
(‘Sunnyvale’) 0.94
(‘Mountain View’)
0.5
(‘alice@’) 0.1
(‘bob@’) 0.1
≠0.94*0.8+0.5*0.8
TupleProbabil
ity
(‘Sunnyvale’) 0.8
(‘Mountain View’)
0.8
TupleProbabilit
y
(‘true’) 0.864
![Page 26: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma](https://reader037.vdocuments.us/reader037/viewer/2022110207/56649d215503460f949f6c86/html5/thumbnails/26.jpg)
Extensions to More Expressive Mappings The complexity results for query answering carry over to
three extensions to more expressive mappings Complex mappings
E.g., address street, city and state GLAV mappings
E.g., Paper Authorship Author Publication Conditional mappings:
E.g., if age>65, Pr(home-addr mailing-addr)=0.8 if age<=65, Pr(home-addr mailing-addr)=0.5
![Page 27: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma](https://reader037.vdocuments.us/reader037/viewer/2022110207/56649d215503460f949f6c86/html5/thumbnails/27.jpg)
Creation: 1) Matching Attributes Current schema matching techniques can
compute similarity between source attributes and target attributes---Weighted attribute correspondences
Goal: find a p-mapping that is consistent w. a set of weighted correspondences
S=(num, pname, home-addr, office-addr)
T=(name, mailing-addr)
0.8 0.9 0.90.2
![Page 28: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma](https://reader037.vdocuments.us/reader037/viewer/2022110207/56649d215503460f949f6c86/html5/thumbnails/28.jpg)
Creation: 1) Matching Attributes
S=(num, pname, home-addr, office-addr)
T=(name, mailing-addr)
0.8 0.9 0.90.2
(num, pname, home-addr, office-addr)
(name, mailing-addr)p=.4
(num, pname, home-addr, office-addr)
(name, mailing-addr)p=.4
(num, pname, home-addr, office-addr)
(name, mailing-addr)p=.1
(num, pname, home-addr, office-addr)
(name, mailing-addr)p=.1
![Page 29: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma](https://reader037.vdocuments.us/reader037/viewer/2022110207/56649d215503460f949f6c86/html5/thumbnails/29.jpg)
Creation: 2) Normalization Theorem: Let C be a set of weighted
correspondences. There exists a p-mapping consistent w. C if and only if for every src/target attribute a, the sum of the weights of all correspondences that involve a is at most 1.
Approach: Normalization
S=(num, pname, home-addr, office-addr)
T=(name, mailing-addr)
0.8 0.9 0.90.2
![Page 30: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma](https://reader037.vdocuments.us/reader037/viewer/2022110207/56649d215503460f949f6c86/html5/thumbnails/30.jpg)
Creation: 2) Normalization Theorem: Let C be a set of weighted
correspondences. There exists a p-mapping consistent w. C if and only if for every src/target attribute a, the sum of the weights of all correspondences that involve a is at most 1.
Approach: Normalization
S=(num, pname, home-addr, office-addr)
T=(name, mailing-addr)
0.8 0.5 0.50.2
![Page 31: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma](https://reader037.vdocuments.us/reader037/viewer/2022110207/56649d215503460f949f6c86/html5/thumbnails/31.jpg)
Creation: 2) Normalization
S=(num, pname, home-addr, office-addr)
T=(name, mailing-addr)
0.8 0.5 0.50.2
(num, pname, home-addr, office-addr)
(name, mailing-addr)p=.4
(num, pname, home-addr, office-addr)
(name, mailing-addr)p=.4
(num, pname, home-addr, office-addr)
(name, mailing-addr)p=.1
(num, pname, home-addr, office-addr)
(name, mailing-addr)p=.1
![Page 32: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma](https://reader037.vdocuments.us/reader037/viewer/2022110207/56649d215503460f949f6c86/html5/thumbnails/32.jpg)
Creation: 3) Generating p-Mappings However, different p-mappings can be consistent
w. the same set of weighted correspondences.Solution: Choose the p-mapping w. the maximum
entropy.Algorithm:
Given C, enumerate all possible one-to-one mappings with a subset of correspondences in C.
Solve the following optimization problem:
![Page 33: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma](https://reader037.vdocuments.us/reader037/viewer/2022110207/56649d215503460f949f6c86/html5/thumbnails/33.jpg)
Creation: 3) Generating p-Mappings
S=(num, pname, home-addr, office-addr)
T=(name, mailing-addr)
0.8 0.5 0.50.2
(num, pname, home-addr, office-addr)
(name, mailing-addr)p=.4
(num, pname, home-addr, office-addr)
(name, mailing-addr)p=.4
(num, pname, home-addr, office-addr)
(name, mailing-addr)p=.1
(num, pname, home-addr, office-addr)
(name, mailing-addr)p=.1
![Page 34: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma](https://reader037.vdocuments.us/reader037/viewer/2022110207/56649d215503460f949f6c86/html5/thumbnails/34.jpg)
Creation: 3) Generating p-Mappings
S=(num, pname, home-addr, office-addr)
T=(name, mailing-addr)
0.8 0.5 0.50.2
(num, pname, home-addr, office-addr)
(name, mailing-addr)p=.3
(num, pname, home-addr, office-addr)
(name, mailing-addr)p=.5
(num, pname, home-addr, office-addr)
(name, mailing-addr)p=.2
![Page 35: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma](https://reader037.vdocuments.us/reader037/viewer/2022110207/56649d215503460f949f6c86/html5/thumbnails/35.jpg)
Creation: 3) Generating p-Mappings However, different p-mappings can be consistent
w. the same set of weighted correspondences. Solution: Choose the p-mapping w. the maximum
entropy. Algorithm:
Given C, enumerate all possible one-to-one mappings with a subset of correspondences in C.
Solve the following optimization problem:
![Page 36: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma](https://reader037.vdocuments.us/reader037/viewer/2022110207/56649d215503460f949f6c86/html5/thumbnails/36.jpg)
Creation: 3) Generating p-Mappings
S=(num, pname, home-addr, office-addr)
T=(name, mailing-addr)
0.8 0.5 0.50.2
(num, pname, home-addr, office-addr)
(name, mailing-addr)p=.4
(num, pname, home-addr, office-addr)
(name, mailing-addr)p=.4
(num, pname, home-addr, office-addr)
(name, mailing-addr)p=.1
(num, pname, home-addr, office-addr)
(name, mailing-addr)p=.1
Entropy = 0.52 (v.s. 0.45)
![Page 37: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma](https://reader037.vdocuments.us/reader037/viewer/2022110207/56649d215503460f949f6c86/html5/thumbnails/37.jpg)
Related Work Approximate schema mappings
Using top-k schema mappings can increase the recall of query answering w/o sacrificing precision much [Magnani&Montesi, 2007]
Generating top-k schema mappings by combining results by various matchers [Gal, 2007]
![Page 38: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma](https://reader037.vdocuments.us/reader037/viewer/2022110207/56649d215503460f949f6c86/html5/thumbnails/38.jpg)
P-Med-Schema
Q
D1
D2
D3
D4
D5PD1
PD2
PD3
PD4
PD5
PM1
PM2
PM3
PM4
PM5
Q1..Qm
Roadmap
1. Probabilistic schema mappings• Query answering w.r.t. pMs [VLDB’07]• Automatic creation [Sigmod’08]
2. Probabilistic mediated schema[Sigmod’08]
• Expressiveness• Automatic creation
3. Probabilistic Integrity Constraints [WebDB’09] • Automatic creation• An application
![Page 39: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma](https://reader037.vdocuments.us/reader037/viewer/2022110207/56649d215503460f949f6c86/html5/thumbnails/39.jpg)
An Example Mediated Schema
S1(name, email, phone-num, address) S2(person-name,phone,mailing-addr)
Med-S (name, email, phone, addr)
{name, person-name}
{phone-num, phone}
{address,mailing-addr}
{email}
A mediated schema can be considered as a clustering of important attributes in source schemas
![Page 40: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma](https://reader037.vdocuments.us/reader037/viewer/2022110207/56649d215503460f949f6c86/html5/thumbnails/40.jpg)
Med1 ({name}, {phone, hPhone, oPhone}, {address, hAddr, oAddr})
Why Probabilistic Mediated Schema?
S1(name, hPhone, oPhone, hAddr, oAddr) S2(name,phone,address)
?
Q: SELECT name, hPhone, oPhone FROM Med
![Page 41: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma](https://reader037.vdocuments.us/reader037/viewer/2022110207/56649d215503460f949f6c86/html5/thumbnails/41.jpg)
Why Probabilistic Mediated Schema?
S1(name, hPhone, oPhone, hAddr, oAddr) S2(name,phone,address)
Med1 ({name}, {phone, hPhone, oPhone}, {address, hAddr, oAddr})
Med2 ({name}, {phone, hPhone}, {oPhone}, {address, oAddr}, {hAddr})
Q: SELECT name, phone, address FROM Med
![Page 42: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma](https://reader037.vdocuments.us/reader037/viewer/2022110207/56649d215503460f949f6c86/html5/thumbnails/42.jpg)
Med3 ({name}, {phone, hPhone}, {oPhone}, {address, hAddr}, {oAddr})
Why Probabilistic Mediated Schema?
S1(name, hPhone, oPhone, hAddr, oAddr) S2(name,phone,address)
Med1 ({name}, {phone, hPhone, oPhone}, {address, hAddr, oAddr})
Med2 ({name}, {phone, hPhone}, {oPhone}, {address, oAddr}, {hAddr})
Q: SELECT name, phone, address FROM Med
![Page 43: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma](https://reader037.vdocuments.us/reader037/viewer/2022110207/56649d215503460f949f6c86/html5/thumbnails/43.jpg)
Med4 ({name}, {phone, oPhone}, {hPhone}, {address, oAddr}, {hAddr})
Med3 ({name}, {phone, hPhone}, {oPhone}, {address, hAddr}, {oAddr})
Why Probabilistic Mediated Schema?
S1(name, hPhone, oPhone, hAddr, oAddr) S2(name,phone,address)
Med1 ({name}, {phone, hPhone, oPhone}, {address, hAddr, oAddr})
Med2 ({name}, {phone, hPhone}, {oPhone}, {address, oAddr}, {hAddr})
Q: SELECT name, phone, address FROM Med
![Page 44: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma](https://reader037.vdocuments.us/reader037/viewer/2022110207/56649d215503460f949f6c86/html5/thumbnails/44.jpg)
Med5 ({name}, {phone}, {hPhone}, {oPhone}, {address}, {hAddr}, {oAddr})
Med3 ({name}, {phone, hPhone}, {oPhone}, {address, hAddr}, {oAddr})
Why Probabilistic Mediated Schema?
S1(name, hPhone, oPhone, hAddr, oAddr) S2(name,phone,address)
Med1 ({name}, {phone, hPhone, oPhone}, {address, hAddr, oAddr})
Med2 ({name}, {phone, hPhone}, {oPhone}, {address, oAddr}, {hAddr})
Med4 ({name}, {phone, oPhone}, {hPhone}, {address, oAddr}, {hAddr})
Q: SELECT name, phone, address FROM Med
![Page 45: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma](https://reader037.vdocuments.us/reader037/viewer/2022110207/56649d215503460f949f6c86/html5/thumbnails/45.jpg)
Med5 ({name}, {phone}, {hPhone}, {oPhone}, {address}, {hAddr}, {oAddr})
Med3 ({name}, {phone, hPhone}, {oPhone}, {address, hAddr}, {oAddr})
Why Probabilistic Mediated Schema?
S1(name, hPhone, oPhone, hAddr, oAddr) S2(name,phone,address)
Med1 ({name}, {phone, hPhone, oPhone}, {address, hAddr, oAddr})
Med2 ({name}, {phone, hPhone}, {oPhone}, {address, oAddr}, {hAddr})
Med4 ({name}, {phone, oPhone}, {hPhone}, {address, oAddr}, {hAddr})
Q: SELECT name, phone, address FROM Med
![Page 46: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma](https://reader037.vdocuments.us/reader037/viewer/2022110207/56649d215503460f949f6c86/html5/thumbnails/46.jpg)
Probabilistic Mediated Schema
Med3 ({name}, {phone, hPhone}, {oPhone}, {address, hAddr}, {oAddr})
Our Solution
S1(name, hPhone, oPhone, hAddr, oAddr) S2(name,phone,address)
Med4 ({name}, {phone, oPhone}, {hPhone}, {address, oAddr}, {hAddr})
Pr=.5
Pr=.5
![Page 47: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma](https://reader037.vdocuments.us/reader037/viewer/2022110207/56649d215503460f949f6c86/html5/thumbnails/47.jpg)
P-Mappings w.r.t. P-Med-Schema
PM1
Med3 (name, hPP, oP, hAA, oA)
S1(name, hP, oP, hA, oA)Pr=.64
Med3 (name, hPP, oP, hAA, oA)
S1(name, hP, oP, hA, oA)Pr=.16
Med3 (name, hPP, oP, hAA, oA)
S1(name, hP, oP, hA, oA)Pr=.16
Med3 (name, hPP, oP, hAA, oA)
S1(name, hP, oP, hA, oA)Pr=.04
PM2
Med4 (name, oPP, hP, oAA, hA)
S1(name, hP, oP, hA, oA)Pr=.64
Med4 (name, oPP, hP, oAA, hA)
S1(name, hP, oP, hA, oA)Pr=.16
Med4 (name, oPP, hP, oAA, hA)
S1(name, hP, oP, hA, oA)Pr=.16
Med4 (name, oPP, hP, oAA, hA)
S1(name, hP, oP, hA, oA)Pr=.04
![Page 48: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma](https://reader037.vdocuments.us/reader037/viewer/2022110207/56649d215503460f949f6c86/html5/thumbnails/48.jpg)
Query Answeringw.r.t. P-Mappings & P-Med-Schema
SELECT name, phone, address
FROM Med-S
TupleProbabilit
y
(‘Alice’, ‘123-4567’, ‘123 A Ave.’)
0.34
(‘Alice’, ‘765-4321’, ‘456 B Ave.’)
0.34
(‘Alice’, ‘765-4321’, ‘123 A Ave.’)
0.16
(‘Alice’, ‘123-4567’, ‘456 B Ave.’)
0.16
name hPhone oPhone hAddr oAddr
Alice 123-4567 765-4321 123, A Ave. 456, B Ave
S1
Q
Answers
![Page 49: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma](https://reader037.vdocuments.us/reader037/viewer/2022110207/56649d215503460f949f6c86/html5/thumbnails/49.jpg)
Expressive Power of P-Med-Schema v.s. P-Mapping
Theorem 1. For one-to-many mappings(p-med-schema + p-mappings)
= (mediated schema + p-mapping)
> (p-med-schema + mappings) Theorem 2. When restricted to one-to-one
mappings:(p-med-schema + p-mappings)
> (mediated schema + p-mapping)
![Page 50: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma](https://reader037.vdocuments.us/reader037/viewer/2022110207/56649d215503460f949f6c86/html5/thumbnails/50.jpg)
Creation: 1) Creating a Single Med-Schema Input: Single-table source schemas S1, …, Sn
Output: Single-table mediated schema M
Algorithm1. Remove all infrequent attributes
2. Find similarity between every pair of attributes and construct a weighted graph
3. Remove edges with weight below τ (e.g., τ=.5)
4. Each connected com-ponent is a cluster
S2
S1
name address
email-address
pname home-address
1
.6
.6
.2
![Page 51: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma](https://reader037.vdocuments.us/reader037/viewer/2022110207/56649d215503460f949f6c86/html5/thumbnails/51.jpg)
Creation: 1) Creating a Single Med-Schema Input: Single-table source schemas S1, …, Sn
Output: Single-table mediated schema M
Algorithm1. Remove all infrequent attributes
2. Find similarity between every pair of attributes and construct a weighted graph
3. Remove edges with weight below τ (e.g., τ=.5)
4. Each connected com-ponent is a cluster
S2
S1
name address
email-address
pname home-address
1
.6
.6
![Page 52: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma](https://reader037.vdocuments.us/reader037/viewer/2022110207/56649d215503460f949f6c86/html5/thumbnails/52.jpg)
Creation: 2) Creating All Possible Med-Schemas Algorithm1. Remove all infrequent attributes2. Find similarity between every pair of attributes and
construct a weighted graph3. For each edge
• (weight ≥ τ+ϵ) retain• (weight < τ-ϵ) drop• (τ-ϵ ≤ weight < τ+ϵ)
uncertain edge(e.g., τ =.6, ϵ = .2)
Clustering for each comboof including/excluding uncertain edges
S2
S1
name address
email-address
pname home-address
1
.6
.6
.2
![Page 53: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma](https://reader037.vdocuments.us/reader037/viewer/2022110207/56649d215503460f949f6c86/html5/thumbnails/53.jpg)
Creation: 2) Creating All Possible Med-Schemas Algorithm1. Remove all infrequent attributes2. Find similarity between every pair of attributes and
construct a weighted graph3. For each edge
• (weight ≥ τ+ϵ) retain• (weight < τ-ϵ) drop• (τ-ϵ ≤ weight < τ+ϵ)
uncertain edge(e.g., τ =.6, ϵ = .2)
Clustering for each comboof including/excluding uncertain edges
S2
S1
name address
email-address
pname home-address
1
.6
.6
![Page 54: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma](https://reader037.vdocuments.us/reader037/viewer/2022110207/56649d215503460f949f6c86/html5/thumbnails/54.jpg)
Creation: 2) Creating All Possible Med-Schemas Algorithm1. Remove all infrequent attributes2. Find similarity between every pair of attributes and
construct a weighted graph3. For each edge
• (weight ≥ τ+ϵ) retain• (weight < τ-ϵ) drop• (τ-ϵ ≤ weight < τ+ϵ)
uncertain edge(e.g., τ =.6, ϵ = .2)
Clustering for each comboof including/excluding uncertain edges
S2
S1
name address
email-address
pname home-address
1
.6
.6
![Page 55: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma](https://reader037.vdocuments.us/reader037/viewer/2022110207/56649d215503460f949f6c86/html5/thumbnails/55.jpg)
Creation: 2) Creating All Possible Med-Schemas Algorithm1. Remove all infrequent attributes2. Find similarity between every pair of attributes and
construct a weighted graph3. For each edge
• (weight ≥ τ+ϵ) retain• (weight < τ-ϵ) drop• (τ-ϵ ≤ weight < τ+ϵ)
uncertain edge(e.g., τ =.6, ϵ = .2)
Clustering for each comboof including/excluding uncertain edges
S2
S1
name address
email-address
pname home-address
1
.6
![Page 56: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma](https://reader037.vdocuments.us/reader037/viewer/2022110207/56649d215503460f949f6c86/html5/thumbnails/56.jpg)
Creation: 2) Creating All Possible Med-Schemas Algorithm1. Remove all infrequent attributes2. Find similarity between every pair of attributes and
construct a weighted graph3. For each edge
• (weight ≥ τ+ϵ) retain• (weight < τ-ϵ) drop• (τ-ϵ ≤ weight < τ+ϵ)
uncertain edge(e.g., τ =.6, ϵ = .2)
Clustering for each comboof including/excluding uncertain edges
S2
S1
name address
email-address
pname home-address
1 .6
![Page 57: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma](https://reader037.vdocuments.us/reader037/viewer/2022110207/56649d215503460f949f6c86/html5/thumbnails/57.jpg)
Creation: 2) Creating All Possible Med-Schemas Algorithm1. Remove all infrequent attributes2. Find similarity between every pair of attributes and
construct a weighted graph3. For each edge
• (weight ≥ τ+ϵ) retain• (weight < τ-ϵ) drop• (τ-ϵ ≤ weight < τ+ϵ)
uncertain edge(e.g., τ =.6, ϵ = .2)
Clustering for each comboof including/excluding uncertain edges
S2
S1
name address
email-address
pname home-address
1
![Page 58: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma](https://reader037.vdocuments.us/reader037/viewer/2022110207/56649d215503460f949f6c86/html5/thumbnails/58.jpg)
Creation: 3) Computing Probabilities Mediated schema M and source S are consistent if no two
attributes of S are grouped into same cluster in M
S2
S1name address
email-address
pname home-addressS2
S1name address
email-address
pname home-address
S2
S1name address
email-address
pname home-addressS2
S1name address
email-address
pname home-address
![Page 59: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma](https://reader037.vdocuments.us/reader037/viewer/2022110207/56649d215503460f949f6c86/html5/thumbnails/59.jpg)
Creation: 3) Computing Probabilities Assign probabilities to each M proportional to the
number of sources it is consistent with.
S2
S1name address
email-address
pname home-addressS2
S1name address
email-address
pname home-address
S2
S1name address
email-address
pname home-addressS2
S1name address
email-address
pname home-address
Pr=1/6 Pr=1/6
Pr=1/3 Pr=1/3
![Page 60: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma](https://reader037.vdocuments.us/reader037/viewer/2022110207/56649d215503460f949f6c86/html5/thumbnails/60.jpg)
Real Example (650 bib sources)
![Page 61: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma](https://reader037.vdocuments.us/reader037/viewer/2022110207/56649d215503460f949f6c86/html5/thumbnails/61.jpg)
Consolidation of Possible Med-Schemas Why?
Users expect to see a single mediated schema More efficient query processing
Consolidate possible med-schemas Two attrs are in the same cluster only if they are in the
same cluster in each possible med-schema I.e., Keep only certain edges and do clustering
Consolidate p-mappings Convert each p-mapping to a p-mapping to the
consolidated med-schema Assign probabilities accordingly
![Page 62: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma](https://reader037.vdocuments.us/reader037/viewer/2022110207/56649d215503460f949f6c86/html5/thumbnails/62.jpg)
Probabilistic Mediated Schema
Med3 ({name}, {phone, hPhone}, {oPhone}, {address, hAddr}, {oAddr})
Consolidated Med-Schema
S1(name, hPhone, oPhone, hAddr, oAddr) S2(name,phone,address)
Med4 ({name}, {phone, oPhone}, {hPhone}, {address, oAddr}, {hAddr})
Pr=.5
Pr=.5
Med ({name}, {phone}, {hPhone}, {oPhone}, {address}, {hAddr}, {oAddr})
![Page 63: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma](https://reader037.vdocuments.us/reader037/viewer/2022110207/56649d215503460f949f6c86/html5/thumbnails/63.jpg)
Consolidated P-Mapping
Med (name, P, hP, oP, A, hA, oA)
S1(name, hP, oP, hA, oA)Pr=.02
Med (name, P, hP, oP, A, hA, oA)
S1(name, hP, oP, hA, oA)Pr=.08
Med (name, P, hP, oP, A, hA, oA)
S1(name, hP, oP, hA, oA)Pr=.08
Med (name, P, hP, oP, A, hA, oA)
S1(name, hP, oP, hA, oA)Pr=.32
Med (name, P, hP, oP, A, hA, oA)
S1(name, hP, oP, hA, oA)Pr=.08
Med (name, P, hP, oP, A, hA, oA)
S1(name, hP, oP, hA, oA)Pr=.08
Med (name, P, hP, oP, A, hA, oA)
S1(name, hP, oP, hA, oA)Pr=.32
Med (name, P, hP, oP, A, hA, oA)
S1(name, hP, oP, hA, oA)Pr=.02
Higher pr for right
correlations
One-to-many mappings
Med ({name}, {phone}, {hPhone}, {oPhone}, {address}, {hAddr}, {oAddr})
S1(name, hPhone, oPhone, hAddr, oAddr) S2(name,phone,address)
![Page 64: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma](https://reader037.vdocuments.us/reader037/viewer/2022110207/56649d215503460f949f6c86/html5/thumbnails/64.jpg)
Related Work Automatic generation of mediated
schemasFinding a generative model [He&Chang,
2003]
![Page 65: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma](https://reader037.vdocuments.us/reader037/viewer/2022110207/56649d215503460f949f6c86/html5/thumbnails/65.jpg)
P-Med-Schema
Q
D1
D2
D3
D4
D5PD1
PD2
PD3
PD4
PD5
PM1
PM2
PM3
PM4
PM5
Q1..Qm
Roadmap
1. Probabilistic schema mappings• Query answering w.r.t. pMs [VLDB’07]• Automatic creation [Sigmod’08]
2. Probabilistic mediated schema [Sigmod’08]• Expressiveness• Automatic creation
3. Probabilistic Integrity Constraints [WebDB’09]
• Automatic creation• An application
![Page 66: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma](https://reader037.vdocuments.us/reader037/viewer/2022110207/56649d215503460f949f6c86/html5/thumbnails/66.jpg)
Functional Dependency Functional dependency:
: a set of attributes : an attributeE.g., SSN name
ApplicationsData cleaningMediated-schema normalization
![Page 67: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma](https://reader037.vdocuments.us/reader037/viewer/2022110207/56649d215503460f949f6c86/html5/thumbnails/67.jpg)
Normalizing the P-Med-Schema
![Page 68: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma](https://reader037.vdocuments.us/reader037/viewer/2022110207/56649d215503460f949f6c86/html5/thumbnails/68.jpg)
Probabilistic Functional Dependency Uncertainty
Noisy data false negativeBias false positive Incorrect schema mappings Incorrect instance mappings
Probabilistic functional dependency:
p: likelihood of the FD being correct Apply probabilistic analysis.
![Page 69: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma](https://reader037.vdocuments.us/reader037/viewer/2022110207/56649d215503460f949f6c86/html5/thumbnails/69.jpg)
Related Work Approximate functional dependencies:
generating approximate FDs in a statistical way from a single data sourceTANE(1999)CORDS(2004)
![Page 70: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma](https://reader037.vdocuments.us/reader037/viewer/2022110207/56649d215503460f949f6c86/html5/thumbnails/70.jpg)
S2S5S3
S4
Architecture of the UDI System
UDI
S1
Schema
Constraint Discovery
Data
Attribute Matching
Attribute Clustering
Attr sim
Schema Mapping
P-med-schema Attr sim
Schema Consolidation
P-med-schema P-mappings
Normalized schema & P-mappings
Schema Normalization
Consolidated schemaPFDs
Query Rewriting
Query
Query Answering
Rewritten Query
Data
Answers
Mappings
![Page 71: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma](https://reader037.vdocuments.us/reader037/viewer/2022110207/56649d215503460f949f6c86/html5/thumbnails/71.jpg)
Experimental Settings (I) Data: tables extracted from HTML tables on the web
Domain #Src Search Keywords
Movie 161 movie, year
Car 817 make, model
People 49job/title, organization/company/employer
Course 647course/class, instructor/teacher/lecturer, subject/department/title
Bib 649 author, title, year, journal/conference
![Page 72: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma](https://reader037.vdocuments.us/reader037/viewer/2022110207/56649d215503460f949f6c86/html5/thumbnails/72.jpg)
Experimental Settings (II) Queries: 10 SQL queries for each domain
SELECT clause contains 1 to 4 attributes WHERE clause contains 0 to 3 predicates
[attr] =|<>|<|<=|>|>=|LIKE [const] Selectivity and likelihood of correct mapping vary
Performance measurePrecision Recall F-measure
#(correct answers) #(correct answers) 2·Precision·Recall #(returned answers) #(real answers) Precision + Recall
Golden standard: Generated by manually creating med-schema and mappings
Approximate standard: Generated by manually removing incorrect answers from those by UDI and by directly querying each data source
![Page 73: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma](https://reader037.vdocuments.us/reader037/viewer/2022110207/56649d215503460f949f6c86/html5/thumbnails/73.jpg)
Quality of Query Answering
Domain Precision Recall F-measure
Golden Standard
People 1 .849 .918
Course 1 .852 .92
Approximate Golden Standard
Movie .95 1 .924
Car 1 .917 .957
People .958 .984 .971
Course 1 1 1
Bib 1 .955 .977
![Page 74: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma](https://reader037.vdocuments.us/reader037/viewer/2022110207/56649d215503460f949f6c86/html5/thumbnails/74.jpg)
Comparison w. Other Integration Approaches
Keyword search is limited and obtained both low precision and low recall.
Querying the sources directly or considering only the mapping w. the highest probability obtained low recall.
UDI obtained highest F-measure in all domains.
![Page 75: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma](https://reader037.vdocuments.us/reader037/viewer/2022110207/56649d215503460f949f6c86/html5/thumbnails/75.jpg)
Comparison w. Other Mediated-Schema Generation Methods
Using p-med-schema obtained highest F-measure in all domains.
Simply unioning all source attributes as the mediated schema obtained a low recallGenerating a single
mediated schema obtained similar F-measures.
![Page 76: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma](https://reader037.vdocuments.us/reader037/viewer/2022110207/56649d215503460f949f6c86/html5/thumbnails/76.jpg)
P-Med-Schema v.s. Single Mediated Schema
People domain
![Page 77: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma](https://reader037.vdocuments.us/reader037/viewer/2022110207/56649d215503460f949f6c86/html5/thumbnails/77.jpg)
System Setup Time
Car domain
![Page 78: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma](https://reader037.vdocuments.us/reader037/viewer/2022110207/56649d215503460f949f6c86/html5/thumbnails/78.jpg)
Conclusions Contributions
Many applications require data integration w. uncertainty
Applying probabilistic models to do this in a principled way:
P-mapping, P-med-schema, P-functional-dependency Build the first completely self-configuring DI system
Future work Use the probabilistic model to improve schema
mappings over time Incorporate uncertainty on keyword reformulation and
dirty data Consider dependence between data sources
![Page 79: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma](https://reader037.vdocuments.us/reader037/viewer/2022110207/56649d215503460f949f6c86/html5/thumbnails/79.jpg)
Acknowledgement Project participants
Alon Halevy (Google, UW) Anish Das Sarma (Stanford) Daisy Zhe Wang (Berkeley) Cong Yu (Yahoo!)
Other people for discussion Divesh Srivastava, Patrick Haffner (AT&T) Dan Suciu (Univ. of Washington) Jayant Madhavan, Omar Benjelloun, Shawn Jefferey,
etc. (Google)
![Page 80: Data Integration with Uncertainty Xin (Luna) Dong Data Management Dept @ AT&T Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma](https://reader037.vdocuments.us/reader037/viewer/2022110207/56649d215503460f949f6c86/html5/thumbnails/80.jpg)
Data Integration with Uncertainty
Xin (Luna) DongData Management Dept @ AT&T
Joint work w. Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma (Stanford), Daisy Zhe Wang
(Berkeley), Cong Yu (Yahoo!)