SOLOMON: SEEKING THE TRUTH VIA COPYING DETECTION
Xin Luna Dong, AT&T Labs-Research
8/2011
We Live in an Information Era
A visualization of the topology of a portion of the Internet. Web 2.0
But the Freely Accessible Information Has Its Downside
Information Propagation Becomes Much Easier with the Web Technologies
False Information Can Be Propagated (I)
UA’s bankruptcy
Chicago Tribune, 2002
Sun-Sentinel.com
Google News
Bloomberg.com
The UAL stock plummeted to $3 from $12.5
False Information Can Be Propagated (II)
Maurice Jarre (1924-2009) French Conductor and Composer
“One could say my life itself has been one long soundtrack. Music was my life, music brought me to life, and music is how I will be remembered long after I leave this life. When I die there will be a final waltz playing in my head and that only I can hear.”
2:29, 30 March 2009
False Information Can Be Propagated (III)
“[Please spread the word] From my friend living in Chiba Prefecture. The weather forecast says it will rain from Monday. People living around Chiba, please be careful. The explosion at the Cosmo oil refinery will cause harmful substance to rise to clouds and become toxic rain. So when you go out, take your umbrella or raincoat, and make sure the rain doesn’t touch your body!”
“The creator of Pokemon died today in the #tsunami, #Japan. RIP: Satoshi Tajiri. #prayforjapan.” By xCyrusAndLovato
“The Creator of Hello Kitty, Yuko Yamaguchi, died today in Japan. #prayforjapan”
Relief aid from individuals: In order to avoid confusion, we ask that you please refrain [from distributing relief supplies].
Chain letters with specific bank account information for donations are getting sent around.
Please Help Japan!
Earthquake Weapons caused Tsunami
Numerous rumors after the Japan earthquake and tsunami
False Information Can Be Propagated (IV)
Posted by Andrew Breitbart in his blog
…
The Internet needs a way to help people separate rumor from real science.
– Tim Berners-Lee
We now live in this media culture where something goes up on YouTube or a blog and everybody scrambles. - Barack Obama
Copying Can Happen on Structured Data (Copying of Weather Data)
Copying Can Be Large-Scale (Copying of AbeBooks Data)
Data collected from AbeBooks [Yin et al., 2007]
Intuitively Meaningful Clusters According to the Copying Relationships
Solomon
Goal
• Discover copying relationships between structured data sources
• Leverage the copying relationships to improve various components of data integration
Other applications
• Business purpose: data are valuable
• In-depth data analysis: information dissemination
Solomon
Outline
Copying discovery
• Local detection [VLDB’09a]
• Global detection [VLDB’10a]
• Detection w. dynamic data [VLDB’09b]
Applications in data integration
• Truth discovery [VLDB’09a][VLDB’09b]
• Query answering [VLDB’11][EDBT’11]
• Record linkage [VLDB’10b]
Visualization and decision explanation
• Visualization
• Decision explanation [VLDB’10 demo]
Problem Definition: Input
Objects: a real-world entity, described by a set of attributes
• Each associated w. a true value
Sources: each providing data for a subset of objects

Src  ISBN  Name                                            Author
S1   1     IPV6: Theory, Protocol, and Practice            Loshin, Peter
S1   2     Web Usability: A User-Centered Design Approach  Lazar, Jonathan
S2   1     IPV4: Theory, Protocol, and Practice            -
S2   2     Web Usability: A User                           Jonathan Lazar
S3   1     IPV6: Theory, Protocol, and Practice            Loshin, Peter
S3   2     Web Usability: A User                           Jonathan Lazar
S4   1     IPV6: Theory, Protocol, and Practice            Loshin
S4   2     Web Usability: A User                           Lazar

Callouts on the table: missing values; different formats; incorrect values
Formatting Patterns for Author List
Problem Definition: Output
For each S1, S2, decide the probability of S1 copying directly from S2
• A copier copies all or a subset of data
• A copier can add values and verify/modify copied values (independent contribution)
• A copier can re-format copied values (still considered as copied)
(Diagram: sources S1, S2, S3, S4)
Challenges in Copying Detection
• Sharing data may be due to both sources providing accurate data
• A copier can copy only a small fraction of data
• With only a snapshot it is hard to decide which source is a copier
• Copying relationship can be complex: co-copying, transitive copying
(Diagram: sources S1, S2, S3, S4)
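High-Level Intuitions for Copying Detection
Intuition I: decide dependence (w/o direction)
• For shared data, $\Pr(\Phi(S_1) \mid S_1 \perp S_2)$ is low, e.g., incorrect value
• $\Pr(\Phi(S_1) \mid S_1 \to S_2) \gg \Pr(\Phi(S_1) \mid S_1 \perp S_2) \Rightarrow S_1 \to S_2$
($S_1 \perp S_2$: the two sources are independent; $S_1 \to S_2$: S1 copies from S2)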
Dependence?
        Source 1 on USA Presidents   Source 2 on USA Presidents
1st     George Washington            George Washington
2nd     John Adams                   John Adams
3rd     Thomas Jefferson             Thomas Jefferson
4th     James Madison                James Madison
…
41st    George H.W. Bush             George H.W. Bush
42nd    William J. Clinton           William J. Clinton
43rd    George W. Bush               George W. Bush
44th    Barack Obama                 Barack Obama
Are Source 1 and Source 2 dependent? Not necessarily

Dependence? Common Errors
        Source 1 on USA Presidents   Source 2 on USA Presidents
1st     George Washington            George Washington
2nd     Benjamin Franklin            Benjamin Franklin
3rd     Tom Jefferson                Tom Jefferson
4th     Abraham Lincoln              Abraham Lincoln
…
41st    George W. Bush               George W. Bush
42nd    Hillary Clinton              Hillary Clinton
43rd    Mickey Mouse                 Mickey Mouse
44th    Barack Obama                 John McCain
Are Source 1 and Source 2 dependent? Very likely
High-Level Intuitions for Copying Detection
Intuition I: decide dependence (w/o direction)
• For shared data, $\Pr(\Phi(S_1) \mid S_1 \perp S_2)$ is low, e.g., incorrect data
• $\Pr(\Phi(S_1) \mid S_1 \to S_2) \gg \Pr(\Phi(S_1) \mid S_1 \perp S_2) \Rightarrow S_1 \to S_2$
Intuition II: decide copying direction
• Let F be a property function of the data (e.g., accuracy of data)
• $|F(\Phi(S_1) \cap \Phi(S_2)) - F(\Phi(S_1) - \Phi(S_2))| > |F(\Phi(S_1) \cap \Phi(S_2)) - F(\Phi(S_2) - \Phi(S_1))| \Rightarrow S_1 \to S_2$
Dependence? Different Accuracy
        Source 1 on USA Presidents   Source 2 on USA Presidents
1st     George Washington            George Washington
2nd     John Adams                   Benjamin Franklin
3rd     Thomas Jefferson             Tom Jefferson
4th     Abraham Lincoln              Abraham Lincoln
…
41st    George W. Bush               Hillary Clinton
42nd    William J. Clinton           William J. Clinton
43rd    George W. Bush               Mickey Mouse
44th    John McCain                  John McCain
Are Source 1 and Source 2 dependent? S2 more likely to be a copier

Dependence? Different Accuracy
        Source 1 on USA Presidents   Source 2 on USA Presidents
1st     George Washington            George Washington
2nd     John Adams                   Benjamin Franklin
3rd     Thomas Jefferson             Tom Jefferson
4th     Abraham Lincoln              Abraham Lincoln
…
41st    George W. Bush               George W. Bush
42nd    Hillary Clinton              Hillary Clinton
43rd    George W. Bush               Mickey Mouse
44th    John McCain                  John McCain
Are Source 1 and Source 2 dependent? S1 more likely to be a copier
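Bayesian Analysis: Basic
(Diagram: the data items of S1 and S2 fall into Same Values, either TRUE ($O.A_t$) or FALSE ($O.A_f$), and Different Values ($O.A_d$))
Observation: $\Phi$
Goal: $\Pr(S_1 \perp S_2 \mid \Phi)$, $\Pr(S_1 \to S_2 \mid \Phi)$ (sum up to 1)
According to the Bayes Rule, we need to know $\Pr(\Phi \mid S_1 \perp S_2)$, $\Pr(\Phi \mid S_1 \to S_2)$
Key: computing $\Pr(\Phi_{O.A} \mid S_1 \perp S_2)$, $\Pr(\Phi_{O.A} \mid S_1 \to S_2)$ for each $O.A \in S_1 \cap S_2$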
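Bayesian Analysis: Probability Computation

Pr        Independence                                                   Copying
$O.A_t$   $(1-\varepsilon)^2$                                            $(1-\varepsilon)\,c + (1-\varepsilon)^2 (1-c)$
$O.A_f$   $\frac{\varepsilon^2}{n}$                                      $\varepsilon\,c + \frac{\varepsilon^2}{n} (1-c)$
$O.A_d$   $P_d = 1 - (1-\varepsilon)^2 - \frac{\varepsilon^2}{n}$        $P_d\,(1-c)$

$\varepsilon$: error rate; $n$: #wrong-values; $c$: copy rate
Providing the same value is more probable under copying than under independence.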
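The per-item probabilities above can be combined with Bayes' rule into a posterior probability of copying. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the prior `alpha`, the default parameter values, and the assumption that data items are independent of each other are all hypothetical choices made for illustration.

```python
import math

def per_item_probs(eps, n, c):
    """Outcome probabilities for one data item.

    eps: error rate, n: number of wrong values, c: copy rate.
    Outcomes: 't' same true value, 'f' same false value, 'd' different values.
    Returns (independence, copying) as two dicts.
    """
    p_t = (1 - eps) ** 2
    p_f = eps ** 2 / n
    p_d = 1 - p_t - p_f
    indep = {"t": p_t, "f": p_f, "d": p_d}
    copying = {
        "t": (1 - eps) * c + p_t * (1 - c),
        "f": eps * c + p_f * (1 - c),
        "d": p_d * (1 - c),
    }
    return indep, copying

def copy_posterior(k_t, k_f, k_d, eps=0.2, n=100, c=0.8, alpha=0.2):
    """Posterior probability of copying given outcome counts.

    alpha is an assumed prior probability of copying (hypothetical).
    Items are treated as independent, so likelihoods multiply.
    """
    indep, copying = per_item_probs(eps, n, c)
    counts = {"t": k_t, "f": k_f, "d": k_d}
    log_indep = math.log(1 - alpha) + sum(k * math.log(indep[o]) for o, k in counts.items())
    log_copy = math.log(alpha) + sum(k * math.log(copying[o]) for o, k in counts.items())
    diff = log_indep - log_copy
    if diff > 700:  # avoid overflow in exp; the posterior is effectively 0
        return 0.0
    return 1.0 / (1.0 + math.exp(diff))

if __name__ == "__main__":
    print(copy_posterior(10, 0, 0))  # ten shared true values
    print(copy_posterior(7, 3, 0))   # three of the shared values are false
    print(copy_posterior(0, 0, 10))  # ten differing values
```

With these assumed parameters, a few shared false values move the posterior far more than many shared true values, which matches the "common errors" intuition from the presidents example.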
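Considering Source Accuracy

Pr        Independence                                             S1 Copies S2                              S2 Copies S1
$O.A_t$   $P_t = (1-\varepsilon(S_1))(1-\varepsilon(S_2))$         $(1-\varepsilon(S_2))\,c + P_t (1-c)$     $(1-\varepsilon(S_1))\,c + P_t (1-c)$
$O.A_f$   $P_f = \frac{\varepsilon(S_1)\,\varepsilon(S_2)}{n}$     $\varepsilon(S_2)\,c + P_f (1-c)$         $\varepsilon(S_1)\,c + P_f (1-c)$
$O.A_d$   $P_d = 1 - P_t - P_f$                                    $P_d (1-c)$                               $P_d (1-c)$

For $O.A_t$ and $O.A_f$, the probabilities under the two copying directions differ ($\neq$).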
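To see how source accuracy separates the two copying directions, the table can be evaluated for two sources with different error rates. This is an illustrative sketch only: the function name and the example error rates are hypothetical, and it assumes that a copied value is correct with the accuracy of the source it was copied from.

```python
def directional_probs(eps1, eps2, n, c):
    """Per-item outcome probabilities with per-source error rates.

    eps1, eps2: error rates of S1 and S2; n: number of wrong values; c: copy rate.
    Returns (independence, s1_copies_s2, s2_copies_s1), each a dict over
    't' (same true value), 'f' (same false value), 'd' (different values).
    """
    p_t = (1 - eps1) * (1 - eps2)
    p_f = eps1 * eps2 / n
    p_d = 1 - p_t - p_f
    indep = {"t": p_t, "f": p_f, "d": p_d}

    def copy_from(eps_original):
        # Assumption: a copied value inherits the error rate of the original source.
        return {
            "t": (1 - eps_original) * c + p_t * (1 - c),
            "f": eps_original * c + p_f * (1 - c),
            "d": p_d * (1 - c),
        }

    return indep, copy_from(eps2), copy_from(eps1)

# Hypothetical example: S1 is much less accurate than S2.
indep, s1_copies_s2, s2_copies_s1 = directional_probs(eps1=0.4, eps2=0.1, n=100, c=0.8)
for outcome in ("t", "f", "d"):
    print(outcome, indep[outcome], s1_copies_s2[outcome], s2_copies_s1[outcome])
```

In this example, shared true values are more probable if the inaccurate source copies from the accurate one, while shared false values are more probable in the opposite direction; differing values do not distinguish the directions.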
Correctness of Data as Evidence for Copying
(Diagram: sources S1, S2, S3, S4)
Extending the Basic Technique
• Consider correctness of data [VLDB’09a]
• Consider additional evidence [VLDB’10a]
Formatting as Evidence for Copying
(Diagram: sources S1, S2, S3, S4)
• Different formats
• SubValues
Extending the Basic Technique
• Consider correctness of data [VLDB’09a]
• Consider additional evidence [VLDB’10a]
• Consider correlated copying [VLDB’10a]
Correlated Copying

      K   A1  A2  A3  A4
O1    S   S   S   D   D
O2    S   D   S   S   D
O3    S   S   D   S   D
O4    S   S   S   D   S
O5    S   D   S   S   S
17 same values, and 8 different values

      K   A1  A2  A3  A4
O1    S   S   S   S   S
O2    S   S   S   S   S
O3    S   S   S   S   S
O4    S   D   D   D   D
O5    S   D   D   D   D
17 same values, and 8 different values: Copying

S: Two sources providing the same value
D: Two sources providing different values
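The two tables can be checked mechanically: counting cells gives the same totals, while counting per object exposes the difference. The sketch below only encodes the S/D grids from the slide; the variable and function names are hypothetical.

```python
from collections import Counter

# Rows are objects O1..O5; columns are K, A1, A2, A3, A4.
FIRST = ["SSSDD", "SDSSD", "SSDSD", "SSSDS", "SDSSS"]
SECOND = ["SSSSS", "SSSSS", "SSSSS", "SDDDD", "SDDDD"]

def cell_counts(table):
    """Return (#same, #different) over all cells of the table."""
    counts = Counter("".join(table))
    return counts["S"], counts["D"]

def fully_shared_objects(table):
    """Number of objects on which the two sources agree on every attribute."""
    return sum(1 for row in table if set(row) == {"S"})

for name, table in (("first", FIRST), ("second", SECOND)):
    print(name, cell_counts(table), fully_shared_objects(table))
```

Both grids yield 17 same and 8 different values, but only the second has objects that are shared in their entirety (three of them), which is the per-object correlation that cell counting alone cannot see.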
Extending the Basic Technique
Local Detection
• Consider correctness of data [VLDB’09a]
• Consider additional evidence [VLDB’10a]
• Consider correlated copying [VLDB’10a]
• Consider updates [VLDB’09b]
Global Detection [VLDB’10a]
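Multi-Source Copying? Co-copying? Transitive Copying?
• Multi-source copying: S1 provides {V1-V100}; the other two sources (S2, S3) provide {V51-V130} and {V1-V50, V101-V130}
• Co-copying: S1 provides {V1-V100}; the other two sources (S2, S3) provide {V21-V70} and {V1-V50}
• Transitive copying: S1 provides {V1-V100}; the other two sources (S2, S3) provide {V21-V50, V81-V100} and {V1-V50} (V81-V100 are popular values)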
Multi-Source Copying? Co-copying? Transitive Copying?
Local copying detection results for the three scenarios:
X Looking at the copying probabilities? Every pair of sources is labeled 1 in all three scenarios.
X Counting shared values? The three pairs share 50, 50, and 30 values in all three scenarios.
X Comparing the set of shared values?
  • Multi-source copying: pairs share V1-V50; V51-V100; V101-V130
  • Co-copying: pairs share V1-V50; V21-V70; V21-V50
  • Transitive copying: pairs share V1-V50; V21-V50 and V81-V100; V21-V50 (V81-V100 are popular values)
  • V21-V50 shared by 3 sources
We need to reason for each data item in a principled way!
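The shared-value counts on these slides can be reproduced directly from the value sets. The sketch below is illustrative only: the helper names are hypothetical, and because the transcript does not say which of the two non-S1 sets belongs to S2 and which to S3, the code treats them as an unordered pair.

```python
from itertools import combinations

def vals(*ranges):
    """Build a set of value ids from inclusive (lo, hi) ranges."""
    return {v for lo, hi in ranges for v in range(lo, hi + 1)}

S1 = vals((1, 100))
SCENARIOS = {
    "multi-source": [S1, vals((51, 130)), vals((1, 50), (101, 130))],
    "co-copying": [S1, vals((21, 70)), vals((1, 50))],
    "transitive": [S1, vals((21, 50), (81, 100)), vals((1, 50))],
}

def pairwise_overlap_sizes(sources):
    """Sorted sizes of the pairwise intersections."""
    return sorted(len(a & b) for a, b in combinations(sources, 2))

def shared_by_all(sources):
    """Values provided by every source."""
    return set.intersection(*sources)

for name, sources in SCENARIOS.items():
    print(name, pairwise_overlap_sizes(sources), len(shared_by_all(sources)))
```

All three scenarios give pairwise overlaps of 30, 50, and 50, so counting cannot separate them. The three-way overlap is empty for multi-source copying but contains V21-V50 in both other scenarios, so comparing sets still cannot separate co-copying from transitive copying.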
Global Copying Detection
1. Find a set of copyings R that significantly influence the rest of the copyings
   • Formulated as a maximization problem
   • Finding R is NP-complete
   • We propose a fast greedy algorithm
2. Adjust copying probability for the rest of the copyings: $P(S_1 \to S_2 \mid R)$
   • Replace $\Pr(\Phi_{O.A}(S_1) \mid S_1 \perp S_2)$ everywhere with $\Pr(\Phi_{O.A}(S_1) \mid S_1 \perp S_2, R)$, which considers sources that S1 copies from according to R and that provide the same value on O.A as S1
$\Pr(\Phi(S_1) \mid S_1 \to S_2) \gg \Pr(\Phi(S_1) \mid S_1 \perp S_2) \Rightarrow S_1 \to S_2$
Multi-Source Copying? Co-copying? Transitive Copying?
• Multi-source copying: $R=\{S_3 \to S_1\}$, $\Pr(\Phi(S_3)) = \Pr(\Phi(S_3) \mid R)$ for V101-V130
• Co-copying: $R=\{S_3 \to S_1\}$, $\Pr(\Phi(S_3)) \ll \Pr(\Phi(S_3) \mid R)$ for V21-V50
• Transitive copying: $R=\{S_3 \to S_2\}$, $\Pr(\Phi(S_3)) \ll \Pr(\Phi(S_3) \mid R)$ for V21-V50; $\Pr(\Phi(S_3))$ is high for V81-V100
Experiment Setup
• 18 weather websites
• for 30 major USA cities
• collected every 45 minutes for a day
• 33 collections, so 990 objects
• 28 distinct attributes in total
Silver Standard
Experiment Results
Measure: Precision, Recall, F-measure
• C: real copying; D: detected copying
• $P = \frac{|D \cap C|}{|D|}, \quad R = \frac{|D \cap C|}{|C|}, \quad F = \frac{2PR}{P+R}$

Methods                      Precision  Recall  F-measure
Corr (Only correctness)      .5         .43     .46
Enriched (More evidence)     1          .14     .25
Local (correlated copying)   .33        .86     .48
Global (global detection)    .79        .79     .79

Callouts:
• Transitive/co-copying not removed
• Ignoring evidence from correlated copying
• Enriched improves over Corr when true/false notion does apply
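The measures can be computed from sets of copying pairs, and the F-measure column can be rechecked from the reported precision and recall. This is a small sketch with hypothetical function names; the example pairs in the usage lines are made up for illustration and are not from the experiment.

```python
def f_measure(p, r):
    """Harmonic mean of precision and recall (0 if both are 0)."""
    return 2 * p * r / (p + r) if p + r else 0.0

def precision_recall_f(detected, real):
    """Precision, recall, and F-measure of detected copying vs. real copying."""
    detected, real = set(detected), set(real)
    hits = len(detected & real)
    p = hits / len(detected) if detected else 0.0
    r = hits / len(real) if real else 0.0
    return p, r, f_measure(p, r)

# Recompute the F-measure column from the reported precision and recall.
REPORTED = {"Corr": (0.5, 0.43), "Enriched": (1.0, 0.14),
            "Local": (0.33, 0.86), "Global": (0.79, 0.79)}
for method, (p, r) in REPORTED.items():
    print(method, round(f_measure(p, r), 2))

# Hypothetical pairs, only to show the set-based call.
print(precision_recall_f({("S1", "S2"), ("S3", "S4")}, {("S1", "S2"), ("S2", "S3")}))
```

The recomputed values round to .46, .25, .48, and .79, in agreement with the table.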
Data Integration Faces 3 Challenges
• Data Conflicts
• Instance Heterogeneity
• Structure Heterogeneity
(Illustrations: Scissors; Paper Scissors; Glue)
Existing Solutions Assume Independence of Data Sources
Structure Heterogeneity
• Schema matching
• Model management
• Query answering using views
• Information extraction
Instance Heterogeneity
• String matching (edit distance, token-based, etc.)
• Object matching (aka. record linkage, reference reconciliation, …)
Data Conflicts
• Data fusion
• Truth discovery
Assume INDEPENDENCE of data sources
Source Copying Adds A New Dimension to Data Integration
Data Fusion
• Truth discovery [VLDB’09a, VLDB’09b]
• Online data fusion [VLDB’11]
• Integrating probabilistic data
Record Linkage
• Improve record linkage
• Distinguish between wrong values and alternative representations [VLDB’10b]
Query Answering
• Query optimization [EDBT’11]
• Improve schema matching
Source Recommendation
• Recommend trustworthy, up-to-date, and independent sources
Application I. Truth Discovery: Naïve Voting

Three sources:
             S1      S2        S3
Stonebraker  MIT     Berkeley  MIT
Dewitt       MSR     MSR       UWisc
Bernstein    MSR     MSR       MSR
Carey        UCI     AT&T      BEA
Halevy       Google  Google    UW

Five sources:
             S1      S2        S3     S4     S5
Stonebraker  MIT     Berkeley  MIT    MIT    MS
Dewitt       MSR     MSR       UWisc  UWisc  UWisc
Bernstein    MSR     MSR       MSR    MSR    MSR
Carey        UCI     AT&T      BEA    BEA    BEA
Halevy       Google  Google    UW     UW     UW
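Naïve voting on the affiliation table can be written in a few lines, which makes it easy to see how the outcome shifts once S4 and S5 are added. The sketch below simply encodes the slide's table; the names `CLAIMS`, `naive_vote`, and `vote_all` are hypothetical, and ties are reported as several winners rather than broken.

```python
from collections import Counter

# Affiliations claimed by S1..S5, in source order, as on the slide.
CLAIMS = {
    "Stonebraker": ["MIT", "Berkeley", "MIT", "MIT", "MS"],
    "Dewitt": ["MSR", "MSR", "UWisc", "UWisc", "UWisc"],
    "Bernstein": ["MSR", "MSR", "MSR", "MSR", "MSR"],
    "Carey": ["UCI", "AT&T", "BEA", "BEA", "BEA"],
    "Halevy": ["Google", "Google", "UW", "UW", "UW"],
}

def naive_vote(values):
    """Return the value(s) provided by the largest number of sources."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, k in counts.items() if k == top)

def vote_all(num_sources):
    """Vote per researcher using only the first num_sources sources."""
    return {name: naive_vote(vals[:num_sources]) for name, vals in CLAIMS.items()}

print(vote_all(3))
print(vote_all(5))
```

With three sources, Carey's affiliation is a three-way tie; with five sources, every value provided by S3, S4, and S5 together wins the vote, so the answers for Dewitt, Carey, and Halevy flip to UWisc, BEA, and UW.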
Application I. Truth Discovery: Our Solution
Each round shows the copying relationship among S1-S5 next to the truth discovery for Carey (UCI from S1, AT&T from S2, BEA from S3, S4, S5), using the five-source table from the naïve voting slide.
Values shown on the diagrams, per round:
Round 1: .87, .2, .2, .99, .99, .99; annotations (1-.99*.8=.2) and (.22)
Round 2: .14, .49, .49, .49, .08, .49, .49, .49
Round 3: .12, .49, .49, .49, .06, .49, .49, .49
Round 4: .10, .48, .49, .50, .05, .49, .48, .50
Round 5: .09, .47, .49, .51, .04, .49, .47, .51
Round 13: .55, .49, .55, .49, .44, .44
Application I. Truth Discovery (Con’t)
(Diagram: a loop of three steps, labeled Step 1, Step 2, Step 3, connecting Truth Discovery, Source-accuracy Computation, and Copying Detection)
Theorem: w/o accuracy, converges
Observation: w. accuracy, converges when #objs >> #srcs
Application II. QA & Online Data Fusion [VLDB’11]
Where is AT&T Shannon Research Labs?
• Quickly find answers
• Computing probabilities
• Source ordering
Copying of AbeBooks Data
AbeBooks data set:
• 877 bookstores, 1265 CS books, 24364 listings
• Copying between 465 pairs of sources
Demo Here
Related Work
Copying detection [Sigmod’11 Tutorial]
• Texts
• Programs
• Images/Videos
• Structured sources
Data provenance [Buneman et al., PODS’08]
• Focus on effective presentation and retrieval
• Assume knowledge of provenance/lineage
Take-Aways
• Copying is common on the Web
• Copying can be detected using statistical approaches
• Knowing the copying relationship can benefit various aspects of data integration
Acknowledgements
Ken Lyons (AT&T Research)
Divesh Srivastava (AT&T Research)
Alon Halevy (Google)
Yifan Hu (AT&T Research)
Remi Zajac (AT&T Research)
Songtao Guo (AT&T Interactive)
Laure Berti-Equille (Institute of Research for Development)
Xuan Liu (Singapore National Univ.)
Xian Li (SUNY Binghamton)
Amelie Marian (Rutgers Univ.)
Anish Das Sarma (Google)
Beng Chin Ooi (Singapore National Univ.)
Ordered by the amount of time spent at AT&T
SOLOMON: SEEKING THE TRUTH VIA COPYING
DETECTION
http://www2.research.att.com/~yifanhu/SourceCopying/
What Is Missing? (a.k.a. Future Work)
Local Detection and Global Detection
• Loop copying
• Copying by category
• Summarizing copying patterns
• Exploring evidence from schemas, tuple ordering, etc.
• Scalability
• Detecting opinion influence
• Hidden sources
• Global detection for dynamic data
What is Missing (a.k.a. Future Work)
Challenges: Data Conflicts; Instance Heterogeneity; Structure Heterogeneity
Data Fusion
• Truth discovery [VLDB’09a, VLDB’09b]
• Integrating probabilistic data
Record Linkage
• Improve record linkage
• Distinguish between wrong values and alternative representations [VLDB’10b]
Query Answering
• Query optimization [Submitted]
• Improve schema matching
Source Recommendation
• Recommend trustworthy, up-to-date, and independent sources
Future Work: Explaining Copying-Detection Decisions
Provide the simplest, understandable explanation for Bayesian analysis
A copying detection decision is complex
• Why copying?
• Why a particular copying pattern (per-object copying vs. per-attribute copying)?
• Why a particular copying direction?
• Why the local decision is different from the global decision?
Answer “what-if” questions
• What if the two sources actually use the same format for those common values?
• What if there is a hidden source that S1 and S2 both copy from?
Answer “comparison” questions
• Why S1 is a copier of S2 but not a copier of S3?
• Why S1 has copied attributes “title” but not “authors”?
Experiment on Static Data [VLDB’09a]
Dataset: AbeBooks
• 877 bookstores
• 1265 CS books
• 24364 listings, w. ISBN, name, author-list
• After pre-cleaning, each book on avg has 19 listings and 4 author lists (ranges from 1-23)
Gold standard: 100 random books
• Manually check author list from book cover
Measure: Precision = #(Corr author lists) / #(All lists)
Naïve Voting and Types of Errors
Naïve voting has precision .71

Error type          Num
Missing authors     23
Additional authors  4
Mis-ordering        3
Mis-spelling        2
Incomplete names    2
Contributions of Various Components

Methods                Prec  #Rnds  Time(s)
Naïve                  .71   1      .2
Only value similarity  .74   1      .2
Only source accuracy   .79   23     1.1
Only source copying    .83   3      28.3
Copy+accu              .87   22     185.8
Copy+accu+sim          .89   18     197.5

• Precision improves by 25.4% over Naïve
• Considering copying improves the results most
• Reasonably fast
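The 25.4% figure is consistent with reading it as the relative gain of Copy+accu+sim (.89) over Naïve (.71). The slide does not spell out the computation, so this reading is an assumption:

```latex
\[
\frac{0.89 - 0.71}{0.71} = \frac{0.18}{0.71} \approx 0.254
\]
```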
Experiment on Dynamic Data [VLDB’09b]
Dataset: Manhattan restaurants
• Data crawled from 12 restaurant websites
• 8 versions: weekly from 1/22/2009 to 3/12/2009
• 5269 restaurants, 5231 appearing in the first crawling and 5251 in the last crawling
• 467 restaurants deleted from some websites, 280 closed before 3/15/2009 (Gold standard)
Measure: Precision, Recall, F-measure
• G: really closed restaurants; D: detected closed restaurants
• $P = \frac{|G \cap D|}{|D|}, \quad R = \frac{|G \cap D|}{|G|}, \quad F = \frac{2PR}{P+R}$
Discovered Copying
Copying is likely between 12 out of 66 pairs
Contributions of Various Components

          Ever-existing  Closed
Method    #Rest          Prec  Rec  F-msr  #Rnds  Time(s)
ALL       -              .60   1.0  .75    -      -
ALL2      -              .94   .34  .50    -      -
Naïve     1192           .70   .93  .80    1      158
Quality   5068           .83   .88  .85    7      637
CopyQua   5186           .86   .87  .86    6      1408
Google    -              .84   .19  .30    -      -

• Quality and CopyQua obtain high precision and recall
• Applying rules is inadequate
• Naïve missed a lot of restaurants
• Google Map listed a lot of out-of-business restaurants