An Interactive Clustering-based Approach to Integrating Source Query Interfaces on the Deep Web
Wensheng Wu (1), Clement Yu (2), AnHai Doan (1), Weiyi Meng (3)
(1) University of Illinois at Urbana-Champaign, (2) University of Illinois at Chicago, (3) SUNY at Binghamton
June 2004, Paris, France
![Page 1: An Interactive Clustering-based Approach to Integrating Source Query Interfaces on the Deep Web](https://reader036.vdocuments.us/reader036/viewer/2022062519/5681540c550346895dc209f6/html5/thumbnails/1.jpg)
![Page 2: An Interactive Clustering-based Approach to Integrating Source Query Interfaces on the Deep Web](https://reader036.vdocuments.us/reader036/viewer/2022062519/5681540c550346895dc209f6/html5/thumbnails/2.jpg)
Access Deep Web Sources
united.com, airtravel.com, delta.com, hotwire.com
![Page 3: An Interactive Clustering-based Approach to Integrating Source Query Interfaces on the Deep Web](https://reader036.vdocuments.us/reader036/viewer/2022062519/5681540c550346895dc209f6/html5/thumbnails/3.jpg)
Global Query Interface
united.com, airtravel.com, delta.com, hotwire.com
![Page 4: An Interactive Clustering-based Approach to Integrating Source Query Interfaces on the Deep Web](https://reader036.vdocuments.us/reader036/viewer/2022062519/5681540c550346895dc209f6/html5/thumbnails/4.jpg)
Constructing Global Query Interface
A unified query interface with these desired features:
- Conciseness: combine semantically similar fields over source interfaces
- Completeness: retain source-specific fields
- User-friendliness: highly related fields are close together
Two-phase integration:
- Interface Matching: identify semantically similar fields
- Interface Integration: merge the source query interfaces
![Page 5: An Interactive Clustering-based Approach to Integrating Source Query Interfaces on the Deep Web](https://reader036.vdocuments.us/reader036/viewer/2022062519/5681540c550346895dc209f6/html5/thumbnails/5.jpg)
Interface Matching – Challenges
- Field A in one interface is semantically similar to field B in another interface, but the two have nothing in common.
- When sim(A,B) = sim(A,C), which field should A match?
(Examples are shown as figures on the slide.)
![Page 6: An Interactive Clustering-based Approach to Integrating Source Query Interfaces on the Deep Web](https://reader036.vdocuments.us/reader036/viewer/2022062519/5681540c550346895dc209f6/html5/thumbnails/6.jpg)
Interface Matching – Challenges (Cont’d)
- 1:m mappings
- Determining the matching threshold
![Page 7: An Interactive Clustering-based Approach to Integrating Source Query Interfaces on the Deep Web](https://reader036.vdocuments.us/reader036/viewer/2022062519/5681540c550346895dc209f6/html5/thumbnails/7.jpg)
Common Limitations of Existing Approaches
Limitation 1: Non-hierarchical modeling
Limitation 2: Do not handle 1:m mappings, or handle them with low accuracy
Limitation 3: Do not allow limited user interaction
Detailed comparisons are given in the paper.
![Page 8: An Interactive Clustering-based Approach to Integrating Source Query Interfaces on the Deep Web](https://reader036.vdocuments.us/reader036/viewer/2022062519/5681540c550346895dc209f6/html5/thumbnails/8.jpg)
The IceQ Approach [SIGMOD-04]
- Hierarchical modeling: let’s get out of “flat” land
- “Greedy” is good: always start with the most confident matching
- Bridging effect: “a2” and “c2” might not look similar themselves, but they might both be similar to “b3”
- 1:m mappings: aggregate and is-a types
- User interaction helps in: interactive learning of the matching threshold; resolution of uncertain mappings
![Page 9: An Interactive Clustering-based Approach to Integrating Source Query Interfaces on the Deep Web](https://reader036.vdocuments.us/reader036/viewer/2022062519/5681540c550346895dc209f6/html5/thumbnails/9.jpg)
Hierarchical Modeling
A source query interface is modeled as an ordered tree.
The tree captures the ordering and grouping of fields.
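The ordered tree idea can be illustrated with a minimal Python sketch. The `Node` class, the `fields_in_order` helper, and the example flight-search form are all hypothetical and are not taken from the slides; the only point is that internal nodes record grouping and child order records field ordering.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node of an ordered tree: a group (has children) or a field (leaf)."""
    label: str
    children: list = field(default_factory=list)

    def fields_in_order(self):
        """Return leaf labels left to right, i.e. in interface order."""
        if not self.children:
            return [self.label]
        ordered = []
        for child in self.children:
            ordered.extend(child.fields_in_order())
        return ordered


# Hypothetical interface: two groups of fields plus one ungrouped field.
form = Node("flight search", [
    Node("route", [Node("from"), Node("to")]),
    Node("dates", [Node("depart"), Node("return")]),
    Node("passengers"),
])
leaf_order = form.fields_in_order()
```

Here `leaf_order` lists the fields as a user would read them, while the `route` and `dates` nodes keep the grouping that a flat field list would lose.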
![Page 10: An Interactive Clustering-based Approach to Integrating Source Query Interfaces on the Deep Web](https://reader036.vdocuments.us/reader036/viewer/2022062519/5681540c550346895dc209f6/html5/thumbnails/10.jpg)
Field Similarity Function
Each field may have a label, a name and a set of values.
The similarity sim(A,B) between two fields, A and B, is evaluated based on:
- Linguistic similarity: label similarity, name similarity and name vs. label similarity, each measured by the Cosine function
- Domain similarity: domain type similarity and domain value similarity
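As a rough illustration of the linguistic part, the sketch below computes cosine similarity over word-count vectors and combines the three linguistic components. The whitespace tokenization, the dictionary layout of a field, and the weights are assumptions made for this example; the slides do not specify them, and the domain-similarity part is left out.

```python
import math
from collections import Counter


def cosine(text_a, text_b):
    """Cosine similarity between word-count vectors of two strings."""
    va, vb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0


def linguistic_similarity(a, b, weights=(0.5, 0.25, 0.25)):
    """Weighted mix of label, name, and name-vs-label similarity.

    The weights are hypothetical placeholders, not values from the slides.
    """
    w_label, w_name, w_cross = weights
    cross = max(cosine(a["name"], b["label"]), cosine(a["label"], b["name"]))
    return (w_label * cosine(a["label"], b["label"])
            + w_name * cosine(a["name"], b["name"])
            + w_cross * cross)


# Hypothetical fields from two airline interfaces.
field_a = {"label": "departure city", "name": "from city"}
field_b = {"label": "arrival city", "name": "to city"}
score = linguistic_similarity(field_a, field_b)
```

Both fields share only the word "city", so every component is about 0.5 and the combined score lands near 0.5 as well.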
![Page 11: An Interactive Clustering-based Approach to Integrating Source Query Interfaces on the Deep Web](https://reader036.vdocuments.us/reader036/viewer/2022062519/5681540c550346895dc209f6/html5/thumbnails/11.jpg)
Find 1:1 Mappings via Clustering
[Figure: interfaces, initial similarity matrix, and the matrix after one merge]
Threshold = .3
Final clusters: {{a1,b1,c1}, {b2,c2}, {a2}, {b3}}
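A minimal sketch of greedy clustering under a threshold follows. The similarity values are invented so that the run ends in the final clusters shown on the slide; the original matrix is not recoverable from the transcript. Two further assumptions are labeled in the code: cluster similarity is taken as the maximum pairwise similarity, and clusters that share an interface are never merged.

```python
from itertools import combinations


def greedy_cluster(interface_of, sim, threshold):
    """Repeatedly merge the most similar pair of clusters above the threshold."""
    clusters = [frozenset([f]) for f in interface_of]

    def cluster_sim(c1, c2):
        # Assumption: two fields of the same interface cannot be matched 1:1.
        if {interface_of[f] for f in c1} & {interface_of[f] for f in c2}:
            return 0.0
        # Assumption: single-link (maximum pairwise) cluster similarity.
        return max(sim.get(frozenset((f, g)), 0.0) for f in c1 for g in c2)

    while len(clusters) > 1:
        best = max(combinations(clusters, 2), key=lambda p: cluster_sim(*p))
        if cluster_sim(*best) <= threshold:
            break  # the most confident remaining merge is too weak
        clusters = [c for c in clusters if c not in best] + [best[0] | best[1]]
    return set(clusters)


interface_of = {"a1": "A", "a2": "A", "b1": "B", "b2": "B", "b3": "B",
                "c1": "C", "c2": "C"}
# Hypothetical similarity values; unlisted pairs default to 0.
sim = {frozenset(p): s for p, s in {
    ("a1", "b1"): 0.8, ("b1", "c1"): 0.7, ("a1", "c1"): 0.6,
    ("b2", "c2"): 0.5, ("b3", "c2"): 0.25, ("a2", "b3"): 0.2,
    ("a2", "c2"): 0.1}.items()}
final_clusters = greedy_cluster(interface_of, sim, threshold=0.3)
```

With these values the merges happen in order of confidence (a1 with b1, then c1, then b2 with c2), and a2 and b3 stay unmatched because their best remaining similarity of 0.2 is below the threshold.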
![Page 12: An Interactive Clustering-based Approach to Integrating Source Query Interfaces on the Deep Web](https://reader036.vdocuments.us/reader036/viewer/2022062519/5681540c550346895dc209f6/html5/thumbnails/12.jpg)
“Bridging” Effect
[Figure: fields A, B, and C; can A be matched with B?]
Observations:
- It is difficult to match the “vehicle” field, A, with the “make” field, B
- But A’s instances are similar to C’s, and C’s label is similar to B’s
- Thus, C might serve as a “bridge” to connect A and B!
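The bridging idea can be shown with a deliberately simplified sketch: fields are linked whenever their similarity exceeds a threshold, and linked fields end up in one group even if a particular pair is dissimilar. This uses a plain union-find over hypothetical scores and ignores the greedy merge order, so it is an illustration of the effect rather than the method in the paper.

```python
def bridge_groups(sims, threshold):
    """Group fields connected by any chain of above-threshold similarities."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (f, g), s in sims.items():
        root_f, root_g = find(f), find(g)
        if s > threshold and root_f != root_g:
            parent[root_f] = root_g

    groups = {}
    for f in parent:
        groups.setdefault(find(f), set()).add(f)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical scores: A and B look dissimilar, but both resemble C.
with_bridge = bridge_groups(
    {("A", "B"): 0.1, ("A", "C"): 0.6, ("C", "B"): 0.7}, threshold=0.3)
without_bridge = bridge_groups({("A", "B"): 0.1}, threshold=0.3)
```

With C present, A and B fall into the same group; without it they remain separate.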
![Page 13: An Interactive Clustering-based Approach to Integrating Source Query Interfaces on the Deep Web](https://reader036.vdocuments.us/reader036/viewer/2022062519/5681540c550346895dc209f6/html5/thumbnails/13.jpg)
“Bridging” Effect (Cont’d)
Connections might also be made via labels
[Figure: interfaces from hotfares.com, airtickets.com, airtravel.com]
![Page 14: An Interactive Clustering-based Approach to Integrating Source Query Interfaces on the Deep Web](https://reader036.vdocuments.us/reader036/viewer/2022062519/5681540c550346895dc209f6/html5/thumbnails/14.jpg)
Field Ordering-based Tie Resolution
[Figure: fields A1, A2 in one interface and B1, B2 in another; every similarity shown is 0.35]
Question: sim(A1, B1) = sim(A1, B2), which one should A1 match?
Observation: the ordering of fields conveys semantics!
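One way to act on this observation is sketched below: among candidates tied at the highest similarity, pick the one whose position in its interface is closest to the position of the field being matched. The slide only states that ordering conveys semantics; this particular position-distance rule and the `resolve_tie` helper are assumptions.

```python
def resolve_tie(field_position, candidates):
    """Pick the top-scoring candidate nearest in position to the field.

    candidates: list of (name, position_in_its_interface, similarity).
    """
    best = max(s for _, _, s in candidates)
    tied = [(name, pos) for name, pos, s in candidates if s == best]
    return min(tied, key=lambda item: abs(item[1] - field_position))[0]


# A1 and A2 are the first and second fields of one interface,
# B1 and B2 the first and second fields of another.
candidates = [("B1", 0, 0.35), ("B2", 1, 0.35)]
match_for_a1 = resolve_tie(0, candidates)
match_for_a2 = resolve_tie(1, candidates)
```

Under this rule A1 pairs with B1 and A2 with B2, which mirrors how paired fields such as origin and destination usually appear in the same order across interfaces.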
![Page 15: An Interactive Clustering-based Approach to Integrating Source Query Interfaces on the Deep Web](https://reader036.vdocuments.us/reader036/viewer/2022062519/5681540c550346895dc209f6/html5/thumbnails/15.jpg)
Complex Mappings
Aggregate type – contents of the fields on the many side are parts of the content of the field on the one side
Commonalities – (1) field proximity, (2) parent label similarity, and (3) value characteristics
![Page 16: An Interactive Clustering-based Approach to Integrating Source Query Interfaces on the Deep Web](https://reader036.vdocuments.us/reader036/viewer/2022062519/5681540c550346895dc209f6/html5/thumbnails/16.jpg)
Complex Mappings (Cont’d)
Is-a type – the content of the field on the one side is the sum/union of the contents of the fields on the many side
Commonalities – (1) field proximity, (2) parent label similarity, and (3) value characteristics
![Page 17: An Interactive Clustering-based Approach to Integrating Source Query Interfaces on the Deep Web](https://reader036.vdocuments.us/reader036/viewer/2022062519/5681540c550346895dc209f6/html5/thumbnails/17.jpg)
Complex Mappings (Cont’d)
Final 1:m phase infers new mappings:
- Preliminary 1:m phase: a1 ↔ (b1, b2)
- Clustering phase: b1 ↔ c1, b2 ↔ c2
- Final 1:m phase: a1 ↔ (c1, c2)
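The inference in the final phase can be sketched as a simple composition: take a preliminary 1:m mapping and replace each many-side field with its 1:1 counterpart in another interface. The function name, the data layout, and the rule that every many-side field needs exactly one counterpart are assumptions made for this example.

```python
def infer_final_mappings(prelim, clusters, interface_of):
    """Carry 1:m mappings over to other interfaces through 1:1 clusters."""
    inferred = {}
    for one, many in prelim.items():
        covered = {interface_of[one]} | {interface_of[m] for m in many}
        for iface in sorted(set(interface_of.values()) - covered):
            counterparts = []
            for m in many:
                cluster = next((c for c in clusters if m in c), set())
                match = [f for f in cluster if interface_of[f] == iface]
                if len(match) != 1:
                    break  # no unambiguous counterpart, so infer nothing
                counterparts.append(match[0])
            else:
                inferred[(one, iface)] = tuple(counterparts)
    return inferred


interface_of = {"a1": "A", "b1": "B", "b2": "B", "c1": "C", "c2": "C"}
prelim = {"a1": ("b1", "b2")}            # preliminary 1:m phase
clusters = [{"b1", "c1"}, {"b2", "c2"}]  # clustering phase
final = infer_final_mappings(prelim, clusters, interface_of)
```

For the example on the slide this yields a1 mapped to (c1, c2) in interface C.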
![Page 18: An Interactive Clustering-based Approach to Integrating Source Query Interfaces on the Deep Web](https://reader036.vdocuments.us/reader036/viewer/2022062519/5681540c550346895dc209f6/html5/thumbnails/18.jpg)
Active Learning of Thresholds
Observation: In an ideal situation,
- if field A matches some field X, then sim(A, X) > threshold T1
- if field A does not match any field, then max{sim(A, C)} over all fields C is < T2, where T2 < T1
Similarity lists (descending):
List 1: .91, .8, .73, .62, .46, .2, .03
List 2: .62, .53, .5, .48, .46, .32, .1
List 3: .87, .82, .6, .53, .5, .33, .28
Initial B: [0, .4]
Drop rule: 50%
List 1: (1) question on .2, answer yes, update B = [0, .2], continue on list 1; (2) question on .03, answer no, update B = [.03, .2]
List 2: question on .1, answer yes, update B = [.03, .1]
List 3: no values within B
Threshold set to any value between .03 and .1
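The walk-through above can be reproduced with a short sketch. It rests on one reading of the slide, which is an assumption: a question is asked on a value only if it lies strictly inside the current interval B and follows a drop of at least 50% from the previous value in its list; a "yes" lowers the upper bound to that value and a "no" raises the lower bound. The user's answers are simulated with a dictionary.

```python
def learn_threshold_bounds(lists, is_match, bounds=(0.0, 0.4), drop=0.5):
    """Narrow the interval B = [lo, hi] that must contain the threshold."""
    lo, hi = bounds
    asked = []
    for values in lists:
        for prev, cur in zip(values, values[1:]):
            big_drop = prev > 0 and (prev - cur) / prev >= drop
            if big_drop and lo < cur < hi:
                asked.append(cur)
                if is_match(cur):
                    hi = cur  # a true match: the threshold cannot exceed it
                else:
                    lo = cur  # not a match: the threshold must be above it
    return (lo, hi), asked


list1 = [0.91, 0.8, 0.73, 0.62, 0.46, 0.2, 0.03]
list2 = [0.62, 0.53, 0.5, 0.48, 0.46, 0.32, 0.1]
list3 = [0.87, 0.82, 0.6, 0.53, 0.5, 0.33, 0.28]
answers = {0.2: True, 0.03: False, 0.1: True}  # the user's replies on the slide
bounds, asked = learn_threshold_bounds(
    [list1, list2, list3], lambda v: answers[v])
```

Three questions are asked in total, and the interval shrinks from [0, .4] to [.03, .1], as in the example.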
![Page 19: An Interactive Clustering-based Approach to Integrating Source Query Interfaces on the Deep Web](https://reader036.vdocuments.us/reader036/viewer/2022062519/5681540c550346895dc209f6/html5/thumbnails/19.jpg)
Interactive Resolution of Uncertain Mappings
Resolve potential homonyms
- Observation: two fields are possible homonyms if their labels are highly similar while their domains are not.
Determine potential synonyms
- Observation: two fields might still be similar if there are common values in their domains, even if their label/domain similarities are low.
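These two observations amount to simple screening rules that decide when to ask the user. The sketch below encodes them with hypothetical cutoffs (`high` and `low`); the slides give no numeric values, so the numbers are placeholders.

```python
def flag_uncertain(label_sim, domain_sim, shared_values, high=0.8, low=0.3):
    """Return a reason to ask the user about a field pair, or None.

    high and low are hypothetical cutoffs, not values from the slides.
    """
    if label_sim >= high and domain_sim <= low:
        return "possible homonym"   # similar labels, dissimilar domains
    if label_sim <= low and domain_sim <= low and shared_values:
        return "possible synonym"   # low scores but overlapping values
    return None


homonym_case = flag_uncertain(0.9, 0.1, shared_values=False)
synonym_case = flag_uncertain(0.2, 0.2, shared_values=True)
clear_case = flag_uncertain(0.9, 0.9, shared_values=True)
```

Only the flagged pairs would be shown to the user, which keeps the number of questions small.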
![Page 20: An Interactive Clustering-based Approach to Integrating Source Query Interfaces on the Deep Web](https://reader036.vdocuments.us/reader036/viewer/2022062519/5681540c550346895dc209f6/html5/thumbnails/20.jpg)
Interactive Resolution of Uncertain Mappings
Determine potential 1:m mappings
- Observation: A might still match with B and C if (a) sim(A,B) is very close to sim(A,C); (b) B and C are adjacent; and (c) A is the only field in its interface which satisfies (a) and (b)
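Conditions (a) to (c) can be checked mechanically, as in the sketch below. The tolerance `epsilon` for "very close", the requirement that both similarities be nonzero, and the field names are assumptions introduced for the example.

```python
def potential_one_to_many(sims, order, epsilon=0.05):
    """Find fields A that may map to an adjacent pair (B, C).

    sims: {field_in_X: {field_in_Y: similarity}}
    order: fields of interface Y in their on-screen order.
    """
    found = []
    for b, c in zip(order, order[1:]):            # (b) B and C are adjacent
        holders = [a for a, row in sims.items()
                   if row.get(b, 0) > 0 and row.get(c, 0) > 0
                   and abs(row[b] - row[c]) <= epsilon]  # (a) close scores
        if len(holders) == 1:                     # (c) A is the only such field
            found.append((holders[0], b, c))
    return found


# Hypothetical similarities between interface X (keys) and interface Y.
sims = {
    "passengers": {"adults": 0.40, "children": 0.38},
    "class": {"adults": 0.0, "children": 0.05},
}
suggestions = potential_one_to_many(sims, order=["adults", "children"])
```

Each suggestion would then be put to the user as a question rather than accepted automatically.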
![Page 21: An Interactive Clustering-based Approach to Integrating Source Query Interfaces on the Deep Web](https://reader036.vdocuments.us/reader036/viewer/2022062519/5681540c550346895dc209f6/html5/thumbnails/21.jpg)
Empirical Evaluations
Automatic field matching
Accuracy with learned thresholds
Distribution of questions
Accuracy with all user interactions
![Page 22: An Interactive Clustering-based Approach to Integrating Source Query Interfaces on the Deep Web](https://reader036.vdocuments.us/reader036/viewer/2022062519/5681540c550346895dc209f6/html5/thumbnails/22.jpg)
Comparison of Component Contributions
On average, 12.6% increase in recall
15.4%
7.3%
![Page 23: An Interactive Clustering-based Approach to Integrating Source Query Interfaces on the Deep Web](https://reader036.vdocuments.us/reader036/viewer/2022062519/5681540c550346895dc209f6/html5/thumbnails/23.jpg)
Summary
High accuracy in determining matching fields across multiple user interfaces
Limited use of user interactions
![Page 24: An Interactive Clustering-based Approach to Integrating Source Query Interfaces on the Deep Web](https://reader036.vdocuments.us/reader036/viewer/2022062519/5681540c550346895dc209f6/html5/thumbnails/24.jpg)
Future Research
Further improve the accuracy of determining matching fields
Decrease the number of user interactions
Produce a unified, user-friendly interface
Provide such a tool on the Web