![Page 1: AnHai Doan Database & Data Mining Group University of Washington, Seattle Spring 2002 Learning to Map between Structured Representations of Data](https://reader034.vdocuments.us/reader034/viewer/2022052701/56649dab5503460f94a9a598/html5/thumbnails/1.jpg)
AnHai Doan
Database & Data Mining Group
University of Washington, Seattle
Spring 2002
Learning to Map between Structured Representations of Data
![Page 2: AnHai Doan Database & Data Mining Group University of Washington, Seattle Spring 2002 Learning to Map between Structured Representations of Data](https://reader034.vdocuments.us/reader034/viewer/2022052701/56649dab5503460f94a9a598/html5/thumbnails/2.jpg)
2
Data Integration Challenge
New faculty member
Find houses with 2 bedrooms priced under 200K
homes.com, realestate.com, homeseekers.com
![Page 3: AnHai Doan Database & Data Mining Group University of Washington, Seattle Spring 2002 Learning to Map between Structured Representations of Data](https://reader034.vdocuments.us/reader034/viewer/2022052701/56649dab5503460f94a9a598/html5/thumbnails/3.jpg)
3
Architecture of Data Integration System
Find houses with 2 bedrooms priced under 200K
mediated schema
source schema 1, source schema 2, source schema 3
homes.com, realestate.com, homeseekers.com
![Page 4: AnHai Doan Database & Data Mining Group University of Washington, Seattle Spring 2002 Learning to Map between Structured Representations of Data](https://reader034.vdocuments.us/reader034/viewer/2022052701/56649dab5503460f94a9a598/html5/thumbnails/4.jpg)
4
Semantic Mappings between Schemas
Mediated schema: price, agent-name, address

homes.com

| listed-price | contact-name | city | state |
|---|---|---|---|
| 320K | Jane Brown | Seattle | WA |
| 240K | Mike Smith | Miami | FL |

Mapping types: 1-1 mapping, complex mapping
![Page 5: AnHai Doan Database & Data Mining Group University of Washington, Seattle Spring 2002 Learning to Map between Structured Representations of Data](https://reader034.vdocuments.us/reader034/viewer/2022052701/56649dab5503460f94a9a598/html5/thumbnails/5.jpg)
5
Schema Matching is Ubiquitous!
Fundamental problem in numerous applications
Databases
– data integration
– data translation
– schema/view integration
– data warehousing
– semantic query processing
– model management
– peer data management
AI
– knowledge bases, ontology merging, information gathering agents, ...
Web
– e-commerce
– marking up data using ontologies (Semantic Web)
![Page 6: AnHai Doan Database & Data Mining Group University of Washington, Seattle Spring 2002 Learning to Map between Structured Representations of Data](https://reader034.vdocuments.us/reader034/viewer/2022052701/56649dab5503460f94a9a598/html5/thumbnails/6.jpg)
6
Why Schema Matching is Difficult
Schema & data never fully capture semantics!
– not adequately documented
Must rely on clues in schema & data
– using names, structures, types, data values, etc.
Such clues can be unreliable
– same names => different entities: area => location or square-feet
– different names => same entity: area & address => location
Intended semantics can be subjective
– house-style = house-description?
Cannot be fully automated, needs user feedback!
![Page 7: AnHai Doan Database & Data Mining Group University of Washington, Seattle Spring 2002 Learning to Map between Structured Representations of Data](https://reader034.vdocuments.us/reader034/viewer/2022052701/56649dab5503460f94a9a598/html5/thumbnails/7.jpg)
7
Current State of Affairs
Finding semantic mappings is now a key bottleneck!
– largely done by hand
– labor intensive & error prone
– data integration at GTE [Li&Clifton, 2000]
  – 40 databases, 27000 elements, estimated time: 12 years
Will only be exacerbated
– data sharing becomes pervasive
– translation of legacy data
Need semi-automatic approaches to scale up!
Many current research projects
– Databases: IBM Almaden, Microsoft Research, BYU, George Mason, U of Leipzig, ...
– AI: Stanford, Karlsruhe University, NEC Japan, ...
![Page 8: AnHai Doan Database & Data Mining Group University of Washington, Seattle Spring 2002 Learning to Map between Structured Representations of Data](https://reader034.vdocuments.us/reader034/viewer/2022052701/56649dab5503460f94a9a598/html5/thumbnails/8.jpg)
8
Goals and Contributions
Vision for schema-matching tools
– learn from previous matching activities
– exploit multiple types of information
– incorporate domain integrity constraints
– handle user feedback
My contributions: solution for semi-automatic schema matching
– can match relational schemas, DTDs, ontologies, ...
– discovers both 1-1 & complex mappings
– highly modular & extensible
– achieves high matching accuracy (66 - 97%) on real-world data
![Page 9: AnHai Doan Database & Data Mining Group University of Washington, Seattle Spring 2002 Learning to Map between Structured Representations of Data](https://reader034.vdocuments.us/reader034/viewer/2022052701/56649dab5503460f94a9a598/html5/thumbnails/9.jpg)
9
Road Map
Introduction
Schema matching [SIGMOD-01]
– 1-1 mappings for data integration
– LSD (Learning Source Description) system
  – learns from previous matching activities
  – employs multi-strategy learning
  – exploits domain constraints & user feedback
Creating complex mappings [Tech. Report-02]
Ontology matching [WWW-02]
Conclusions
![Page 10: AnHai Doan Database & Data Mining Group University of Washington, Seattle Spring 2002 Learning to Map between Structured Representations of Data](https://reader034.vdocuments.us/reader034/viewer/2022052701/56649dab5503460f94a9a598/html5/thumbnails/10.jpg)
10
Schema Matching for Data Integration: the LSD Approach
Suppose user wants to integrate 100 data sources
1. User
– manually creates mappings for a few sources, say 3
– shows LSD these mappings
2. LSD learns from the mappings
3. LSD predicts mappings for remaining 97 sources
![Page 11: AnHai Doan Database & Data Mining Group University of Washington, Seattle Spring 2002 Learning to Map between Structured Representations of Data](https://reader034.vdocuments.us/reader034/viewer/2022052701/56649dab5503460f94a9a598/html5/thumbnails/11.jpg)
11
Learning from the Manual Mappings
Mediated schema: price, agent-name, agent-phone, office-phone, description
Schema of realestate.com: listed-price, contact-name, contact-phone, office, comments

realestate.com

| listed-price | contact-name | contact-phone | office | comments |
|---|---|---|---|---|
| $250K | James Smith | (305) 729 0831 | (305) 616 1822 | Fantastic house |
| $320K | Mike Doan | (617) 253 1429 | (617) 112 2315 | Great location |

If “fantastic” & “great” occur frequently in data instances => description

homes.com

| sold-at | contact-agent | extra-info |
|---|---|---|
| $350K | (206) 634 9435 | Beautiful yard |
| $230K | (617) 335 4243 | Close to Seattle |

If “office” occurs in name => office-phone
![Page 12: AnHai Doan Database & Data Mining Group University of Washington, Seattle Spring 2002 Learning to Map between Structured Representations of Data](https://reader034.vdocuments.us/reader034/viewer/2022052701/56649dab5503460f94a9a598/html5/thumbnails/12.jpg)
12
Must Exploit Multiple Types of Information!
(same mediated schema, realestate.com and homes.com example as the previous slide)
– If “fantastic” & “great” occur frequently in data instances => description
– If “office” occurs in name => office-phone
![Page 13: AnHai Doan Database & Data Mining Group University of Washington, Seattle Spring 2002 Learning to Map between Structured Representations of Data](https://reader034.vdocuments.us/reader034/viewer/2022052701/56649dab5503460f94a9a598/html5/thumbnails/13.jpg)
13
Multi-Strategy Learning
Use a set of base learners
– each exploits well certain types of information
To match a schema element of a new source
– apply base learners
– combine their predictions using a meta-learner
Meta-learner
– uses training sources to measure base learner accuracy
– weighs each learner based on its accuracy
![Page 14: AnHai Doan Database & Data Mining Group University of Washington, Seattle Spring 2002 Learning to Map between Structured Representations of Data](https://reader034.vdocuments.us/reader034/viewer/2022052701/56649dab5503460f94a9a598/html5/thumbnails/14.jpg)
14
Base Learners
Training
– training examples (X1,C1), (X2,C2), ..., (Xm,Cm), where each Xi is an object and Ci its observed label
– produces a classification model (hypothesis)
Matching
– given an object X, the model returns labels weighted by confidence score
Name Learner
– training: (“location”, address), (“contact name”, name)
– matching: agent-name => (name,0.7), (phone,0.3)
Naive Bayes Learner
– training: (“Seattle, WA”, address), (“250K”, price)
– matching: “Kent, WA” => (address,0.8), (name,0.2)
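A Naive Bayes base learner of this kind can be sketched in a few lines. This is a minimal illustration, not the LSD implementation: the tokenizer (words plus single symbols), the add-one smoothing, and the function names `tokenize`, `train`, and `predict` are all assumptions made for the example. The training pairs are the ones shown on the Training the Base Learners slide.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    # Assumed tokenization: runs of word characters, plus each symbol on its own.
    return re.findall(r"\w+|[^\w\s]", text)

def train(examples):
    """Count labels and per-label token frequencies from (text, label) pairs."""
    label_counts = Counter(label for _, label in examples)
    token_counts = defaultdict(Counter)
    for text, label in examples:
        token_counts[label].update(tokenize(text))
    vocabulary = {t for counts in token_counts.values() for t in counts}
    return label_counts, token_counts, vocabulary

def predict(model, text):
    """Return {label: confidence}, with confidences summing to 1."""
    label_counts, token_counts, vocabulary = model
    total_examples = sum(label_counts.values())
    log_scores = {}
    for label, count in label_counts.items():
        total_tokens = sum(token_counts[label].values())
        log_score = math.log(count / total_examples)
        for token in tokenize(text):
            # Add-one smoothing (an assumption) so unseen tokens do not zero out a label.
            p = (token_counts[label][token] + 1) / (total_tokens + len(vocabulary))
            log_score += math.log(p)
        log_scores[label] = log_score
    peak = max(log_scores.values())
    unnormalized = {l: math.exp(s - peak) for l, s in log_scores.items()}
    z = sum(unnormalized.values())
    return {l: v / z for l, v in unnormalized.items()}

examples = [
    ("Miami, FL", "address"), ("$250K", "price"), ("James Smith", "agent-name"),
    ("(305) 729 0831", "agent-phone"), ("(305) 616 1822", "office-phone"),
    ("Fantastic house", "description"), ("Boston,MA", "address"),
]
model = train(examples)
scores = predict(model, "Kent, WA")
print(max(scores, key=scores.get))  # address
```

With so few training pairs the comma token is what tips “Kent, WA” toward address; the confidence values themselves will not match the illustrative numbers on the slide.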
![Page 15: AnHai Doan Database & Data Mining Group University of Washington, Seattle Spring 2002 Learning to Map between Structured Representations of Data](https://reader034.vdocuments.us/reader034/viewer/2022052701/56649dab5503460f94a9a598/html5/thumbnails/15.jpg)
15
The LSD Architecture
Training Phase
– inputs: mediated schema, source schemas
– training data for base learners
– Base-Learner1 ... Base-Learnerk produce Hypothesis1 ... Hypothesisk
– Meta-Learner produces weights for base learners
Matching Phase
– Base-Learner1 ... Base-Learnerk and Meta-Learner produce predictions for instances
– Prediction Combiner produces predictions for elements
– Constraint Handler, using domain constraints, produces mappings
![Page 16: AnHai Doan Database & Data Mining Group University of Washington, Seattle Spring 2002 Learning to Map between Structured Representations of Data](https://reader034.vdocuments.us/reader034/viewer/2022052701/56649dab5503460f94a9a598/html5/thumbnails/16.jpg)
16
Training the Base Learners
Mediated schema: address, price, agent-name, agent-phone, office-phone, description

realestate.com

| location | price | contact-name | contact-phone | office | comments |
|---|---|---|---|---|---|
| Miami, FL | $250K | James Smith | (305) 729 0831 | (305) 616 1822 | Fantastic house |
| Boston, MA | $320K | Mike Doan | (617) 253 1429 | (617) 112 2315 | Great location |

Name Learner
– (“location”, address)
– (“price”, price)
– (“contact name”, agent-name)
– (“contact phone”, agent-phone)
– (“office”, office-phone)
– (“comments”, description)
Naive Bayes Learner
– (“Miami, FL”, address)
– (“$250K”, price)
– (“James Smith”, agent-name)
– (“(305) 729 0831”, agent-phone)
– (“(305) 616 1822”, office-phone)
– (“Fantastic house”, description)
– (“Boston,MA”, address)
![Page 17: AnHai Doan Database & Data Mining Group University of Washington, Seattle Spring 2002 Learning to Map between Structured Representations of Data](https://reader034.vdocuments.us/reader034/viewer/2022052701/56649dab5503460f94a9a598/html5/thumbnails/17.jpg)
17
Meta-Learner: Stacking [Wolpert 92, Ting&Witten 99]
Training
– uses training data to learn weights
  – one for each (base-learner, mediated-schema element) pair
  – weight(Name-Learner, address) = 0.2
  – weight(Naive-Bayes, address) = 0.8
Matching: combine predictions of base learners
– computes weighted average of base-learner confidence scores
Example: source element area, with instances Seattle, WA; Kent, WA; Bend, OR
– Name Learner: (address,0.4)
– Naive Bayes: (address,0.9)
– Meta-Learner: (address, 0.4*0.2 + 0.9*0.8 = 0.8)
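The weighted combination on this slide can be reproduced with a short sketch. The function name `combine` and the dictionary keys are illustrative choices, not names from LSD; the weights and confidence scores are the ones shown above.

```python
def combine(predictions, weights):
    """Weighted sum of base-learner confidence scores for one label."""
    return sum(weights[learner] * score for learner, score in predictions.items())

# Learned weights for the mediated-schema element "address" (values from the slide).
weights_address = {"name_learner": 0.2, "naive_bayes": 0.8}
# Each base learner's confidence that the source element "area" matches "address".
predictions_address = {"name_learner": 0.4, "naive_bayes": 0.9}

score = combine(predictions_address, weights_address)
print(round(score, 2))  # 0.8
```

Because the weights are learned per (base-learner, mediated-schema element) pair, a full system would hold one such weight dictionary for every mediated-schema element.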
![Page 18: AnHai Doan Database & Data Mining Group University of Washington, Seattle Spring 2002 Learning to Map between Structured Representations of Data](https://reader034.vdocuments.us/reader034/viewer/2022052701/56649dab5503460f94a9a598/html5/thumbnails/18.jpg)
18
The LSD Architecture
[Architecture diagram repeated from slide 15: Training Phase and Matching Phase]
![Page 19: AnHai Doan Database & Data Mining Group University of Washington, Seattle Spring 2002 Learning to Map between Structured Representations of Data](https://reader034.vdocuments.us/reader034/viewer/2022052701/56649dab5503460f94a9a598/html5/thumbnails/19.jpg)
19
Applying the Learners
homes.com schema: area, sold-at, contact-agent, extra-info
area (instances: Seattle, WA; Kent, WA; Bend, OR)
– Name Learner and Naive Bayes, combined by the Meta-Learner, give one prediction per instance
  – (address,0.8), (description,0.2)
  – (address,0.6), (description,0.4)
  – (address,0.7), (description,0.3)
– Prediction-Combiner: (address,0.7), (description,0.3)
sold-at
– (price,0.9), (agent-phone,0.1)
contact-agent
– (agent-phone,0.9), (description,0.1)
extra-info
– (address,0.6), (description,0.4)
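The slide does not say how the Prediction-Combiner merges the per-instance predictions into one prediction for the element. A plain average is assumed in the sketch below because it reproduces the numbers shown for area; the function name `combine_instance_predictions` is hypothetical.

```python
from collections import defaultdict

def combine_instance_predictions(instance_predictions):
    """Average per-instance {label: confidence} dicts into one element-level dict."""
    totals = defaultdict(float)
    for prediction in instance_predictions:
        for label, score in prediction.items():
            totals[label] += score
    n = len(instance_predictions)
    return {label: total / n for label, total in totals.items()}

# Meta-learner output for the three instances of "area" (values from the slide).
area_instances = [
    {"address": 0.8, "description": 0.2},
    {"address": 0.6, "description": 0.4},
    {"address": 0.7, "description": 0.3},
]
area_prediction = combine_instance_predictions(area_instances)
print({label: round(score, 2) for label, score in area_prediction.items()})
```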
![Page 20: AnHai Doan Database & Data Mining Group University of Washington, Seattle Spring 2002 Learning to Map between Structured Representations of Data](https://reader034.vdocuments.us/reader034/viewer/2022052701/56649dab5503460f94a9a598/html5/thumbnails/20.jpg)
20
Domain Constraints
Encode user knowledge about domain
Specified by examining mediated schema
Examples
– at most one source-schema element can match address
– if a source-schema element matches house-id then it is a key
– avg-value(price) > avg-value(num-baths)
Given a mapping combination
– can verify if it satisfies a given constraint
– example: area: address, sold-at: price, contact-agent: agent-phone, extra-info: address
![Page 21: AnHai Doan Database & Data Mining Group University of Washington, Seattle Spring 2002 Learning to Map between Structured Representations of Data](https://reader034.vdocuments.us/reader034/viewer/2022052701/56649dab5503460f94a9a598/html5/thumbnails/21.jpg)
21
The Constraint Handler
Predictions from Prediction Combiner
– area: (address,0.7), (description,0.3)
– sold-at: (price,0.9), (agent-phone,0.1)
– contact-agent: (agent-phone,0.9), (description,0.1)
– extra-info: (address,0.6), (description,0.4)
Domain Constraints
– at most one element matches address
Mapping combinations, scored by the product of their confidences
– area: address, sold-at: price, contact-agent: agent-phone, extra-info: address: 0.7 * 0.9 * 0.9 * 0.6 = 0.3402 (two elements match address, so the constraint is violated)
– area: address, sold-at: price, contact-agent: agent-phone, extra-info: description: 0.7 * 0.9 * 0.9 * 0.4 = 0.2268
– 0.3 * 0.1 * 0.1 * 0.4 = 0.0012
Searches space of mapping combinations efficiently
Can handle arbitrary constraints
Also used to incorporate user feedback
– sold-at does not match price
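The effect of a constraint on the chosen mapping combination can be checked with a brute-force sketch. Exhaustive enumeration is used here only as a stand-in: the slide says the handler searches the space efficiently but does not show its search algorithm. The names `best_mapping` and `at_most_one_address` are illustrative.

```python
from itertools import product

# Element-level predictions from the Prediction Combiner (values from the slide).
predictions = {
    "area": {"address": 0.7, "description": 0.3},
    "sold-at": {"price": 0.9, "agent-phone": 0.1},
    "contact-agent": {"agent-phone": 0.9, "description": 0.1},
    "extra-info": {"address": 0.6, "description": 0.4},
}

def at_most_one_address(mapping):
    """Domain constraint: at most one source element matches address."""
    return sum(1 for label in mapping.values() if label == "address") <= 1

def best_mapping(predictions, constraints):
    """Return the highest-scoring combination that satisfies every constraint."""
    elements = list(predictions)
    best, best_score = None, -1.0
    for labels in product(*(predictions[e] for e in elements)):
        mapping = dict(zip(elements, labels))
        if not all(check(mapping) for check in constraints):
            continue
        score = 1.0
        for element, label in mapping.items():
            score *= predictions[element][label]
        if score > best_score:
            best, best_score = mapping, score
    return best, best_score

unconstrained, unconstrained_score = best_mapping(predictions, [])
mapping, score = best_mapping(predictions, [at_most_one_address])
print(mapping, round(score, 4))
```

User feedback such as "sold-at does not match price" fits the same interface: it is one more predicate in the `constraints` list.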
![Page 22: AnHai Doan Database & Data Mining Group University of Washington, Seattle Spring 2002 Learning to Map between Structured Representations of Data](https://reader034.vdocuments.us/reader034/viewer/2022052701/56649dab5503460f94a9a598/html5/thumbnails/22.jpg)
22
The Current LSD System
Can also handle data in XML format
– matches XML DTDs
Base learners
– Naive Bayes [Duda&Hart-93, Domingos&Pazzani-97]
  – exploits frequencies of words & symbols
– WHIRL Nearest-Neighbor Classifier [Cohen&Hirsh KDD-98]
  – employs information-retrieval similarity metric
– Name Learner [SIGMOD-01]
  – matches elements based on their names
– County-Name Recognizer [SIGMOD-01]
  – stores all U.S. county names
– XML Learner [SIGMOD-01]
  – exploits hierarchical structure of XML data
![Page 23: AnHai Doan Database & Data Mining Group University of Washington, Seattle Spring 2002 Learning to Map between Structured Representations of Data](https://reader034.vdocuments.us/reader034/viewer/2022052701/56649dab5503460f94a9a598/html5/thumbnails/23.jpg)
23
Empirical Evaluation
Four domains
– Real Estate I & II, Course Offerings, Faculty Listings
For each domain
– created mediated schema & domain constraints
– chose five sources
– extracted & converted data into XML
– mediated schemas: 14 - 66 elements, source schemas: 13 - 48
Ten runs for each domain, in each run:
– manually provided 1-1 mappings for 3 sources
– asked LSD to propose mappings for remaining 2 sources
– accuracy = % of 1-1 mappings correctly identified
![Page 24: AnHai Doan Database & Data Mining Group University of Washington, Seattle Spring 2002 Learning to Map between Structured Representations of Data](https://reader034.vdocuments.us/reader034/viewer/2022052701/56649dab5503460f94a9a598/html5/thumbnails/24.jpg)
24
High Matching Accuracy
[Bar chart: average matching accuracy (%), 0 to 100, for Real Estate I, Real Estate II, Course Offerings, Faculty Listings]
LSD’s accuracy: 71 - 92%
Best single base learner: 42 - 72%
+ Meta-learner: + 5 - 22%
+ Constraint handler: + 7 - 13%
+ XML learner: + 0.8 - 6%
![Page 25: AnHai Doan Database & Data Mining Group University of Washington, Seattle Spring 2002 Learning to Map between Structured Representations of Data](https://reader034.vdocuments.us/reader034/viewer/2022052701/56649dab5503460f94a9a598/html5/thumbnails/25.jpg)
25
Contribution of Schema vs. Data
[Bar chart: average matching accuracy (%), 0 to 100, for Real Estate I, Real Estate II, Course Offerings, Faculty Listings; series: LSD with only schema info., LSD with only data info., Complete LSD]
More experiments in [Doan et al. SIGMOD-01]
![Page 26: AnHai Doan Database & Data Mining Group University of Washington, Seattle Spring 2002 Learning to Map between Structured Representations of Data](https://reader034.vdocuments.us/reader034/viewer/2022052701/56649dab5503460f94a9a598/html5/thumbnails/26.jpg)
26
LSD Summary
LSD
– learns from previous matching activities
– exploits multiple types of information
  – by employing multi-strategy learning
– incorporates domain constraints & user feedback
– achieves high matching accuracy
LSD focuses on 1-1 mappings
Next challenge: discover more complex mappings!
– COMAP (Complex Mapping) system
![Page 27: AnHai Doan Database & Data Mining Group University of Washington, Seattle Spring 2002 Learning to Map between Structured Representations of Data](https://reader034.vdocuments.us/reader034/viewer/2022052701/56649dab5503460f94a9a598/html5/thumbnails/27.jpg)
27
The COMAP Approach
Mediated schema: price, num-baths, address

homes.com

| listed-price | agent-id | full-baths | half-baths | city | zipcode |
|---|---|---|---|---|---|
| 320K | 53211 | 2 | 1 | Seattle | 98105 |
| 240K | 11578 | 1 | 1 | Miami | 23591 |

For each mediated-schema element
– searches space of all mappings
– finds a small set of likely mapping candidates
– uses LSD to evaluate them
To search efficiently
– employs a specialized searcher for each element type
– Text Searcher, Numeric Searcher, Category Searcher, ...
![Page 28: AnHai Doan Database & Data Mining Group University of Washington, Seattle Spring 2002 Learning to Map between Structured Representations of Data](https://reader034.vdocuments.us/reader034/viewer/2022052701/56649dab5503460f94a9a598/html5/thumbnails/28.jpg)
28
The COMAP Architecture [Doan et al., 02]
– inputs: mediated schema, source schema + data
– Searcher1, Searcher2, ..., Searcherk produce mapping candidates
– LSD (Base-Learner1 ... Base-Learnerk, Meta-Learner, Prediction Combiner) evaluates the candidates
– Constraint Handler, using domain constraints, produces mappings
![Page 29: AnHai Doan Database & Data Mining Group University of Washington, Seattle Spring 2002 Learning to Map between Structured Representations of Data](https://reader034.vdocuments.us/reader034/viewer/2022052701/56649dab5503460f94a9a598/html5/thumbnails/29.jpg)
29
An Example: Text Searcher
Beam search in space of all concatenation mappings
Example: find mapping candidates for address
Mediated schema: price, num-baths, address

homes.com

| listed-price | agent-id | full-baths | half-baths | city | zipcode |
|---|---|---|---|---|---|
| 320K | 532a | 2 | 1 | Seattle | 98105 |
| 240K | 115c | 1 | 1 | Miami | 23591 |

Candidate concatenations

| concat(agent-id,zipcode) | concat(city,zipcode) | concat(agent-id,city) |
|---|---|---|
| 532a 98105 | Seattle 98105 | 532a Seattle |
| 115c 23591 | Miami 23591 | 115c Miami |

Best mapping candidates for address
– (agent-id,0.7), (concat(agent-id,city),0.75), (concat(city,zipcode),0.9)
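A beam search over concatenation mappings can be sketched as follows. Everything about the scoring is an assumption: `toy_address_score` is a hand-written stand-in that rewards an alphabetic first token and a five-digit last token, whereas the slides indicate that COMAP scores candidates with learned models, so the scores below will not match the ones on the slide. The names `beam_search`, `beam_width`, and `max_columns` are also illustrative.

```python
import re

# The two homes.com rows from the slide.
rows = [
    {"listed-price": "320K", "agent-id": "532a", "full-baths": "2",
     "half-baths": "1", "city": "Seattle", "zipcode": "98105"},
    {"listed-price": "240K", "agent-id": "115c", "full-baths": "1",
     "half-baths": "1", "city": "Miami", "zipcode": "23591"},
]

def toy_address_score(values):
    """Hypothetical scorer: 0.5 for an alphabetic first token, 0.5 for a 5-digit last token."""
    total = 0.0
    for value in values:
        tokens = value.split()
        if tokens[0].isalpha():
            total += 0.5
        if re.fullmatch(r"\d{5}", tokens[-1]):
            total += 0.5
    return total / len(values)

def beam_search(rows, score, beam_width=2, max_columns=2):
    """Grow column concatenations one column at a time, keeping the top candidates."""
    columns = list(rows[0])
    scores = {}

    def evaluate(candidate):
        values = [" ".join(row[c] for c in candidate) for row in rows]
        scores[candidate] = score(values)

    beam = [(c,) for c in columns]
    for candidate in beam:
        evaluate(candidate)
    beam = sorted(beam, key=scores.get, reverse=True)[:beam_width]
    for _ in range(max_columns - 1):
        expansions = [cand + (c,) for cand in beam for c in columns if c not in cand]
        for candidate in expansions:
            evaluate(candidate)
        pool = list(dict.fromkeys(beam + expansions))  # drop duplicates, keep order
        beam = sorted(pool, key=scores.get, reverse=True)[:beam_width]
    return [(cand, scores[cand]) for cand in beam]

candidates = beam_search(rows, toy_address_score)
print(candidates[0])  # (('city', 'zipcode'), 1.0)
```

The beam keeps the search from enumerating every ordered subset of columns: only the best `beam_width` partial concatenations are extended at each step.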
![Page 30: AnHai Doan Database & Data Mining Group University of Washington, Seattle Spring 2002 Learning to Map between Structured Representations of Data](https://reader034.vdocuments.us/reader034/viewer/2022052701/56649dab5503460f94a9a598/html5/thumbnails/30.jpg)
30
Empirical Evaluation
Current COMAP system
– eight searchers
Three real-world domains
– in real estate & product inventory
– mediated schema: 6 - 26 elements, source schema: 16 - 31
Accuracy: 62 - 97%
Sample discovered mappings
– agent-name = concat(first-name,last-name)
– area = building-area / 43560
– discount-cost = (unit-price * quantity) * (1 - discount)
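The sample mappings above are simple enough to write out as functions. Two details are assumptions rather than facts from the slide: that concat joins the two names with a space, and that 43560 converts square feet to acres (there are 43,560 square feet in an acre, which makes this the likely reading).

```python
def agent_name(first_name, last_name):
    # agent-name = concat(first-name, last-name); the space separator is assumed.
    return f"{first_name} {last_name}"

def area(building_area):
    # area = building-area / 43560, read here as square feet to acres.
    return building_area / 43560

def discount_cost(unit_price, quantity, discount):
    # discount-cost = (unit-price * quantity) * (1 - discount)
    return (unit_price * quantity) * (1 - discount)

print(agent_name("Jane", "Brown"))   # Jane Brown
print(area(87120))                   # 2.0
print(discount_cost(10.0, 3, 0.5))   # 15.0
```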
![Page 31: AnHai Doan Database & Data Mining Group University of Washington, Seattle Spring 2002 Learning to Map between Structured Representations of Data](https://reader034.vdocuments.us/reader034/viewer/2022052701/56649dab5503460f94a9a598/html5/thumbnails/31.jpg)
31
Road Map
Introduction
Schema matching
– LSD system
Creating complex mappings
– COMAP system
Ontology matching
– GLUE system
Conclusions
![Page 32: AnHai Doan Database & Data Mining Group University of Washington, Seattle Spring 2002 Learning to Map between Structured Representations of Data](https://reader034.vdocuments.us/reader034/viewer/2022052701/56649dab5503460f94a9a598/html5/thumbnails/32.jpg)
32
Ontology Matching
Increasingly critical for
– knowledge bases, Semantic Web
An ontology
– concepts organized into a taxonomy tree
– each concept has
  – a set of attributes
  – a set of instances
– relations among concepts
Matching
– concepts
– attributes
– relations
Example taxonomy: CS Dept. US
– Entity
  – Undergrad Courses
  – Grad Courses
  – People
    – Staff
    – Faculty
      – Assistant Professor
      – Associate Professor
      – Professor
Example instance: name: Mike Burns, degree: Ph.D.
![Page 33: AnHai Doan Database & Data Mining Group University of Washington, Seattle Spring 2002 Learning to Map between Structured Representations of Data](https://reader034.vdocuments.us/reader034/viewer/2022052701/56649dab5503460f94a9a598/html5/thumbnails/33.jpg)
33
Matching Taxonomies of Concepts
CS Dept. US
– Entity
  – Undergrad Courses
  – Grad Courses
  – People
    – Staff
    – Faculty
      – Assistant Professor
      – Associate Professor
      – Professor
CS Dept. Australia
– Entity
  – Courses
  – Staff
    – Technical Staff
    – Academic Staff
      – Lecturer
      – Senior Lecturer
      – Professor
![Page 34: AnHai Doan Database & Data Mining Group University of Washington, Seattle Spring 2002 Learning to Map between Structured Representations of Data](https://reader034.vdocuments.us/reader034/viewer/2022052701/56649dab5503460f94a9a598/html5/thumbnails/34.jpg)
34
Constraints in Taxonomy Matching
Domain-dependent
– at most one node matches department-chair
– a node that matches professor cannot be a child of a node that matches assistant-professor
Domain-independent
– two nodes match if parents & children match
– if all children of X match Y, then X also matches Y
– variations have been exploited in many restricted settings [Melnik&Garcia-Molina, ICDE-02], [Milo&Zohar, VLDB-98], [Noy et al., IJCAI-01], [Madhavan et al., VLDB-01]
Challenge: find a general & efficient approach
![Page 35: AnHai Doan Database & Data Mining Group University of Washington, Seattle Spring 2002 Learning to Map between Structured Representations of Data](https://reader034.vdocuments.us/reader034/viewer/2022052701/56649dab5503460f94a9a598/html5/thumbnails/35.jpg)
35
Solution: Relaxation Labeling
Relaxation labeling [Hummel&Zucker, 83]
– applied to graph labeling in vision, NLP, hypertext classification
– finds best label assignment, given a set of constraints
– starts with initial label assignment
– iteratively improves labels, using constraints
Standard relaxation labeling not applicable
– extended it in many ways [Doan et al., WWW-02]
Experiments
– three real-world domains in course catalog & company listings
– 30 - 300 nodes / taxonomy
– accuracy 66 - 97% vs. 52 - 83% of best base learner
– relaxation labeling very fast (under a few seconds)
![Page 36: AnHai Doan Database & Data Mining Group University of Washington, Seattle Spring 2002 Learning to Map between Structured Representations of Data](https://reader034.vdocuments.us/reader034/viewer/2022052701/56649dab5503460f94a9a598/html5/thumbnails/36.jpg)
36
Related Work

| Systems | Technique | Information exploited | Mapping type |
|---|---|---|---|
| TRANSCM [Milo&Zohar98], ARTEMIS [Castano&Antonellis99], [Palopoli et al. 98], CUPID [Madhavan et al. 01], PROMPT [Noy et al. 00] | Hand-crafted rules | Schema | 1-1 mapping |
| SEMINT [Li&Clifton94], ILA [Perkowitz&Etzioni95], DELTA [Clifton et al. 97] | Single learner | Data | 1-1 mapping |
| CLIO [Miller et al., 00], [Yan et al. 01] | Rules | Data | 1-1 + complex mapping |
| LSD [Doan et al., SIGMOD-01], COMAP [Doan et al. 2002, submitted], GLUE [Doan et al., WWW-02] | Learners + rules, use multi-strategy learning | Schema + data, domain constraints | 1-1 + complex mapping |
![Page 37: AnHai Doan Database & Data Mining Group University of Washington, Seattle Spring 2002 Learning to Map between Structured Representations of Data](https://reader034.vdocuments.us/reader034/viewer/2022052701/56649dab5503460f94a9a598/html5/thumbnails/37.jpg)
37
Future Work
Learning source descriptions
– formal semantics for mapping
– query capabilities, source schema, scope, reliability of data, ...
Dealing with changes in source description
Matching objects across sources
More sophisticated user feedback
Focus on distributed information management systems
– data integration, web-service integration, peer data management
– goal: significantly reduce complexity of construction & maintenance
![Page 38: AnHai Doan Database & Data Mining Group University of Washington, Seattle Spring 2002 Learning to Map between Structured Representations of Data](https://reader034.vdocuments.us/reader034/viewer/2022052701/56649dab5503460f94a9a598/html5/thumbnails/38.jpg)
38
Conclusions
Efficiently creating semantic mappings is critical
Developed solution for semi-automatic schema matching
– learns from previous matching activities
– can match relational schemas, DTDs, ontologies, ...
– discovers both 1-1 & complex mappings
– highly modular & extensible
– achieves high matching accuracy
Made contributions to machine learning
– developed novel method to classify XML data
– extended relaxation labeling
![Page 39: AnHai Doan Database & Data Mining Group University of Washington, Seattle Spring 2002 Learning to Map between Structured Representations of Data](https://reader034.vdocuments.us/reader034/viewer/2022052701/56649dab5503460f94a9a598/html5/thumbnails/39.jpg)
39
Backup Slides
![Page 40: AnHai Doan Database & Data Mining Group University of Washington, Seattle Spring 2002 Learning to Map between Structured Representations of Data](https://reader034.vdocuments.us/reader034/viewer/2022052701/56649dab5503460f94a9a598/html5/thumbnails/40.jpg)
40
Training the Meta-Learner
Base-learner confidence scores and true predictions, for address:

| Extracted XML instance | Name Learner | Naive Bayes | True prediction |
|---|---|---|---|
| `<location> Miami, FL</>` | 0.5 | 0.8 | 1 |
| `<listed-price> $250,000</>` | 0.4 | 0.3 | 0 |
| `<area> Seattle, WA </>` | 0.3 | 0.9 | 1 |
| `<house-addr>Kent, WA</>` | 0.6 | 0.8 | 1 |
| `<num-baths>3</>` | 0.3 | 0.3 | 0 |
| ... | ... | ... | ... |

Least-Squares Linear Regression
– Weight(Name-Learner,address) = 0.1
– Weight(Naive-Bayes,address) = 0.9
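The regression step can be sketched directly from the normal equations. This is a minimal illustration under stated assumptions: no intercept term, exactly two base learners, and only the five rows visible on the slide. The slide elides further rows ("..."), so this fit is not expected to reproduce the weights 0.1 and 0.9 shown there; `fit_two_weights` is a hypothetical helper name.

```python
# (Name Learner score, Naive Bayes score, true prediction for address)
rows = [
    (0.5, 0.8, 1),
    (0.4, 0.3, 0),
    (0.3, 0.9, 1),
    (0.6, 0.8, 1),
    (0.3, 0.3, 0),
]

def fit_two_weights(rows):
    """Least squares for y ~ w1*x1 + w2*x2 via the 2x2 normal equations."""
    a = sum(x1 * x1 for x1, _, _ in rows)
    b = sum(x1 * x2 for x1, x2, _ in rows)
    d = sum(x2 * x2 for _, x2, _ in rows)
    p = sum(x1 * y for x1, _, y in rows)
    q = sum(x2 * y for _, x2, y in rows)
    det = a * d - b * b
    # Cramer's rule on [[a, b], [b, d]] @ [w1, w2] = [p, q].
    return (p * d - b * q) / det, (a * q - b * p) / det

w_name, w_bayes = fit_two_weights(rows)
print(w_name, w_bayes)
```

Even on this fragment the fit gives Naive Bayes the larger weight, which agrees in direction with the slide: for address, the data-based learner is the more reliable of the two.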
![Page 41: AnHai Doan Database & Data Mining Group University of Washington, Seattle Spring 2002 Learning to Map between Structured Representations of Data](https://reader034.vdocuments.us/reader034/viewer/2022052701/56649dab5503460f94a9a598/html5/thumbnails/41.jpg)
41
Sensitivity to Amount of Available Data
[Line chart: average matching accuracy (%), 40 to 100, vs. number of data listings per source, 0 to 500 (Real Estate I)]
![Page 42: AnHai Doan Database & Data Mining Group University of Washington, Seattle Spring 2002 Learning to Map between Structured Representations of Data](https://reader034.vdocuments.us/reader034/viewer/2022052701/56649dab5503460f94a9a598/html5/thumbnails/42.jpg)
42
Contribution of Each Component
[Bar chart: average matching accuracy (%), 0 to 100, for Real Estate I, Course Offerings, Faculty Listings, Real Estate II; series: Without Name Learner, Without Naive Bayes, Without Whirl Learner, Without Constraint Handler, The complete LSD system]
![Page 43: AnHai Doan Database & Data Mining Group University of Washington, Seattle Spring 2002 Learning to Map between Structured Representations of Data](https://reader034.vdocuments.us/reader034/viewer/2022052701/56649dab5503460f94a9a598/html5/thumbnails/43.jpg)
43
Exploiting Hierarchical Structure
Existing learners flatten out all structures
Developed XML learner
– similar to the Naive Bayes learner
  – input instance = bag of tokens
– differs in one crucial aspect
  – considers not only text tokens, but also structure tokens
<description> Victorian house with a view. Name your price! To see it, contact Gail Murphy at MAX Realtors.</description>
<contact> <name> Gail Murphy </name> <firm> MAX Realtors </firm></contact>
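One way to picture "structure tokens" is to emit a token for each element name alongside the text tokens. The sketch below is an assumed, simplified encoding for illustration; the slides do not specify how the XML learner builds its structure tokens, and `bag_of_tokens` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

def bag_of_tokens(xml_text):
    """Collect text tokens plus one structure token per element (assumed encoding)."""
    root = ET.fromstring(xml_text)
    tokens = []

    def visit(node):
        tokens.append(f"<{node.tag}>")  # structure token for this element
        if node.text and node.text.strip():
            tokens.extend(node.text.split())
        for child in node:
            visit(child)
            if child.tail and child.tail.strip():
                tokens.extend(child.tail.split())

    visit(root)
    return tokens

contact = "<contact> <name> Gail Murphy </name> <firm> MAX Realtors </firm></contact>"
print(bag_of_tokens(contact))
# ['<contact>', '<name>', 'Gail', 'Murphy', '<firm>', 'MAX', 'Realtors']
```

A flat bag of words sees "Gail Murphy" and "MAX Realtors" in both the description and the contact element; the extra `<name>` and `<firm>` tokens are what let a Naive Bayes style learner tell the two apart.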
![Page 44: AnHai Doan Database & Data Mining Group University of Washington, Seattle Spring 2002 Learning to Map between Structured Representations of Data](https://reader034.vdocuments.us/reader034/viewer/2022052701/56649dab5503460f94a9a598/html5/thumbnails/44.jpg)
44
Reasons for Incorrect Matchings
Unfamiliarity
– suburb
– solution: add a suburb-name recognizer
Insufficient information
– correctly identified general type, failed to pinpoint exact type

| agent-name | phone |
|---|---|
| Richard Smith | (206) 234 5412 |

– solution: add a proximity learner
Subjectivity
– house-style = description?

| house-style | description |
|---|---|
| Victorian | Beautiful neo-gothic house |
| Mexican | Great location |
![Page 45: AnHai Doan Database & Data Mining Group University of Washington, Seattle Spring 2002 Learning to Map between Structured Representations of Data](https://reader034.vdocuments.us/reader034/viewer/2022052701/56649dab5503460f94a9a598/html5/thumbnails/45.jpg)
45
Evaluate Mapping Candidates
For address, Text Searcher returns
– (agent-id,0.7)
– (concat(agent-id,city),0.8)
– (concat(city,zipcode),0.75)
Employ multi-strategy learning to evaluate mappings
Example: (concat(agent-id,city),0.8)
– Naive Bayes Learner: 0.8
– Name Learner: “address” vs. “agent id city” 0.3
– Meta-Learner: 0.8 * 0.7 + 0.3 * 0.3 = 0.65
Meta-Learner returns
– (agent-id,0.59)
– (concat(agent-id,city),0.65)
– (concat(city,zipcode),0.70)
![Page 46: AnHai Doan Database & Data Mining Group University of Washington, Seattle Spring 2002 Learning to Map between Structured Representations of Data](https://reader034.vdocuments.us/reader034/viewer/2022052701/56649dab5503460f94a9a598/html5/thumbnails/46.jpg)
46
Relaxation Labeling
[Figure: taxonomies of Dept U.S. and Dept Australia, with node labels Courses, People, Staff, Faculty, Tech. Staff, Acad. Staff]
Applied to similar problems in
– vision, NLP, hypertext classification
![Page 47: AnHai Doan Database & Data Mining Group University of Washington, Seattle Spring 2002 Learning to Map between Structured Representations of Data](https://reader034.vdocuments.us/reader034/viewer/2022052701/56649dab5503460f94a9a598/html5/thumbnails/47.jpg)
47
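Relaxation Labeling for Taxonomy Matching
Must define
– neighborhood of a node
– k features of neighborhood
– how to combine influence of features: $P(N = L \mid f_1, f_2, \ldots, f_k)$

$$P(N = L \mid \Delta) = \sum_{M} P(N = L \mid M, \Delta) \cdot P(M \mid \Delta)$$

Here $M$ ranges over neighborhood configurations. The conditioning symbol was lost in extraction; $\Delta$ (the knowledge available to the matcher) is assumed.
Algorithm
– init: for each pair <N,L>, compute $P(N = L \mid \Delta)$
– loop: for each pair <N,L>, re-compute $P(N = L \mid \Delta)$
Neighborhood configuration
– Acad. Staff: Faculty, Tech. Staff: Staff, Staff = People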
![Page 48: AnHai Doan Database & Data Mining Group University of Washington, Seattle Spring 2002 Learning to Map between Structured Representations of Data](https://reader034.vdocuments.us/reader034/viewer/2022052701/56649dab5503460f94a9a598/html5/thumbnails/48.jpg)
48
Relaxation Labeling for Taxonomy Matching
Huge number of neighborhood configurations!
– typically neighborhood = immediate nodes
– here neighborhood can be entire graph
– 100 nodes, 10 labels => $10^{100}$ configurations
Solution
– label abstraction + dynamic programming
– guarantee quadratic time for a broad range of domain constraints
Empirical evaluation
– GLUE system [Doan et al., WWW-02]
– three real-world domains
– 30 - 300 nodes / taxonomy
– high accuracy 66 - 97% vs. 52 - 83% of best base learner
– relaxation labeling very fast, finished in several seconds