imap: discovering complex semantic matches between database schemas robin dhamankar, yoonkyong lee,...
Post on 16-Jan-2016
214 Views
Preview:
TRANSCRIPT
1
iMAP: Discovering Complex Semantic Matchesbetween Database Schemas
Robin Dhamankar, Yoonkyong Lee, AnHai Doan
University of Illinois, Urbana-Champaign
Alon Halevy, Pedro DomingosUniversity of Washington
2
Semantic Heterogeneity
• Semantics refer to the meaning of data in contrast to syntax, which only defines the structure of the schema items (e.g., classes and attributes).
• Semantic heterogeneity is a general term referring to disagreement about the meaning, interpretation or intended use of the same or related data.
• One of the toughest problems in Data Integration.
3
Schema Matching
• A match between two schemas specifies semantic correspondences between elements of both schemas.
• A complex match specifies that a combination of attributes in one schema corresponds to a combination in the other.
4
Aim• Examine an unbounded number of match
candidates for complex matching between data sources.
• Describe an iMAP system which semi-automatically discovers both 1-1 and complex matches for relational data (database schemas in this paper). The system uses following steps:
o Generating Matcheso Exploiting Domain Knowledgeo Explaining Match Prediction
5
Content• An architecture for semi-automatically discovering
complex matches that combines search through a set of candidate matches and methods for evaluating each match in isolation, and a set of matches as a whole.
• Uses of new kinds of domain knowledge (overlap data and mining external data), and applying the knowledge as early as possible in the matching process.
• A mechanism for explaining the decisions made by the matching system.
• The iMAP system which embodies all these innovations, and a set of experiments on real-world.
6
Problem Definition• consider the two relational schemas S and T in
Figure 1. Both databases store house listings and are managed by two di®erent real-estate companies. The schema of database T, for example, has one table, LISTINGS, whereas database S stores its data in two tables, HOUSES and AGENT.
• Suppose the two real-estate companies have decided to merge. To cut costs, they eliminate database S by transferring all house listings from S to database T.
7
Jean Laup Mike Brown
Semantic Mapping• Specify relationships between schema elements
of disparate data sources.
• Example:
Table S1
32 15
city state
price ($)
Athens GA 360,000 Raleigh
NC 360,000
agent-id
3215
agent commission
0.03 0.04
id
Table S2
locationTable T listed-price
agent-name
Denver, CO
550,000 Laura Smith
Atlanta, GA
370,800 Mike Brown
Creating Semantic Mappings
• Step1: Generating matches
8
• Step2: Elaborating matches
into mappings
location=concat(city,state)listed-price=price*(1+commission)agent-name=agent
location = SELECT concat(city,state) FROM S1
listed-price = SELECT price*(1+commission)
FROM S1, S2 WHERE agent-id = id
agent-name = SELECT agent FROM S2
Table S1
city state
price ($)Athens GA 360,000
Raleigh NC 360,000
agent-id3215
3215 Jean Laup
Mike Brown
agent commission
0.030.04
id
Table S2
location
Target Tlisted-price
agent-name
Denver, CO 550,000 Laura Smith
Atlanta, GA 370,800 Mike Brown
Generating Matches
• 1-1 matchesagent-name=agent
• Complex matcheslocation=concat(city,state)listed-price=price*(1+commission)
• The experiments in this paper contain 30-50% of complex matches.
9
Illustrative Example
10
Finding the best global assigment
location = concat(city,state) listed-price=price*(1+commission) agent-name=agent
Estimating similarity
Generating candidate matches
city state price
agent
commission
Table S
location listed-price
Table T
agent-name
sim(location,concat(city,state))=0.96
sim(listed-price,price(1+commision))=0.97
sim(location,city)=0.76 …
…
sim(listed-price,price)=0.86…
…
sim(agent-name,agent)=0.95
location=concat(city,state)listed-price=price(1+commision)
location=city …
…
listed-price=price…
…
agent-name=agent…
“location” and “listed-price” are unrelated“city” and “price” are unrelated
iMAP Architecture
11
Match selector
1-1 and complex matches
Match candidates
Explanationmodule
User
Domain knowledgeand data
Target schema T and source schema S
Similarity matrix
Similarity estimator
Match generator
Match Generator• The space of candidate matches is huge
12
• Solution: multi-searcher strategy– employing a set of special-purpose searchers– each searcher examines only a specific portion of search
space– e.g., text searcher, numeric searcher, …
• iMAP is highly extensible– new searchers can be added as developed
location=citylocation=stat
e…location=concat(city,state)location=concat(city,agent-
name)…
listed-price=pricelisted-
price=commission…listed-
price=price*(1+commision)listed-price=price*(1-commision)…
location=price*(1+commision)
…location=price*(1-commision)
listed-price=concat(city,state)
…listed-price=concat(city,agent-
name)
13
Default Search Strategy
• Use beam searcho at each search level, keep k best match
candidates
• Example:
concat(city,price) concat(city,agent-name)concat(city,state)
city state
price ($)
commission
Find all concatenation candidate matches for location (k=3)
…
… …
agent-name
Default Search Strategy
• Use beam searcho at each search level, keep k-best match candidates
• Example:
14
concat(city,state)
city state
price ($)
commission
Find all concatenation candidate matches for location (k=3)
…
concat(city,state,price)concat(city,state,agent-name) …
… …
agent-name
concat(city,price) concat(city,agent-name)
Scoring Functions
• Each searcher has its own scoring function.o can use machine learning, statistics, heuristics, etc.
• E.g., current text searcher uses Naive Bayes classification.
• To compute score between location & concat(city,state)o build a Naive Bayes classifier for location.o Use classifier to compute how similar
each instance of concat(city,state) to locationo average similarity value is the desired score.
15
Termination Condition
• Even with beam search, search space can still be huge or infiniteo need to find effective termination condition
• Currently: terminate when see diminishing returns
16
Best candidate match at the 2nd
iterationf(concat(city,state))f(concat(city,state))-| |
<= 0.03thresholdscoring
function
Best candidate match at the 3rd
iteration
Implemented Searchers
Searcher Space of candidates Example matches
Text Text attributes at the source schemaname = concat(first-name, last-name)
NumericUser supplied matches or past complex matches
list-price = price * (1+tax-rate)
Category Attributes with less than t distinct valuesproduct-categories = product-types
Schema mismatch
Source attribute containing target schema infofireplace = 1 if house-desc has “fireplace”
Unit conversion
Physical quantity attributes weight-kg = 2.2* net-weight-pounds
Date Columns recognized as ontology nodesbirth-date = b-day / b-month / b-year
Overlap numeric
Specified by a context free grammarinterest-earned = balance * interest-rate
17
Exploiting Domain Knowledge
• Exploit new kinds of domain knowledge - overlap data and external datao Overlap data: data from different sources that represent
a same entity• tuples from real estate listings that represent same house
o External data: data not in the two input schemas• “# of real estate agents <= 50” mined from real estate
listings
• Use domain knowledge as early as possible
18
Generating Explanations
• Motivating example 2:
19
• Motivating example 1:
num-rooms=bath-rooms + bed-rooms + dining-rooms: 0.434
…
Correct match:num-rooms = bath-rooms + bed-rooms + dining-rooms + living-rooms
num-rooms=bath-rooms + bed-rooms + living-rooms: 0.432
name=last-name: 0.434name=concat(first-name,last-name): 0.420
…
Correct match:name=concat(first-name,last-name)
Types of Questions Users Can
Ask
• Existence of a matcho why is month-posted=monthly-fee-rate generated?
• Non-existence of a matcho why is num-rooms= bath-rooms+bed-
rooms+dining-rooms+ living-rooms not generated?
• Ranking of a matcho why is name=last-name ranked higher than
name=concat(first-name,last-name)?
20
Generating Explanation
using Dependency Graph
21
list-price=pricelist-price=price(1 + monthly-fee-rate)
list-price = price(1+ monthly-fee-rate)score = 0.76
month-postedis unrelated to
list-price
month-posted = monthly-fee-rate score = 0.67
Match selector
month-posted = monthly-fee-rate score = 0.79
month-posted = monthly-fee-rate score = 0.55
Combining module
month-posted = monthly-fee-rateNaïve Bayes evaluatorName based evaluator
monthly-fee-rate is a monthmonth-posted is a month.
Date searcher
PreprocessorPreprocessor
If data in a column is in (1..12) it’s a month
Assumption Assumption
Source ColumnConstraint Target Column
Match List
Constraint
Candidate
Candidate
Candidate
Candidate
Candidate Candidatelist-price = price
score = 0.63
month-posted
monthly-fee-rate3
2…
910…
Sample Explanation
22
USER: Why num-rooms = bath-rooms + bed-rooms + dining-rooms + living-rooms does not appear?
iMAP:
(1) Overlap numeric matcher CAN generate bath-rooms + bed-rooms + dining-rooms + living-rooms for num-rooms.
(2) Overlap numeric matcher DID NOT generate it.
(3) A reason: the match has length of 4 terms. overlap numeric searcher has considered only candidates of length up to 3 terms.
(4) Characteristics of the search space for num-rooms: a. Number of considered numeric attributes: 7 b. Considered numeric attributes: building-area lot-dimension1 lot-dimension2 bath-rooms bed-rooms dining-rooms living-rooms c. Used successor functions: Addition Multiplication d. Max. number of terms: 3 e. Max. number of elements in a term: 3
Empirical Evaluation• Domains & data sources
o four domains: Real Estate, Inventory, Cricket, and Financial Wizard
o avg. # of matches: 30o obtain data from several sources
• Internet, Microsoft Access sample databases, students in a large database class
• Methodology and performance measureo run experiments with several configurations on
eight experiment setso top-1 matching accuracy = % of target attributes
whose top-1 matches are correct
23
Complex Matching Accuracy
1-1 matching accuracy: 77-100%
24
0
20
40
60
80
100
Real Estate Inventory Cricket FinancialWizard
Matc
hin
g a
ccu
racy
(%
)
Complex matching accuracy: 50 - 86%
Default system: 33 - 55%+ Domain constraints: 50 - 55%+ Overlap data: 33 - 71%+ Domain constraints + Overlap data: 50 - 86%
Complex Matching Accuracy
for Disjoint Domains
25
Complex matching accuracy: 27 - 58%
0
20
40
60
80
100
Real Estate Inventory Cricket FinancialWizard
Default system: 18 - 58%
+ Domain constraints: 27 - 58%
Matc
hin
g a
ccu
racy
(%
)
Why not 100%?• Hard to find smaller components of a complex
matchagent-address = concat(street-name,city,state), where apt-number is missed
• Hard to remove small noise components froma complex match
phone-number = concat(agent-id,area-code,number),where agent-id is noise (a single digit number)
• Many correct complex matches are in the top three, but not in the top one
26
Summary & Future Work
• Complex matches are pervasiveo finding them is crucial for practical matching solutions
• Described the iMAP solutiono discovers both complex and 1-1 matcheso is highly extensible o exploits several kinds of domain knowledge o can explain produced matches
• Experiments show the promise of iMAP.
• Future worko apply iMAP to other data representations (e.g., XML)o combine iMAP-style techniques with Clio-style user
interaction 27
28
QUESTIONS?
29
top related